
WO2024163481A1 - A method and an apparatus for encoding/decoding at least one part of an image using multi-level context model - Google Patents


Info

Publication number
WO2024163481A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
context
entropy
hyperprior
encoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2024/013557
Other languages
French (fr)
Inventor
Syed Mateen UL HAQ
Fabien Racape
Hyomin CHOI
Wei Jiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
InterDigital VC Holdings Inc
Original Assignee
InterDigital VC Holdings Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by InterDigital VC Holdings Inc filed Critical InterDigital VC Holdings Inc
Priority to EP24710246.0A priority Critical patent/EP4659447A1/en
Priority to CN202480007920.8A priority patent/CN120642331A/en
Publication of WO2024163481A1 publication Critical patent/WO2024163481A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Definitions

  • JVET Joint Video Exploration Team
  • ISO/MPEG International Organization for Standardization / Moving Picture Experts Group
  • ITU International Telecommunication Union
  • NN-based algorithms that are used to enhance or speed-up an encoder of an existing codec.
  • any existing standard can be used.
  • Methods that are built on top of existing standards replace one or more modules of existing state-of-the-art codecs with NN-based methods, e.g., post-filters, prediction modules, etc.
  • End-to-end NN-based codecs depart completely from traditional compression schemes that include prediction, transform, quantization and entropy coding modules. Contrary to traditional methods, which apply pre-defined prediction modes and transforms, NN-based methods rely on many parameters that are learned on a large dataset during a training stage, by iteratively minimizing a loss function. In the case of image and video compression, the loss function is defined by a rate-distortion cost, where the rate stands for an estimation of the bitrate of an encoded bitstream, and the distortion quantifies the quality of a decoded video against the original input.
  • At least one of the present embodiments generally relates to a method or an apparatus in the context of the compression of images and videos using neural networks. At least one of the present embodiments generally relates to multi-resolution context prediction for encoding or decoding a latent representative of an image or a video. Some embodiments relate to a method for encoding a tensor or latent representative of at least one part of an image, the encoding comprising multi-resolution context prediction for determining entropy parameters.
  • Some embodiments relate to a method for decoding a latent representative of at least one part of an image, the decoding comprising multi-resolution context prediction for determining entropy parameters.
  • the methods comprise for at least one sample of a first tensor representative of at least one part of an image, obtaining a first context from at least one or more samples of at least one second tensor, the at least second tensor being obtained from at least one down-sampling of the first tensor, determining at least one entropy parameter based on the first context, entropy encoding or decoding the first tensor using the determined at least one entropy parameter.
  • an apparatus comprises a processor.
  • the processor can be configured to implement the general aspects by executing any of the described methods.
  • a device comprising an apparatus configured to implement the general aspects by executing any of the described embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including a video or an image, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video or image, or (iii) a display configured to display an output representative of the video or image.
  • a signal carrying coded data representative of at least one part of an image or a video encoded according to any one of the embodiments described herein.
  • FIG.1 illustrates a block diagram of a system within which aspects of the present embodiments may be implemented.
  • FIG. 2 illustrates a block diagram of an embodiment of an auto-encoder for image or video compression.
  • FIG.3 illustrates a block diagram of an embodiment of a training of an auto-encoder for image or video compression.
  • FIG. 4 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture.
  • FIG. 5 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction.
  • FIG. 6 illustrates an example of a method for predicting a context at different resolutions of a tensor, according to an embodiment.
  • FIG. 7 illustrates an example of a method for encoding a tensor using a context predicted at different resolutions according to an embodiment.
  • FIG.8 illustrates an example of effective masking positions for decoding a particular sample of a tensor to reconstruct, according to an embodiment.
  • FIG.9 illustrates an example of a method for encoding a latent representative of at least one part of an image according to an embodiment.
  • FIG.10 illustrates an example of a method for decoding a latent representative of at least one part of an image according to an embodiment.
  • FIG. 11 illustrates an example of a method for encoding a tensor using a context predicted at different resolutions according to another embodiment.
  • FIG.12 illustrates a block diagram of an auto-encoder with a hyperprior architecture and context prediction at different resolution, according to an embodiment.
  • FIG. 13 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction at different resolution, according to another embodiment.
  • FIG.14 illustrates one embodiment of an apparatus for encoding or decoding an image or a video according to any one of the embodiments described herein.
  • FIG.15 shows two remote devices communicating over a communication network in accordance with an example of present principles.
  • FIG.16 shows the syntax of a signal in accordance with an example of present principles.
  • DETAILED DESCRIPTION This application describes a variety of aspects, including tools, features, embodiments, models, approaches, etc. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that may sound limiting. However, this is for purposes of clarity in description, and does not limit the application or scope of those aspects. Indeed, all of the different aspects can be combined and interchanged to provide further aspects. Moreover, the aspects can be combined and interchanged with aspects described in earlier filings as well. The aspects described and contemplated in this application can be implemented in many different forms; the FIGs. described below provide some examples.
  • At least one of the aspects generally relates to image or video encoding and decoding, and at least one other aspect generally relates to transmitting a bitstream generated or encoded.
  • These and other aspects can be implemented as a method, an apparatus, a computer readable storage medium having stored thereon instructions for encoding or decoding image or video data according to any one of the methods described, and/or a computer readable storage medium having stored thereon a bitstream generated according to any one of the methods described.
  • FIG. 1 illustrates a block diagram of an example of a system in which various aspects and embodiments can be implemented.
  • System 100 may be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this application. Examples of such devices, include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. Elements of system 100, singly or in combination, may be embodied in a single integrated circuit, multiple ICs, and/or discrete components. For example, in at least one embodiment, the processing and encoder/decoder elements of system 100 are distributed across multiple ICs and/or discrete components.
  • the system 100 is communicatively coupled to other systems, or to other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports.
  • the system 100 is configured to implement one or more of the aspects described in this application.
  • the system 100 includes at least one processor 110 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this application.
  • Processor 110 may include embedded memory, input output interface, and various other circuitries as known in the art.
  • the system 100 includes at least one memory 120 (e.g., a volatile memory device, and/or a non-volatile memory device).
  • System 100 includes a storage device 140, which may include non-volatile memory and/or volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive.
  • the storage device 140 may include an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples.
  • System 100 includes an encoder/decoder module 130 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 130 may include its own processor and memory.
  • the encoder/decoder module 130 represents module(s) that may be included in a device to perform the encoding and/or decoding functions.
  • a device may include one or both of the encoding and decoding modules.
  • encoder/decoder module 130 may be implemented as a separate element of system 100 or may be incorporated within processor 110 as a combination of hardware and software as known to those skilled in the art.
  • the encoder/decoder module 130 is a NN-based auto-encoder, e.g. an auto-encoder or a variational auto-encoder described in relation with FIG. 2-4, and implements one or more embodiments of the transform block as further described below.
  • the transform block described in the embodiments is called a multi-resolution transform block for clarity; other wording can be used without limiting the scope of the embodiments described herein.
  • Program code to be loaded onto processor 110 or encoder/decoder 130 to perform the various aspects described in this application may be stored in storage device 140 and subsequently loaded onto memory 120 for execution by processor 110.
  • one or more of processor 110, memory 120, storage device 140, and encoder/decoder module 130 may store one or more of various items during the performance of the processes described in this application. Such stored items may include, but are not limited to, the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
  • memory inside of the processor 110 and/or the encoder/decoder module 130 is used to store instructions and to provide working memory for processing that is needed during encoding or decoding.
  • a memory external to the processing device (for example, the processing device can be either the processor 110 or the encoder/decoder module 130) is used for one or more of these functions.
  • the external memory can be the memory 120 and/or the storage device 140, for example, a dynamic volatile memory and/or a non-volatile flash memory.
  • an external non-volatile flash memory is used to store the operating system of, for example, a television.
  • a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations.
  • the input to the elements of system 100 can be provided through various input devices as indicated in block 105.
  • Such input devices include, but are not limited to, (i) a radio frequency (RF) portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Component (COMP) input terminal (or a set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal.
  • RF radio frequency
  • COMP Component
  • USB Universal Serial Bus
  • HDMI High Definition Multimedia Interface
  • the input devices of block 105 have associated respective input processing elements as known in the art.
  • the RF portion may be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which may be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets.
  • the RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers.
  • the RF portion may include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband.
  • the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band.
  • Adding elements may include inserting elements in between existing elements, for example, inserting amplifiers and an analog-to-digital converter.
  • the RF portion includes an antenna.
  • the USB and/or HDMI terminals may include respective interface processors for connecting system 100 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, may be implemented, for example, within a separate input processing IC or within processor 110 as necessary. Similarly, aspects of USB or HDMI interface processing may be implemented within separate interface ICs or within processor 110 as necessary.
  • the demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 110, and encoder/decoder 130 operating in combination with the memory and storage elements to process the data-stream as necessary for presentation on an output device.
  • Various elements of system 100 may be provided within an integrated housing. Within the integrated housing, the various elements may be interconnected and transmit data therebetween using suitable connection arrangement 115, for example, an internal bus as known in the art, including the Inter-IC (I2C) bus, wiring, and printed circuit boards.
  • the system 100 includes communication interface 150 that enables communication with other devices via communication channel 190.
  • the communication interface 150 may include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 190.
  • the communication interface 150 may include, but is not limited to, a modem or network card and the communication channel 190 may be implemented, for example, within a wired and/or a wireless medium.
  • Data is streamed, or otherwise provided, to the system 100, in various embodiments, using a wireless network such as a Wi-Fi network, for example IEEE 802.11 (IEEE refers to the Institute of Electrical and Electronics Engineers).
  • IEEE 802.11 IEEE refers to the Institute of Electrical and Electronics Engineers.
  • the Wi-Fi signal of these embodiments is received over the communications channel 190 and the communications interface 150 which are adapted for Wi-Fi communications.
  • the communications channel 190 of these embodiments is typically connected to an access point or router that provides access to outside networks including the Internet for allowing streaming applications and other over-the-top communications.
  • Some embodiments provide streamed data to the system 100 using a set-top box that delivers the data over the HDMI connection of the input block 105. Still other embodiments provide streamed data to the system 100 using the RF connection of the input block 105. As indicated above, various embodiments provide data in a non-streaming manner. Additionally, various embodiments use wireless networks other than Wi-Fi, for example a cellular network or a Bluetooth network.
  • the system 100 can provide an output signal to various output devices, including a display 165, speakers 175, and other peripheral devices 185.
  • the display 165 of various embodiments includes one or more of, for example, a touchscreen display, an organic light-emitting diode (OLED) display, a curved display, and/or a foldable display.
  • OLED organic light-emitting diode
  • the display 165 can be for a television, a tablet, a laptop, a cell phone (mobile phone), or other device.
  • the display 165 can also be integrated with other components (for example, as in a smart phone), or separate (for example, an external monitor for a laptop).
  • the other peripheral devices 185 include, in various examples of embodiments, one or more of a stand-alone digital video disc (or digital versatile disc) (DVD, for both terms) player, a disk player, a stereo system, and/or a lighting system.
  • DVD digital versatile disc
  • Various embodiments use one or more peripheral devices 185 that provide a function based on the output of the system 100. For example, a disk player performs the function of playing the output of the system 100.
  • control signals are communicated between the system 100 and the display 165, speakers 175, or other peripheral devices 185 using signaling such as AV.Link, Consumer Electronics Control (CEC), or other communications protocols that enable device-to-device control with or without user intervention.
  • the output devices can be communicatively coupled to system 100 via dedicated connections through respective interfaces 160, 170, and 180. Alternatively, the output devices can be connected to system 100 using the communications channel 190 via the communications interface 150.
  • the display 165 and speakers 175 can be integrated in a single unit with the other components of system 100 in an electronic device such as, for example, a television.
  • the display interface 160 includes a display driver, such as, for example, a timing controller (T Con) chip.
  • the display 165 and speaker 175 may alternatively be separate from one or more of the other components, for example, if the RF portion of input 105 is part of a separate set-top box.
  • the output signal may be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
  • the embodiments can be carried out by computer software implemented by the processor 110 or by hardware, or by a combination of hardware and software. As a non-limiting example, the embodiments can be implemented by one or more integrated circuits.
  • the memory 120 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples.
  • the processor 110 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, digital signal processors (DSPs), processors based on a single or on a multi-core architecture, sequential or parallel architectures, specialized circuits such as Field Programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry, as non-limiting examples.
  • DSPs digital signal processors
  • FPGA Field Programmable gate arrays
  • ASIC application specific circuits
  • FIG. 2 illustrates an embodiment of an end-to-end compression system wherein one or more embodiments described below can be implemented.
  • the input x to the encoder part of the network can consist of an image or frame of a video, or a part of an image, or a tensor representing a group of images, or a tensor representing a part (crop) of a group of images.
  • the input can have one or multiple components, e.g.: monochrome, RGB or YCbCr components.
  • the input x has 3 components of size HxW respectively.
  • the input x is fed into the encoder network ga, also known as the analysis transform.
  • the analysis transform ga is usually a sequence of convolutional layers (Conv in FIG. 2) with activation functions (Activation in FIG. 2).
  • the convolutions can include a mechanism to spatially down-sample the input, for instance selecting a convolution with a stride of 2 in both vertical and horizontal directions would result in an output having half the size of the input in both dimensions.
  • the output of a convolution is a tensor of shape CxHxW, where H and W are the spatial height and width, respectively, and C corresponds to an adjustable number of channels.
  • a first convolution of ga takes as input a tensor of 3 channels which correspond to the color components.
  • This encoder network can be seen as a learned transform, that is lossy as there are generally fewer elements in the output latent tensor than the source 3xHxW input.
  • the output of the analysis is called a latent representation or a tensor of latent variables.
  • a set of latent variables constructs a latent space, which is also frequently used in the context of neural network-based end-to-end compression.
  • the output y = ga(x) is quantized, resulting in a tensor ŷ which is then entropy coded into a binary stream (bitstream) for storage or transmission.
  • bitstream binary stream
  • the bitstream is entropy decoded (ED) to obtain ŷ.
  • the decoder network gs, also called the synthesis transform, generates the reconstructed output x̂ = gs(ŷ), which is an approximation of the original x, from the quantized latent representation ŷ.
  • the synthesis transform gs is usually a sequence of up-sampling convolutions, e.g., transpose convolutions or convolutions followed by up-sampling filters.
  • the decoder network can be seen as a learned inverse transform, or a denoising and generative transform. The performance of a compression system is measured as a tradeoff between the number of bits needed to transmit versus the quality of the decoded content.
  • a compression model can be trained using a loss following the Lagrangian form L = R + λD, where R represents the rate or bitrate and D the distortion of the decoded content.
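  • As an illustration of the transforms and training objective described above, the following is a minimal PyTorch sketch of an analysis/synthesis pair and the loss L = R + λD. It is a sketch under assumptions, not the embodiment's exact networks: the layer widths, kernel sizes, additive-uniform-noise quantization proxy, and the bit estimate passed to the loss are illustrative choices.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class AnalysisTransform(nn.Module):   # g_a: x -> y, stride-2 convs halve H and W
            def __init__(self, c=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(3, c, 5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv2d(c, c, 5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv2d(c, c, 5, stride=2, padding=2))
            def forward(self, x):
                return self.net(x)

        class SynthesisTransform(nn.Module):  # g_s: y_hat -> x_hat, learned inverse
            def __init__(self, c=128):
                super().__init__()
                self.net = nn.Sequential(
                    nn.ConvTranspose2d(c, c, 5, 2, 2, output_padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(c, c, 5, 2, 2, output_padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(c, 3, 5, 2, 2, output_padding=1))
            def forward(self, y_hat):
                return self.net(y_hat)

        def rd_loss(x, x_hat, bits, lmbda=0.01):
            # L = R + lambda * D, with R in bits per pixel and D the MSE distortion
            num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
            return bits / num_pixels + lmbda * F.mse_loss(x_hat, x)

        x = torch.rand(1, 3, 64, 64)
        y = AnalysisTransform()(x)
        y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # quantization proxy
        x_hat = SynthesisTransform()(y_hat)
        loss = rd_loss(x, x_hat, bits=torch.tensor(4096.0))  # bits from an entropy model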
  • FIG. 2 and FIG. 3 show autoencoders in an actual inference configuration that consists of an encoder producing a bitstream that is then transmitted and decoded.
  • the entropy of ŷ is determined with respect to a learned probability model pŷ, as depicted in FIG.3.
  • Regarding the distortion, in existing approaches such NNs are trained using several types of losses that can be used alone or in combination.
  • a loss based on an “objective” metric, typically the Mean Squared Error (MSE), noted D(x, x̂) in FIG. 3, can be used, or for instance a loss based on structural similarity (SSIM). The perceived quality is then not as good as with the second type of loss below, but the fidelity to the original signal (image) is higher.
  • losses based on “subjective” metrics can also be used, typically using Generative Adversarial Networks (GANs) during the training stage, or an advanced visual metric via a proxy NN.
  • GANs Generative Adversarial Networks
  • FIG.4 depicts an approach called auto-encoder with a hyperprior, which is further described in [Ballé]: J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv:1802.01436, May 2018.
  • the model now includes additional convolutional sequences ha and hs that produce learned distribution parameters for each element of the latent ŷ, through a latent z and its quantized version ẑ.
  • the learned distribution parameters are the scales, or the means and scales, of Gaussian or Laplace distributions, for each element of the latent ŷ.
  • the tensor z output by ha needs to be encoded and transmitted as side information for the decoder to decode ŷ.
  • the tensor z is thus quantized to the tensor ẑ and entropy encoded (EE).
  • the bitstream representing the latent ẑ of the quantized learned parameters is entropy decoded (ED) and fed into the synthesis transform of the distribution information hs.
  • ED entropy decoded
  • the encoder contains parts of the decoder to generate the exact same metadata that the decoder will decode to process the rest of the bitstream.
  • the synthesis hs must perform bit-exact operations between the encoder and the decoder for the system to work. A slight difference in the generated parameters would completely crash the arithmetic decoder for ŷ.
  • an autoregressive context model is described in [Minnen]: D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” NeurIPS 2018.
  • An example of a Cx5x5 kernel is shown at the bottom left of FIG. 5.
  • a context is obtained for the black center location using the gray samples of the tensor.
  • the context is output for the location (x,y) of the tensor ŷ, as a Cx1x1 vector that is for example a weighted combination of the causal samples covered by the kernel.
  • entropy parameters ep
  • the context is then fed into the entropy parameters network ep, which outputs the tensor of entropy parameters μ, σ for location (x,y).
  • An example of the entropy parameter network is illustrated on FIG. 5 as a sequence of 1x1 convolution layers and activation layers.
  • the entropy parameter network transforms its input, a 3xCx1x1 vector resulting from the concatenation of the context with the hyperprior, into a 2xCx1x1 vector comprising the μ, σ for the location (x,y) of the tensor.
  • the same operations for determining the entropy parameters μ, σ for each location (x,y) of the tensor are performed on the encoder and decoder parts.
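  • A minimal sketch of such a masked context prediction and the entropy parameters network is given below, assuming PyTorch. The channel count C, the 2C-channel hyperprior tensor, and the layer widths are illustrative choices matching the 3xCx1x1 to 2xCx1x1 description above.

        import torch
        import torch.nn as nn

        class MaskedConv2d(nn.Conv2d):
            # Causal 5x5 convolution: only samples above, or to the left on the
            # same row, of the center contribute (the grey samples of FIG. 5).
            def __init__(self, *args, **kwargs):
                super().__init__(*args, **kwargs)
                mask = torch.zeros_like(self.weight)
                kh, kw = self.weight.shape[-2:]
                mask[..., : kh // 2, :] = 1
                mask[..., kh // 2, : kw // 2] = 1
                self.register_buffer("mask", mask)
            def forward(self, x):
                self.weight.data *= self.mask   # keep only the causal taps
                return super().forward(x)

        C = 128
        context_prediction = MaskedConv2d(C, C, kernel_size=5, padding=2)
        entropy_parameters = nn.Sequential(      # 1x1 convolutions, as in FIG. 5
            nn.Conv2d(3 * C, 2 * C, 1), nn.ReLU(),
            nn.Conv2d(2 * C, 2 * C, 1))

        y_hat = torch.randn(1, C, 16, 16)        # quantized latent (illustrative)
        hyper = torch.randn(1, 2 * C, 16, 16)    # hyperprior output (illustrative)
        ctx = context_prediction(y_hat)          # Cx1x1 context per location
        mu, sigma = entropy_parameters(
            torch.cat([ctx, hyper], dim=1)).chunk(2, dim=1)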
  • FIG. 6 illustrates an example of a method for predicting a context at different resolutions of a tensor, according to an embodiment.
  • An input tensor y is downscaled (601, 602) at different spatial resolutions (y0, y1 and y2 on FIG.6, with y0 being the input tensor at the input resolution).
  • For a location (x,y) of a tensor, and for each level y0, y1 and y2, a context prediction, respectively CP00, CP01 and CP02, is applied to the tensor of that level.
  • Each context prediction outputs a Cx1x1 tensor representing the context for the location (x,y) at the considered level, C being the number of channels of the input tensor.
  • the output of the context prediction is upscaled (603, 604) to the next level and combined (605, 606) with the next level, up to the input resolution to provide the final context (ctx on FIG.6) to be used for entropy encoding the location (x,y) of the input tensor y.
  • the output of the context prediction CP02 is upscaled (603) to the next upper resolution and combined (605) with the output of the context prediction CP01.
  • the result of the combination is then upscaled (604) to the input resolution and combined (606) with the output of the context prediction CP00.
  • each output of a context prediction can be upscaled directly to the input resolution and combined with the output of the context prediction CP00.
  • the same context prediction is done at the encoder and decoder sides so that the stream is correctly entropy decoded.
  • three levels have been defined here, but the mechanism for context prediction can be applied to more or fewer levels, for instance only two levels, or more than three levels.
  • only the causal parts of the downscaled tensor with respect to the location (x,y) currently being encoded/decoded can be used for determining the context prediction, for instance using the kernel illustrated on FIG.5.
  • the context predictions performed at the lower levels can take advantage of the full kernels.
  • the downscaled tensors are entropy encoded and transmitted to the decoder, in that case, the full decoded downscaled tensors can be used in the context prediction for entropy decoding a tensor of a higher level.
  • at the decoder side, there is no need to downscale the tensor y as the downscaled tensors are entropy decoded by the decoder and thus available to the decoder.
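  • The pipeline of FIG. 6 can be sketched as follows, reusing the MaskedConv2d of the previous sketch for the causal prediction CP00; average pooling for the down-sampling (601, 602) and nearest-neighbor interpolation for the up-sampling (603, 604) are assumptions, as are the kernel sizes.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        C = 128
        cp00 = MaskedConv2d(C, C, 5, padding=2)  # causal at the input resolution
        cp01 = nn.Conv2d(C, C, 5, padding=2)     # full kernel on decoded level 1
        cp02 = nn.Conv2d(C, C, 5, padding=2)     # full kernel on decoded level 2

        def multi_resolution_context(y0, y1, y2):
            ctx = cp02(y2)                                        # lowest level
            ctx = cp01(y1) + F.interpolate(ctx, scale_factor=2)   # 603 + 605
            return cp00(y0) + F.interpolate(ctx, scale_factor=2)  # 604 + 606

        y0 = torch.randn(1, C, 16, 16)
        y1 = F.avg_pool2d(y0, 2)                 # down-sampling 601
        y2 = F.avg_pool2d(y1, 2)                 # down-sampling 602
        ctx = multi_resolution_context(y0, y1, y2)  # final context ctx for EE/ED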
  • FIG. 7 illustrates an embodiment of the context prediction determined using multi-resolution tensors as described in FIG. 6.
  • these tensors are compressed, serialized, and transmitted in reverse order yk, ..., y0, in pairs with compressed and serialized representations of their side-information derived tensors zk, ..., z0, and they are decoded in this same order.
  • the entropy parameters μl, σl for the current level tensor ŷl may be predicted from any or all of (i) a causal context of the level tensor ŷl, (ii) the full context of any previously decoded level tensors ŷl+1, ..., ŷk.
  • each yj is fed into a "context prediction" network CPlj.
  • the resulting outputs are then upscaled to match the spatial dimensions, and then aggregated via a reduction operation (visualized by "+”) such as summation or concatenation.
  • the aggregated result is fed into an entropy parameters network EPl which also takes the side-information ẑl as input, and outputs the entropy parameters μl, σl.
  • These entropy parameters represent the Gaussian distributions N(μl, σl) with which yl is entropy coded to produce a bitstream with rate Ryl, along with the bitstream associated with the side-information tensor zl.
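  • For illustration, the rate Ryl induced by the Gaussian entropy parameters can be estimated as follows, using the common relaxation in which the probability of each quantized sample is the Gaussian CDF integrated over a unit-width bin. This is a sketch of the rate estimate only, not of an actual arithmetic coder.

        import torch

        def gaussian_bits(y_hat, mu, sigma, eps=1e-9):
            # P(y_hat) ~= CDF(y_hat + 0.5) - CDF(y_hat - 0.5); bits = -log2 P
            normal = torch.distributions.Normal(mu, sigma.clamp_min(1e-6))
            p = normal.cdf(y_hat + 0.5) - normal.cdf(y_hat - 0.5)
            return (-torch.log2(p.clamp_min(eps))).sum()

        y_hat = torch.round(torch.randn(1, 128, 16, 16))  # quantized level tensor
        mu = torch.zeros_like(y_hat)
        sigma = torch.ones_like(y_hat)
        R_yl = gaussian_bits(y_hat, mu, sigma)            # estimated bits for the level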
  • FIG.8 is a diagram visualizing the effective mask positions for decoding a particular pixel of y0 in the embodiment of FIG.7. In order to maintain causality, CP00 must be masked.
  • FIG.9 illustrates an example of a method 900 for encoding a latent representative of at least one part of an image according to an embodiment.
  • the method comprises at least the following steps for at least one sample of a first tensor representative of the at least one part of the image.
  • a first context is obtained from at least one or more samples of at least one second tensor, the at least second tensor being obtained from at least one down-sampling of the first tensor.
  • the first context is obtained as described above: at least one second context is obtained from the at least one second tensor and then up-sampled to an upper resolution, for instance the resolution of the first tensor or the resolution of a next tensor level.
  • the first context can also be determined using one or more causal samples of the first tensor, that is samples at the resolution of the input tensor.
  • the context obtained from the one or more causal samples of the first tensor is combined with the up-sampled second context. Combining can be addition or aggregation of the contexts.
  • at least one entropy parameter is determined based on the first context.
  • the determining of the entropy parameters can also use the side-information derived for the first tensor.
  • the side-information is a hyperprior learned for the first tensor.
  • the method further comprises encoding the hyperprior learned for the first tensor.
  • the first tensor is entropy encoded using the determined at least one entropy parameter.
  • the at least one second tensor is also encoded so that all kernel samples can be used when determining the context prediction based on its level.
  • the entropy encoding of the at least one second tensor uses the same mechanisms of context prediction as for the input tensor in a recursive manner, as described with FIG.7 for example.
  • FIG.10 illustrates an example of a method 1000 for decoding the latent representative of the at least one part of the image according to an embodiment.
  • the method comprises at least the following steps for at least one sample of a first tensor representative of the at least one part of the image.
  • a first context is obtained from at least one or more samples of at least one second tensor, the at least second tensor being representative of the first tensor at a lower resolution.
  • at least one entropy parameter is determined based on the first context. For correct entropy decoding of the first tensor, the first context and entropy parameters are obtained in a similar manner as in the encoding side.
  • the first tensor is entropy decoded using the determined at least one entropy parameter.
  • the lower resolutions of the first tensor are entropy decoded before determining the first context.
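  • The decoding order described above can be summarized by the following sketch, where decode_side_info and decode_level are hypothetical helpers standing in for the entropy decoding of a level's hyperprior and of the level tensor itself.

        def decode_latent(bitstream, num_levels):
            decoded = {}
            for level in reversed(range(num_levels)):       # lowest resolution first
                z_hat = decode_side_info(bitstream, level)  # hyperprior for the level
                # full context of already-decoded lower-resolution levels,
                # plus causal context within the current level
                lower = [decoded[l] for l in range(level + 1, num_levels)]
                decoded[level] = decode_level(bitstream, z_hat, lower)
            return decoded[0]                               # tensor at input resolution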
  • the above methods are used in a neural-network-based encoder and decoder respectively, such as an auto-encoder.
  • Embodiments described herein provide for recursive level decoding. Similar to the recursive multi-level strategy used with wavelet transform-based compression, the pyramidal approach described herein allows each level of the pyramid to refer to a smaller version of itself while decoding. This can lead to savings in rate since a "lower resolution" version of a level is often a very good predictor of the "higher resolution" elements within the current level. Described embodiments also provide a larger available spatial context window during reconstruction.
  • Information about a larger "global" context can be used to improve the prediction of the entropy parameters μl, σl to better match ŷl, thus producing savings in rate.
  • "future" context is fully available at a lower resolution.
  • the information about "future" decoded pixels is already available at a lower resolution.
  • the context is limited to anything spatially above or to the left of the pixel currently being decoded.
  • some embodiments of the method described herein also provide information about "future" decoded pixels spatially below or to the right of the pixel currently being decoded.
  • Described embodiments provide potentially similar RD performance as heavier methods, at a lower computational cost.
  • the context captured is quite similar to that of the costlier autoregression methods, except perhaps for any high-resolution moiré-like patterns within the tensor, which are fairly rare.
  • any or no form of self-autoregression for reconstruction of ŷl can be used.
  • Any level ŷl may be reconstructed all at once, with raster-scan autoregression, cross-context channel modeling (for example as in [Ma]), checkerboard autoregression as described in [He]: D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, “ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding,” arXiv:2203.10886, Mar. 2022, or any other such method.
  • Various levels need not use the same reconstruction method.
  • a gating function can be used on the level tensor yl before the down-sampling that produces the next level tensor yl+1.
  • One may include a trainable masking operation such as a convolution (or multiplication by a sigmoid-activated attention map) in order to allow the model to selectively choose which parts of the tensor are important at various levels, as sketched below. This is visualized in FIG.11 with the example of convolution operations (Conv on FIG.11) that are applied before the down-sampling operations. If any part of the tensor is deemed too costly to reconstruct precisely at any given level, the gating function may output a constant for that portion of the tensor.
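  • A minimal sketch of such a gating, assuming a 1x1 convolution followed by a sigmoid as the attention map and average pooling as the down-sampling (both illustrative choices, not the embodiment's exact operations), is:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GatedDownsample(nn.Module):
            def __init__(self, channels):
                super().__init__()
                self.gate = nn.Conv2d(channels, channels, 1)  # Conv on FIG. 11
            def forward(self, y):
                attention = torch.sigmoid(self.gate(y))       # per-sample weight in (0, 1)
                return F.avg_pool2d(y * attention, 2)         # gate, then down-sample

        y0 = torch.randn(1, 128, 16, 16)
        y1 = GatedDownsample(128)(y0)                          # next (lower) level tensor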
  • each level ŷl may be associated with a distortion loss term D(xl, x̂l), where x̂l = gs(ŷl) is the corresponding lower-resolution image that is decoded from ŷl.
  • FIG.12 illustrates a block diagram of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to an embodiment.
  • the modules of the auto-encoder such as the analysis function ga, the synthesis function gs, the hyperprior networks ha0, hs0, ha1, hs1, EE, Q, and ED have similar functions as the corresponding modules of FIG.4 or 5.
  • the latent y is downscaled to a spatial resolution lower than the spatial resolution of the latent y.
  • Both the latent y and its downscaled version are quantized to produce quantized latents ŷ0 and ŷ1 for levels 0 and 1 respectively.
  • a hyperprior network is used for learning the side-information z1 for encoding the latent ŷ1.
  • a quantized latent ẑ1 representative of the side-information z1 is encoded, for example using a fully factorized prior model.
  • context prediction cp11 is applied using causal samples of the latent ŷ1.
  • the location (x,y) is illustrated as the central block sample in the 5x5 kernel and samples that are available (here the causal samples) are illustrated by the grey samples.
  • the context output by the context prediction module cp11 is concatenated with the side-information ẑ1.
  • the resulting concatenated vector is fed to the entropy parameter network ep1 to output the entropy parameters μ1, σ1 used for entropy encoding the location (x,y) of ŷ1.
  • at the decoder side, the latent ŷ1 is reconstructed by entropy decoding the hyperprior latent ẑ1 and transforming the hyperprior latent ẑ1 into the distribution parameters.
  • a spatial location (x,y) of the latent ŷ1 is entropy decoded using the context output by the context prediction module cp11, which considers only the causal samples of ŷ1 (that is, the samples that have already been decoded).
  • the context is concatenated with the reconstructed side-information ẑ1 and fed to the entropy parameter network ep1 to output the entropy parameters μ1, σ1 used for entropy decoding the location (x,y) of ŷ1.
  • a hyperprior network is used for learning the side-information z0 for encoding the latent ŷ0.
  • a quantized latent ẑ0 representative of the side-information z0 is encoded, for example using a fully factorized prior model.
  • For each spatial location (x,y) of the latent ŷ0, context prediction cp00 is determined using causal samples of the latent ŷ0 (using the grey samples in the kernel illustrated on FIG.12). Context prediction cp01 is also determined using full samples of the latent ŷ1 (as illustrated on FIG.12, all samples of the kernel used are shown in grey, as it is considered that at the decoder side ŷ1 is already reconstructed). The context output by the context prediction module cp01 is upsampled to the spatial resolution of the latent ŷ0 and combined/added to the context output by the context prediction module cp00. The resulting context is concatenated with the side-information ẑ0.
  • the resulting concatenated vector is fed to the entropy parameter network ep0 to output the entropy parameters μ0, σ0 used for entropy encoding the location (x,y) of ŷ0.
  • the latent ŷ0 is reconstructed by entropy decoding the hyperprior latent ẑ0 and transforming the hyperprior latent ẑ0 into the distribution parameters.
  • a spatial location (x,y) of the latent ŷ0 is entropy decoded using the entropy parameters μ0, σ0 that are determined in a similar manner as on the encoder side.
  • Context prediction cp00 is determined using causal samples of the latent ŷ0.
  • Context prediction cp01 is also determined using full samples of the decoded latent ŷ1.
  • the context output by the context prediction module cp01 is upsampled to the spatial resolution of the latent ŷ0 and combined/added to the context output by the context prediction module cp00.
  • the resulting context is concatenated with the side-information ẑ0.
  • the resulting concatenated vector is fed to the entropy parameter network ep0.
  • FIG. 13 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to another embodiment.
  • the embodiment illustrated on FIG. 13 differs from the embodiment of FIG.12 in that the side-information z1 used for entropy encoding the tensor of the lower level (ŷ1) is not learned but derived from the side information z0 learned for the tensor at the input resolution ŷ0.
  • the side information z1 is obtained by downsampling the side-information z0.
  • the context determined by the context prediction module cp11 is concatenated with the derived side-information z1 to be fed into the entropy parameter network ep1 for entropy encoding the tensor ŷ1.
  • the other parts of the scheme illustrated in FIG.13 are similar to the corresponding modules of FIG.12.
  • at the decoder side, the side-information z1 is derived by downsampling the reconstructed side information ẑ0.
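  • The derivation can be as simple as the following sketch, applied identically at the encoder and the decoder; average pooling is an illustrative down-sampling, the embodiments not mandating a particular filter.

        import torch.nn.functional as F

        def derive_side_info(z0_hat):
            # z1 derived from the (reconstructed) level-0 side information
            return F.avg_pool2d(z0_hat, kernel_size=2)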
  • the apparatus comprises a processor 1410 and can be interconnected to a memory 1420 through at least one port. Both the processor 1410 and the memory 1420 can also have one or more additional interconnections to external connections.
  • the processor 1410 is also configured to either insert or receive information in a bitstream, and to either compress, encode, or decode using program code instructions implementing the aforementioned methods when executed by the processor.
  • the program code to be loaded onto processor 1410 to perform the various aspects described in this application may be stored in a storage device and subsequently loaded onto memory 1420 for execution by processor 1410.
  • the memory 1420 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples.
  • the processor 1410 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, digital signal processors (DSPs), processors based on a single core architecture or on a multi-core architecture, sequential or parallel architectures, specialized circuits such as Field Programmable gate arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry, as non-limiting examples.
  • DSPs digital signal processors
  • FPGA Field Programmable gate arrays
  • ASIC application specific circuits
  • the device A comprises a processor in relation with memory RAM and ROM which are configured to implement a method for encoding an image or a video as described using the aforementioned methods and the device B comprises a processor in relation with memory RAM and ROM which are configured to implement a method for decoding an image or a video as described using the aforementioned methods.
  • the network is a broadcast network, adapted to broadcast/transmit encoded image or video from device A to decoding devices including the device B.
  • a signal intended to be transmitted by the device A, carries at least one bitstream comprising coded data representative of a tensor encoding an image or a video encoded according to the methods as explained above.
  • the signal also comprises one or more of the following: coded data representative of one or more downscaled versions of the tensor, coded data representative of side information learned for the tensor, or coded data representative of side information derived for the one or more downscaled versions of the tensor.
  • FIG.16 shows an example of the syntax of such a signal when the coded data is transmitted over a packet-based transmission protocol.
  • Each transmitted packet P comprises a header H and a payload PAYLOAD.
  • the payload comprises coded data encoded according to any one of the embodiments described above.
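  • As an illustration only, such a packet could be serialized and parsed as follows; the 4-byte big-endian length field used as the header H is a hypothetical choice, not a normative syntax.

        import struct

        def packetize(payload: bytes) -> bytes:
            header = struct.pack(">I", len(payload))  # H: payload length
            return header + payload

        def parse_packet(packet: bytes) -> bytes:
            (length,) = struct.unpack(">I", packet[:4])
            return packet[4:4 + length]               # PAYLOAD: coded data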
  • Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., such as, for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required.
  • the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.
  • Various methods and other aspects described in this application can be used as additional modules or to modify modules of an image or video auto-encoder which are based on neural-network as shown in FIG.2-5. Unless indicated otherwise, or technically precluded, the aspects described in this application can be used individually or in combination.
  • Various numeric values are used in the present application. The specific values are for example purposes and the aspects described are not limited to these specific values.
  • Various implementations involve decoding.
  • “Decoding”, as used in this application, can encompass all or part of the processes performed, for example, on a received encoded sequence in order to produce a final output suitable for display.
  • such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, inverse transformation, and differential decoding.
  • such processes also, or alternatively, include processes performed by a decoder of various implementations described in this application.
  • “decoding” refers only to entropy decoding
  • “decoding” refers only to differential decoding
  • “decoding” refers to a combination of entropy decoding and differential decoding.
  • decoding process is intended to refer specifically to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art.
  • Various implementations involve encoding.
  • encoding as used in this application can encompass all or part of the processes performed, for example, on an input video sequence in order to produce an encoded bitstream.
  • processes include one or more of the processes typically performed by an encoder, for example, partitioning, differential encoding, transformation, quantization, and entropy encoding.
  • such processes also, or alternatively, include processes performed by an encoder of various implementations described in this application.
  • encoding refers only to entropy encoding
  • encoding refers only to differential encoding
  • encoding refers to a combination of differential encoding and entropy encoding.
  • This information can be packaged or arranged in a variety of manners, including for example manners common in video standards such as putting the information into an SPS, a PPS, a NAL unit, a header (for example, a NAL unit header, or a slice header), or an SEI message.
  • Other manners are also available, including for example manners common for system level or application level standards such as putting the information into one or more of the following: a. SDP (session description protocol), a format for describing multimedia communication sessions for the purposes of session announcement and session invitation, for example as described in RFCs and used in conjunction with RTP (Real-time Transport Protocol) transmission.
  • SDP session description protocol
  • RTP Real-time Transport Protocol
  • DASH MPD Media Presentation Description
  • a Descriptor is associated to a Representation or collection of Representations to provide additional characteristics to the content Representation.
  • RTP header extensions for example as used during RTP streaming.
  • ISO Base Media File Format for example as used in OMAF and using boxes which are object-oriented building blocks defined by a unique type identifier and length also known as 'atoms' in some specifications.
  • HLS HTTP Live Streaming
  • a manifest can be associated, for example, to a version or collection of versions of a content to provide characteristics of the version or collection of versions.
  • When a figure is presented as a flow diagram, it should be understood that it also provides a block diagram of a corresponding apparatus. Similarly, when a figure is presented as a block diagram, it should be understood that it also provides a flow diagram of a corresponding method/process.
  • the implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program).
  • An apparatus can be implemented in, for example, appropriate hardware, software, and firmware.
  • a processor which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device.
  • Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
  • PDAs portable/personal digital assistants
  • this application may refer to “determining” various pieces of information. Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory. Further, this application may refer to “accessing” various pieces of information. Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • this application may refer to “receiving” various pieces of information.
  • Receiving is, as with “accessing”, intended to be a broad term.
  • Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
  • “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B).
  • such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
  • the word “signal” refers to, among other things, indicating something to a corresponding decoder.
  • the same parameter is used at both the encoder side and the decoder side.
  • an encoder can transmit (explicit signaling) a particular parameter to the decoder so that the decoder can use the same particular parameter.
  • signaling can be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameter. By avoiding transmission of any actual functions, a bit savings is realized in various embodiments. It is to be appreciated that signaling can be accomplished in a variety of ways.
  • one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder in various embodiments. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.
  • implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted.
  • the information can include, for example, instructions for performing a method, or data produced by one of the described implementations.
  • a signal can be formatted to carry the bitstream of a described embodiment.
  • Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
  • the formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
  • the information that the signal carries can be, for example, analog or digital information.
  • the signal can be transmitted over a variety of different wired or wireless links, as is known.
  • the signal can be stored on a processor-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Methods and apparatuses for encoding/decoding at least one part of an image using context prediction obtained from one or more lower spatial resolutions of a first tensor are disclosed, wherein for at least one sample of the first tensor representative of the at least one part of an image, a first context is obtained from at least one or more samples of at least one second tensor, the at least second tensor being obtained from at least one down-sampling of the first tensor. At least one entropy parameter is determined based on the first context and the first tensor is entropy encoded or decoded using the determined at least one entropy parameter.

Description

A METHOD AND AN APPARATUS FOR ENCODING/DECODING AT LEAST ONE PART OF AN IMAGE USING MULTI-LEVEL CONTEXT MODEL

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the priority to US Patent Application No. 63/442,506, filed on 1st February 2023, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

At least one of the present embodiments generally relates to a method or an apparatus for compression of images and videos using Neural Network (NN) based tools.

BACKGROUND

In recent years, deep Neural Networks have been developed to surpass the compression performance of traditional codecs. For instance, the Joint Video Exploration Team (JVET) between ISO/MPEG and ITU is currently studying such tools to replace some modules of the latest standard H.266/VVC, as well as the replacement of the whole structure by end-to-end auto-encoder methods. More precisely, different approaches can be distinguished. Purely encoder methods can be seen as NN-based algorithms that are used to enhance or speed up an encoder of an existing codec. In that case, there is no normative change, and any existing standard can be used. Methods that are built on top of existing standards replace one or more modules of existing state-of-the-art codecs with NN-based methods, e.g., post-filters, prediction modules, etc. End-to-end NN-based codecs are a complete departure from traditional compression schemes that include prediction, transform, quantization and entropy coding modules. Contrary to traditional methods, which apply pre-defined prediction modes and transforms, NN-based methods rely on many parameters that are learned on a large dataset during a training stage, by iteratively minimizing a loss function. In the case of image and video compression, the loss function is defined by a rate-distortion cost, where the rate stands for an estimation of the bitrate of an encoded bitstream, and the distortion quantifies the quality of a decoded video against the original input. Traditionally, the quality of the decoded input image is optimized, for example, based on the mean squared error or an approximation of the human-perceived visual quality.

SUMMARY

At least one of the present embodiments generally relates to a method or an apparatus in the context of the compression of images and videos using neural networks. At least one of the present embodiments generally relates to multi-resolution context prediction for encoding or decoding a latent representative of an image or a video. Some embodiments relate to a method for encoding a tensor or latent representative of at least one part of an image, the encoding comprising multi-resolution context prediction for determining entropy parameters. Some embodiments relate to a method for decoding a latent representative of at least one part of an image, the decoding comprising multi-resolution context prediction for determining entropy parameters. In some embodiments, the methods comprise, for at least one sample of a first tensor representative of at least one part of an image: obtaining a first context from at least one or more samples of at least one second tensor, the at least one second tensor being obtained from at least one down-sampling of the first tensor; determining at least one entropy parameter based on the first context; and entropy encoding or decoding the first tensor using the determined at least one entropy parameter. According to another aspect, there is provided an apparatus. The apparatus comprises a processor.
The processor can be configured to implement the general aspects by executing any of the described methods. According to another general aspect of at least one embodiment, there is provided a device comprising an apparatus configured to implement the general aspects by executing any of the described embodiments; and at least one of (i) an antenna configured to receive a signal, the signal including a video or an image, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the video or image, or (iii) a display configured to display an output representative of the video or image. According to another aspect, there is provided a signal carrying coded data representative of at least one part of an image or a video encoded according to any one of the embodiments described herein. According to another general aspect of at least one embodiment, there is provided a computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the embodiments or variants. These and other aspects, features and advantages of the general aspects will become apparent from the following detailed description of exemplary embodiments, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a block diagram of a system within which aspects of the present embodiments may be implemented.
FIG. 2 illustrates a block diagram of an embodiment of an auto-encoder for image or video compression.
FIG. 3 illustrates a block diagram of an embodiment of a training of an auto-encoder for image or video compression.
FIG. 4 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture.
FIG. 5 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction.
FIG. 6 illustrates an example of a method for predicting a context at different resolutions of a tensor, according to an embodiment.
FIG. 7 illustrates an example of a method for encoding a tensor using a context predicted at different resolutions, according to an embodiment.
FIG. 8 illustrates an example of effective masking positions for decoding a particular sample of a tensor to reconstruct, according to an embodiment.
FIG. 9 illustrates an example of a method for encoding a latent representative of at least one part of an image, according to an embodiment.
FIG. 10 illustrates an example of a method for decoding a latent representative of at least one part of an image, according to an embodiment.
FIG. 11 illustrates an example of a method for encoding a tensor using a context predicted at different resolutions, according to another embodiment.
FIG. 12 illustrates a block diagram of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to an embodiment.
FIG. 13 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to another embodiment.
FIG. 14 illustrates one embodiment of an apparatus for encoding or decoding an image or a video according to any one of the embodiments described herein.
FIG. 15 shows two remote devices communicating over a communication network in accordance with an example of present principles.
FIG. 16 shows the syntax of a signal in accordance with an example of present principles.
DETAILED DESCRIPTION

This application describes a variety of aspects, including tools, features, embodiments, models, approaches, etc. Many of these aspects are described with specificity and, at least to show the individual characteristics, are often described in a manner that may sound limiting. However, this is for purposes of clarity in description, and does not limit the application or scope of those aspects. Indeed, all of the different aspects can be combined and interchanged to provide further aspects. Moreover, the aspects can be combined and interchanged with aspects described in earlier filings as well. The aspects described and contemplated in this application can be implemented in many different forms. FIGs. 1-16 below provide some embodiments, but other embodiments are contemplated and the discussion of FIGs. 1-16 does not limit the breadth of the implementations. At least one of the aspects generally relates to image or video encoding and decoding, and at least one other aspect generally relates to transmitting a bitstream generated or encoded. These and other aspects can be implemented as a method, an apparatus, a computer readable storage medium having stored thereon instructions for encoding or decoding image or video data according to any one of the methods described, and/or a computer readable storage medium having stored thereon a bitstream generated according to any one of the methods described. FIG. 1 illustrates a block diagram of an example of a system in which various aspects and embodiments can be implemented. System 100 may be embodied as a device including the various components described below and is configured to perform one or more of the aspects described in this application. Examples of such devices include, but are not limited to, various electronic devices such as personal computers, laptop computers, smartphones, tablet computers, digital multimedia set top boxes, digital television receivers, personal video recording systems, connected home appliances, and servers. Elements of system 100, singly or in combination, may be embodied in a single integrated circuit, multiple ICs, and/or discrete components. For example, in at least one embodiment, the processing and encoder/decoder elements of system 100 are distributed across multiple ICs and/or discrete components. In various embodiments, the system 100 is communicatively coupled to other systems, or to other electronic devices, via, for example, a communications bus or through dedicated input and/or output ports. In various embodiments, the system 100 is configured to implement one or more of the aspects described in this application. The system 100 includes at least one processor 110 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this application. Processor 110 may include embedded memory, an input/output interface, and various other circuitries as known in the art. The system 100 includes at least one memory 120 (e.g., a volatile memory device, and/or a non-volatile memory device). System 100 includes a storage device 140, which may include non-volatile memory and/or volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 140 may include an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples.
System 100 includes an encoder/decoder module 130 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 130 may include its own processor and memory. The encoder/decoder module 130 represents module(s) that may be included in a device to perform the encoding and/or decoding functions. As is known, a device may include one or both of the encoding and decoding modules. Additionally, encoder/decoder module 130 may be implemented as a separate element of system 100 or may be incorporated within processor 110 as a combination of hardware and software as known to those skilled in the art. In some embodiments, the encoder/decoder module 130 is a NN-based auto-encoder, e.g., an auto-encoder or a variational auto-encoder described in relation with FIG. 2-4, and implements one or more embodiments of the transform block as further described below. In the present document, the transform block described in the embodiments is called a multi-resolution transform block for clarity; other wording can be used without limiting the scope of the embodiments described herein. Program code to be loaded onto processor 110 or encoder/decoder 130 to perform the various aspects described in this application may be stored in storage device 140 and subsequently loaded onto memory 120 for execution by processor 110. In accordance with various embodiments, one or more of processor 110, memory 120, storage device 140, and encoder/decoder module 130 may store one or more of various items during the performance of the processes described in this application. Such stored items may include, but are not limited to, the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic. In some embodiments, memory inside of the processor 110 and/or the encoder/decoder module 130 is used to store instructions and to provide working memory for processing that is needed during encoding or decoding. In other embodiments, however, a memory external to the processing device (for example, the processing device can be either the processor 110 or the encoder/decoder module 130) is used for one or more of these functions. The external memory can be the memory 120 and/or the storage device 140, for example, a dynamic volatile memory and/or a non-volatile flash memory. In several embodiments, an external non-volatile flash memory is used to store the operating system of, for example, a television. In at least one embodiment, a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations. The input to the elements of system 100 can be provided through various input devices as indicated in block 105. Such input devices include, but are not limited to, (i) a radio frequency (RF) portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Component (COMP) input terminal (or a set of COMP input terminals), (iii) a Universal Serial Bus (USB) input terminal, and/or (iv) a High Definition Multimedia Interface (HDMI) input terminal. Other examples, not shown in FIG. 1, include composite video. In various embodiments, the input devices of block 105 have associated respective input processing elements as known in the art.
For example, the RF portion may be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which may be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets. The RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers. The RF portion may include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband. In one set-top box embodiment, the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band. Various embodiments rearrange the order of the above-described (and other) elements, remove some of these elements, and/or add other elements performing similar or different functions. Adding elements may include inserting elements in between existing elements, for example, inserting amplifiers and an analog-to-digital converter. In various embodiments, the RF portion includes an antenna. Additionally, the USB and/or HDMI terminals may include respective interface processors for connecting system 100 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, may be implemented, for example, within a separate input processing IC or within processor 110 as necessary. Similarly, aspects of USB or HDMI interface processing may be implemented within separate interface ICs or within processor 110 as necessary. The demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 110, and encoder/decoder 130 operating in combination with the memory and storage elements to process the data-stream as necessary for presentation on an output device. Various elements of system 100 may be provided within an integrated housing. Within the integrated housing, the various elements may be interconnected and transmit data therebetween using suitable connection arrangement 115, for example, an internal bus as known in the art, including the Inter-IC (I2C) bus, wiring, and printed circuit boards. The system 100 includes communication interface 150 that enables communication with other devices via communication channel 190. The communication interface 150 may include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 190. The communication interface 150 may include, but is not limited to, a modem or network card and the communication channel 190 may be implemented, for example, within a wired and/or a wireless medium.
Data is streamed, or otherwise provided, to the system 100, in various embodiments, using a wireless network such as a Wi-Fi network, for example IEEE 802.11 (IEEE refers to the Institute of Electrical and Electronics Engineers). The Wi-Fi signal of these embodiments is received over the communications channel 190 and the communications interface 150 which are adapted for Wi-Fi communications. The communications channel 190 of these embodiments is typically connected to an access point or router that provides access to outside networks including the Internet for allowing streaming applications and other over-the-top communications. Other embodiments provide streamed data to the system 100 using a set-top box that delivers the data over the HDMI connection of the input block 105. Still other embodiments provide streamed data to the system 100 using the RF connection of the input block 105. As indicated above, various embodiments provide data in a non-streaming manner. Additionally, various embodiments use wireless networks other than Wi-Fi, for example a cellular network or a Bluetooth network. The system 100 can provide an output signal to various output devices, including a display 165, speakers 175, and other peripheral devices 185. The display 165 of various embodiments includes one or more of, for example, a touchscreen display, an organic light-emitting diode (OLED) display, a curved display, and/or a foldable display. The display 165 can be for a television, a tablet, a laptop, a cell phone (mobile phone), or other device. The display 165 can also be integrated with other components (for example, as in a smart phone), or separate (for example, an external monitor for a laptop). The other peripheral devices 185 include, in various examples of embodiments, one or more of a stand-alone digital video disc (or digital versatile disc) (DVD, for both terms) player, a disk player, a stereo system, and/or a lighting system. Various embodiments use one or more peripheral devices 185 that provide a function based on the output of the system 100. For example, a disk player performs the function of playing the output of the system 100. In various embodiments, control signals are communicated between the system 100 and the display 165, speakers 175, or other peripheral devices 185 using signaling such as AV.Link, Consumer Electronics Control (CEC), or other communications protocols that enable device-to-device control with or without user intervention. The output devices can be communicatively coupled to system 100 via dedicated connections through respective interfaces 160, 170, and 180. Alternatively, the output devices can be connected to system 100 using the communications channel 190 via the communications interface 150. The display 165 and speakers 175 can be integrated in a single unit with the other components of system 100 in an electronic device such as, for example, a television. In various embodiments, the display interface 160 includes a display driver, such as, for example, a timing controller (T Con) chip. The display 165 and speakers 175 may alternatively be separate from one or more of the other components, for example, if the RF portion of input 105 is part of a separate set-top box. In various embodiments in which the display 165 and speakers 175 are external components, the output signal may be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
The embodiments can be carried out by computer software implemented by the processor 110 or by hardware, or by a combination of hardware and software. As a non-limiting example, the embodiments can be implemented by one or more integrated circuits. The memory 120 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 110 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, digital signal processors (DSPs), processors based on a single or on a multi-core architecture, sequential or parallel architectures, specialized circuits such as Field Programmable Gate Arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry, as non-limiting examples. In the present application, the terms “reconstructed” and “decoded” may be used interchangeably, the terms “encoded” or “coded” may be used interchangeably, the terms “pixel” or “sample” may be used interchangeably, and the terms “image,” “picture” and “frame” may be used interchangeably. Usually, but not necessarily, the term “reconstructed” is used at the encoder side while “decoded” is used at the decoder side. FIG. 2 illustrates an embodiment of an end-to-end compression system wherein one or more embodiments described below can be implemented. The input x to the encoder part of the network can consist of an image or frame of a video, or a part of an image, or a tensor representing a group of images, or a tensor representing a part (crop) of a group of images. In each case, the input can have one or multiple components, e.g., monochrome, RGB or YCbCr components. In the example of FIG. 2, the input x has 3 components, each of size HxW. The input x is fed into the encoder network ga, also known as the analysis transform. The analysis transform ga is usually a sequence of convolutional layers (Conv in FIG. 2) with activation functions (Activation in FIG. 2). The convolutions can include a mechanism to spatially down-sample the input; for instance, selecting a convolution with a stride of 2 in both vertical and horizontal directions would result in an output having half the size of the input in both dimensions. The output of a convolution is a tensor of shape CxHxW, where H and W are the spatial height and width, respectively, and C corresponds to an adjustable number of channels. For an RGB image for instance, a first convolution of ga takes as input a tensor of 3 channels which correspond to the color components. This encoder network can be seen as a learned transform that is lossy, as there are generally fewer elements in the output latent tensor than in the source 3xHxW input. The output of the analysis, mostly in the form of a 3-way array, referred to as a 3-D tensor, is called a latent representation or a tensor of latent variables. From a broader perspective, a set of latent variables constructs a latent space, a term which is also frequently used in the context of neural network-based end-to-end compression. The output y = ga(x) is quantized, resulting in a tensor ŷ which is then entropy coded into a binary stream (bitstream) for storage or transmission. At the decoder, the bitstream is entropy decoded (ED) to obtain ŷ.
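As a non-limiting illustration, an analysis transform of this kind can be sketched as a stack of stride-2 convolutions. The following PyTorch sketch is illustrative only; the channel count, kernel sizes, number of stages, and the use of rounding as a stand-in for the quantizer Q are assumptions rather than elements of the described embodiments:

    # Minimal sketch of an analysis transform ga: each stride-2 convolution
    # halves H and W, so a 3xHxW image becomes a Cx(H/16)x(W/16) latent y.
    import torch
    import torch.nn as nn

    class AnalysisTransform(nn.Module):
        def __init__(self, C=192):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Conv2d(3, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                nn.Conv2d(C, C, kernel_size=5, stride=2, padding=2),
            )

        def forward(self, x):
            return self.layers(x)

    x = torch.randn(1, 3, 256, 256)  # input image tensor (N, 3, H, W)
    y = AnalysisTransform()(x)       # latent tensor, here of shape (1, 192, 16, 16)
    y_hat = torch.round(y)           # rounding as a simple stand-in for Q

The stride-2 convolutions are used here purely to illustrate the spatial down-sampling mechanism mentioned above; an actual codec would train such a network end to end together with the entropy model.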
The decoder network gs, also called the synthesis transform, generates the reconstructed input x̂ = gs(ŷ), which is an approximation of the original x, from the quantized latent representation ŷ. The synthesis transform gs is usually a sequence of up-sampling convolutions, e.g., transpose convolutions or convolutions followed by up-sampling filters. The decoder network can be seen as a learned inverse transform, or a denoising and generative transform. The performance of a compression system is measured as a tradeoff between the number of bits needed to transmit versus the quality of the decoded content. For one of these tradeoffs, a compression model can be trained using a loss following the Lagrangian form L = R + λD, where R represents the rate or bitrate and D the distortion of the decoded content. FIG. 2 and FIG. 3 show autoencoders in an actual inference configuration that consists of an encoder producing a bitstream that is then transmitted and decoded. During training, as entropy encoding/decoding is non-differentiable, the entropy of ŷ is determined with respect to a learned probability model p, as depicted in FIG. 3. As for the distortion, in existing approaches, such NNs are trained using several types of losses that can be used alone or in combination. A loss based on an “objective” metric, typically the Mean Squared Error (MSE), noted ‖x̂ − x‖₂² in FIG. 3, can be used, or for instance a loss based on structural similarity (SSIM). The perceived quality is then generally not as good as with the second type of loss, but the fidelity to the original signal (image) is higher. A loss based on a “subjective” (or subjective by proxy) metric can also be used, typically using Generative Adversarial Networks (GANs) during the training stage or an advanced visual metric via a proxy NN. In the autoencoder presented above, the entropy encoder and decoder rely on a simple fully factorized prior, as depicted in FIG. 2. This method usually considers separate trained entropy models per channel of the latent. The spatial correlations in the latent ŷ are not considered, as each of the samples is encoded using the same distribution, i.e., assuming that they are independent and identically distributed (i.i.d.). However, even after the processing by ga, ŷ is not i.i.d., and more recent approaches have taken on this specific issue. FIG. 4 depicts an approach called auto-encoder with a hyperprior, which is further described in [Ballé] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” ArXiv180201436 Cs Eess Math, May 2018, Accessed: Aug. 25, 2020. [Online]. Available: http://arxiv.org/abs/1802.01436, as the model now includes additional convolutional sequences ha and hs that output learned distribution parameters for each element of the latent ŷ, via a latent z and its quantized version ẑ. For example, the learned distribution parameters are the scales, or the means and scales, of Gaussian or Laplace distributions, for each element of the latent ŷ. The tensor z output by ha needs to be encoded and transmitted as side information for the decoder to decode ŷ. The tensor z is thus quantized to the tensor ẑ and entropy encoded (EE). On the decoder side, the bitstream representing the latent ẑ of the quantized learned parameters is entropy decoded (ED) and fed into the synthesis transform of the distribution information hs. Transmitting the tensor ẑ using a fully factorized approach does not cost much overhead, as ẑ corresponds to y further downscaled by ha, and the efficiency of using tailored Gaussians for each element of ŷ dramatically surpasses the burden of transmitting the light ẑ. Note that in FIG. 4, which represents all the operations and elements necessary for the actual coding and decoding of an image, the synthesis of the distribution information hs appears at both the encoder and the decoder. Like in traditional video coding, the encoder contains parts of the decoder to generate the exact same metadata that the decoder will decode to process the rest of the bitstream. The synthesis hs must perform bit-exact operations between the encoder and the decoder for the system to work. A slight difference in the generated parameters would completely crash the arithmetic decoder for ŷ. For example, an autoregressive context model is described in [Minnen] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in Advances in Neural Information Processing Systems, 2018, pp. 10771–10780, for estimating the entropy parameters μ, σ (mu, sigma) of a Gaussian distribution representing the estimated probability distribution from which the currently-being-decoded Cx1x1 fiber of the tensor is drawn. An example of an auto-encoder with a hyperprior architecture and context prediction is illustrated in FIG. 5. The tensor ŷ is encoded or decoded in raster scan order from top-left to bottom-right along the image axes (H, W). At a given (x, y) location, a “context prediction” (cp in FIG. 5) is performed by applying a masked convolution kernel. An example of a Cx5x5 kernel is shown at the bottom left of FIG. 5. In the example of FIG. 5, a context is obtained for the black center location using the grey samples of the tensor. The context is output for the location (x, y) of the tensor ŷ as a Cx1x1 vector that is, for example, a weighted combination of the causal samples covered by the kernel. This vector is then merged/concatenated (cat in FIG. 5) with the side-information ψ = hs(Q(ha(y))) for that (x, y) tensor location, and then fed into the “entropy parameters” (ep) network which outputs the tensor of entropy parameters μ, σ for location (x, y). An example of the entropy parameters network is illustrated on FIG. 5 as a sequence of 1x1 convolution layers and activation layers. The entropy parameters network transforms the input 3xCx1x1 vector resulting from the concatenation of the context with the hyperprior into a 2xCx1x1 vector comprising μ, σ for the location (x, y) of the tensor. The same operations for determining the entropy parameters μ, σ for each location (x, y) of the tensor are performed on the encoder and decoder parts. In [Minnen], the context prediction is obtained using spatial causal neighbors in a same channel. Basic autoregressive methods introduce sequential computations which break with the rest of the end-to-end NN-based codecs that rely on highly parallelizable operations. In the present document, some embodiments provide for improving the efficiency of the autoregressive context modelling, while designing an architecture that decreases the computational complexity. In [Ma] C. Ma, Z. Wang, R. Liao, and Y. Ye, “A Cross Channel Context Model for Latents in Deep Image Compression,” ArXiv210302884 Cs Eess, Mar. 2021, Accessed: Apr. 28, 2021. [Online]. Available: http://arxiv.org/abs/2103.02884, Ma et al. introduced a context model where previously decoded “groups” of tensor channels can be used to predict future groups. The idea in this scheme is that previously decoded “groups” of tensor channels are correlated with groups to decode. However, this correlation is rather loose; there is no strict enforcement of correlation between the groups other than the fact that they are all derived independently from what are effectively disjoint subspaces of a same source. In contrast, in the embodiments provided herein, “levels” of an input tensor to encode or decode are successively directly derived from previous “levels” of the tensor. The levels are defined as representations of the tensor to encode or to reconstruct at different resolutions. This enforces a more direct correlation between the groups/levels. Some embodiments described further below provide improvements on the context models introduced in [Ballé] or [Minnen]. Some embodiments described below utilize context kernels at different resolutions to capture short and longer spatial correlations for efficient modelling.
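As a non-limiting illustration, the masked convolution and the 1x1-convolution entropy parameters network described above can be sketched as follows in PyTorch; the channel sizes and layer counts are assumptions, and a real system would train these modules end to end:

    # Sketch of a causal "L"-shaped masked 5x5 convolution (context prediction,
    # cp) and an entropy parameters network (ep) mapping the concatenated
    # context and hyperprior features to (mu, sigma), as in the
    # 3xCx1x1 -> 2xCx1x1 mapping described above.
    import torch
    import torch.nn as nn

    class MaskedConv2d(nn.Conv2d):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            kH, kW = self.kernel_size
            mask = torch.zeros_like(self.weight)  # (out_ch, in_ch, kH, kW)
            mask[:, :, :kH // 2, :] = 1           # all rows strictly above center
            mask[:, :, kH // 2, :kW // 2] = 1     # same row, left of the center
            self.register_buffer("mask", mask)

        def forward(self, x):
            self.weight.data *= self.mask         # zero out non-causal taps
            return super().forward(x)

    C = 192
    cp = MaskedConv2d(C, C, kernel_size=5, padding=2)  # context prediction
    ep = nn.Sequential(                                # entropy parameters network
        nn.Conv2d(3 * C, 2 * C, kernel_size=1), nn.ReLU(),
        nn.Conv2d(2 * C, 2 * C, kernel_size=1),
    )

    y_hat = torch.round(torch.randn(1, C, 16, 16))  # quantized latent
    psi = torch.randn(1, 2 * C, 16, 16)             # hyperprior side-information
    mu, sigma = ep(torch.cat([cp(y_hat), psi], dim=1)).chunk(2, dim=1)

Because the same masked kernel is applied at both the encoder and the decoder, only already-decoded samples ever contribute to the context.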
According to an aspect of the present disclosure, a method for obtaining a context at different resolutions is provided. FIG. 6 illustrates an example of a method for predicting a context at different resolutions of a tensor, according to an embodiment. An input tensor y is downscaled (601, 602) at different spatial resolutions (y0, y1 and y2 on FIG. 6, with y0 being the input tensor at the input resolution). For a location (x, y) of a tensor, and for each level y0, y1 and y2, a context prediction, respectively CP00, CP01 and CP02, is applied to the tensor of that level. Each context prediction outputs a Cx1x1 tensor representing the context for the location (x, y) at the considered level, C being the number of channels of the input tensor. If the level is a down-scaled level, the output of the context prediction is upscaled (603, 604) to the next level and combined (605, 606) with the next level, up to the input resolution, to provide the final context (ctx on FIG. 6) to be used for entropy encoding the location (x, y) of the input tensor y. In the example of FIG. 6, three levels are defined: the output of the context prediction CP02 is upscaled (603) to the next upper resolution and combined (605) with the output of the context prediction CP01. The result of the combination is then upscaled (604) to the input resolution and combined (606) with the output of the context prediction CP00. In another variant, each output of a context prediction can be upscaled directly to the input resolution and combined with the output of the context prediction CP00. The same context prediction is done at the encoder and decoder sides so that the stream is correctly entropy decoded. On FIG. 6, three levels have been defined, but the mechanism for context prediction can be applied to more or fewer levels, for instance only two levels, or more than three levels. In some embodiments, when the downscaled tensors are not transmitted to the decoder, only the causal parts of the downscaled tensor with respect to the location (x, y) currently being encoded/decoded can be used for determining the context prediction, for instance using the kernel illustrated on FIG. 5. In other embodiments, the context predictions performed at the lower levels can take advantage of the full kernels. In these embodiments, the downscaled tensors are entropy encoded and transmitted to the decoder; in that case, the full decoded downscaled tensors can be used in the context prediction for entropy decoding a tensor of a higher level. In these embodiments, at the decoder side, there is no need to downscale the tensor y as the downscaled tensors are entropy decoded by the decoder and thus available to the decoder. FIG. 7 illustrates an embodiment in which the context prediction determined using multi-resolution tensors as described in FIG. 6 is used for determining the entropy parameters used for entropy encoding or decoding an input tensor y. The input latent tensor y is downscaled to tensors of k+1 different levels y0, y1, …, yk, where the spatial resolutions of these tensors are (non-strictly) monotonically decreasing, and y = y0. In the embodiment of FIG. 7, these tensors are compressed, serialized, and transmitted in reverse order yk, ..., y0 in pairs with compressed and serialized representations of their side information derived tensors zk, …, z0, and they are decoded in this same order. At each level, the entropy parameters μl, σl for the current level tensor ŷl may be predicted from any or all of (i) a causal context of the current level tensor ŷl, (ii) the full context of any previously decoded level tensors ŷl+1, ..., ŷk, and (iii) the side information derived tensor ψl = hs(Q(zl)) = hs(Q(ha(yl))) associated with the current level l. FIG. 7 visualizes an example of an architecture with k + 1 = 3 levels that uses all three aforementioned sources of available information. For j ∈ {l, ..., k}, each yj is fed into a “context prediction” network CPlj. The resulting outputs are then upscaled to match the spatial dimensions, and then aggregated via a reduction operation (visualized by “+”) such as summation or concatenation. The aggregated result is fed into an entropy parameters network EPl which also takes the side-information ψl as input, and outputs the entropy parameters μl, σl. These entropy parameters represent the Gaussian distributions ŷl ~ N(μl, σl) with which yl is entropy coded to produce a bitstream with rate Ryl, along with the bitstream with rate Rzl associated with the side information tensor zl.
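A non-limiting sketch of this multi-level aggregation for k + 1 = 3 levels is given below, reusing the MaskedConv2d class sketched earlier; the average-pooling downscaling, nearest-neighbour upsampling, and summation as the reduction operation are illustrative assumptions:

    # Sketch of multi-level context aggregation for decoding level 0: a causal
    # masked convolution on the current level (CP00), unmasked convolutions on
    # the already-decoded levels 1 and 2 (CP01, CP02), upsampling back to the
    # level-0 grid, and summation as the reduction operation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    C = 192
    cp00 = MaskedConv2d(C, C, kernel_size=5, padding=2)  # causal, current level
    cp01 = nn.Conv2d(C, C, kernel_size=5, padding=2)     # full kernel, level 1
    cp02 = nn.Conv2d(C, C, kernel_size=5, padding=2)     # full kernel, level 2

    def aggregate_context(y0_hat, y1_hat, y2_hat):
        # Levels 1 and 2 are fully decoded, so their entire kernels can be used.
        c2 = F.interpolate(cp02(y2_hat), scale_factor=4, mode="nearest")
        c1 = F.interpolate(cp01(y1_hat), scale_factor=2, mode="nearest")
        c0 = cp00(y0_hat)                 # only causal samples of level 0
        return c0 + c1 + c2               # reduction by summation

    y0 = torch.round(torch.randn(1, C, 16, 16))
    y1 = F.avg_pool2d(y0, 2)              # stand-in for the learned downscaling
    y2 = F.avg_pool2d(y1, 2)
    ctx = aggregate_context(y0, y1, y2)   # aggregated context, (1, C, 16, 16)

The aggregated context would then be concatenated with the side-information ψ0 and fed to the entropy parameters network EP0, as in the single-level case.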
FIG. 8 is a diagram visualizing the effective mask positions for decoding a particular pixel of y0 in the embodiment of FIG. 7. In order to maintain causality, CP00 must be masked. Since, in this embodiment, the other levels will already have been decoded, they require no further masking, and the full kernel can be used for determining the context at these levels for a location of the input tensor y0. In the embodiment of FIG. 7, in order to maintain causality, a constraint is imposed that if j = l, then CPlj must not use elements of ŷl that have not yet been decoded. If ŷl is being decoded in spatial raster scan order (e.g., top-left to bottom-right with horizontal scan lines), then, as shown in [Minnen], including an initial masked convolution of “L” shape within CPlj is sufficient to ensure causality. Furthermore, note that for j > l, no causality-induced constraints on CPlj are required, and it may instead initially use a standard unmasked convolution. FIG. 8 visualizes the appropriately masked context windows used during reconstruction of ŷ0. Some embodiments described herein relate to methods and apparatuses for encoding or decoding a tensor using context prediction obtained from different resolutions of the tensor. In some embodiments, the tensor can be a latent representative of the at least one part of an image, an image, or a video, for instance generated using an auto-encoder as illustrated in FIG. 2-5. FIG. 9 illustrates an example of a method 900 for encoding a latent representative of at least one part of an image according to an embodiment. The method comprises at least the following steps for at least one sample of a first tensor representative of the at least one part of the image. At 901, a first context is obtained from at least one or more samples of at least one second tensor, the at least one second tensor being obtained from at least one down-sampling of the first tensor. For instance, the first context is obtained as described above; the at least one second context is then up-sampled to an upper resolution, for instance the resolution of the first tensor or the resolution of a next tensor level. In a variant, the first context can also be determined using one or more causal samples of the first tensor, that is, samples at the resolution of the input tensor. The context obtained from the one or more causal samples of the first tensor is combined with the up-sampled second context. Combining can be addition or aggregation of the contexts. At 902, at least one entropy parameter is determined based on the first context. In some variants, the determination of the entropy parameters can also use the side-information derived for the first tensor. For instance, the side-information is a hyperprior learned for the first tensor. In this variant, as illustrated in FIG. 4 and FIG. 5 for example, the method further comprises encoding the hyperprior learned for the first tensor. At 903, the first tensor is entropy encoded using the determined at least one entropy parameter. In some variants, the at least one second tensor is also encoded so that all kernel samples can be used when determining the context prediction based on its level. In some embodiments, the entropy encoding of the at least one second tensor uses the same mechanisms of context prediction as for the input tensor in a recursive manner, as described with FIG. 7 for example.
Correlatively, FIG. 10 illustrates an example of a method 1000 for decoding the latent representative of the at least one part of the image according to an embodiment. The method comprises at least the following steps for at least one sample of a first tensor representative of the at least one part of the image. At 1001, a first context is obtained from at least one or more samples of at least one second tensor, the at least one second tensor being representative of the first tensor at a lower resolution. At 1002, at least one entropy parameter is determined based on the first context. For correct entropy decoding of the first tensor, the first context and entropy parameters are obtained in a similar manner as on the encoding side. At 1003, the first tensor is entropy decoded using the determined at least one entropy parameter. In some variants, the lower resolutions of the first tensor are entropy decoded before determining the first context. In some embodiments, the above methods are used in a neural-network-based encoder and decoder respectively, such as an auto-encoder. Embodiments described herein provide for recursive level decoding. Similar to the recursive multi-level strategy used with wavelet transform-based compression, the pyramidal approach described herein allows each level of the pyramid to refer to a smaller version of itself while decoding. This can lead to savings in rate since a “lower resolution” version of a level is often a very good predictor of the “higher resolution” elements within the current level. Described embodiments also provide a larger available spatial context window during reconstruction. Information about a larger “global” context can be used to improve the prediction of the entropy parameters μl, σl to better match ŷl, thus producing savings in rate. In some embodiments, “future” context is fully available at a lower resolution. In contrast with single-level masked autoregression, the information about “future” decoded pixels is already available at a lower resolution. For instance, in single-level raster scan autoregression, the context is limited to anything spatially above or to the left of the currently being decoded pixel. On the other hand, some embodiments of the method described herein also provide information about “future” decoded pixels spatially below or to the right of the currently being decoded pixel. This can help improve the prediction of the entropy parameters μl, σl to better match ŷl, thus producing savings in rate.
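As a non-limiting sketch of how “future” lower-resolution context can enter a sequential reconstruction, the loop below decodes level 0 in raster scan order; entropy_decode_sample is a pure placeholder for an arithmetic decoder, and the context is naively recomputed at every position for clarity rather than efficiency:

    # Sketch of a raster-scan reconstruction loop for level 0. The
    # lower-resolution level 1 is already fully decoded, so its context also
    # covers positions below and to the right of the sample being decoded.
    import torch
    import torch.nn.functional as F

    def entropy_decode_sample(mu, sigma):
        # Placeholder only: a real arithmetic decoder would read bits from the
        # bitstream and invert the Gaussian model N(mu, sigma).
        return torch.round(mu)

    def decode_level0(y1_hat, cp00, cp01, ep0, psi0, H, W, C):
        # psi0: side-information with 2*C channels at the level-0 resolution.
        y0_hat = torch.zeros(1, C, H, W)
        c1 = F.interpolate(cp01(y1_hat), scale_factor=2, mode="nearest")
        for i in range(H):                 # raster scan, top-left to bottom-right
            for j in range(W):
                c0 = cp00(y0_hat)          # the mask keeps only decoded samples
                params = ep0(torch.cat([c0 + c1, psi0], dim=1))
                mu, sigma = params.chunk(2, dim=1)
                y0_hat[:, :, i, j] = entropy_decode_sample(
                    mu[:, :, i, j], sigma[:, :, i, j])
        return y0_hat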
Described embodiments provide potentially similar RD performance as heavier methods, at a lower computational cost. As described in the variants, one may choose to use any autoregression method for decoding a given level. For instance, by choosing a lighter method of autoregression (i.e., something other than raster scan autoregression) in the largest level, one can save significantly on the number of iterations of sequential, self-referential operations which are difficult to parallelize. Furthermore, the context captured is quite similar to that of the costlier autoregression methods, except perhaps for high-resolution moiré-like patterns within the tensor, which are fairly rare. In some variants of the method, any or no form of self-autoregression can be used for the reconstruction of ŷl. Any level ŷl may be reconstructed all at once, with raster scan autoregression, cross-channel context modeling (for example as in [Ma]), checkerboard autoregression as described in He et al.: D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, “ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding,” arXiv, Mar. 29, 2022, doi: 10.48550/arXiv.2203.10886, or any other such method. Various levels need not use the same reconstruction method. In some variants, a gating function fl can be used, with yl = fl(yl−1). One may include a trainable masking operation such as a convolution (or multiplication by a sigmoid-activated attention map) in order to allow the model to selectively choose which parts are important at various levels. This is visualized in FIG. 11 with the example of convolution operations (Conv on FIG. 11) that are applied before the down-sampling operations. If any part of the tensor is deemed too costly to reconstruct precisely at any given level, the gating function may output a constant for that portion of the tensor. It is to be noted that the context prediction module CPlj needs to be able to take advantage of this. If the gating function is a convolution, then one natural choice for the gating function is a channel attention mechanism, though a regular convolution may also provide its own advantages. In a variant, the embodiments described herein can be used for progressive decoding. Each level ŷl may be associated with a distortion loss term D(x, x̂l), where x̂l = gs(ŷl) is the corresponding lower resolution image that is decoded from ŷl. FIG. 12 illustrates a block diagram of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to an embodiment. In FIG. 12, only two levels of resolution (level 0 and level 1) are shown, but more levels can also be used. The modules of the auto-encoder, such as the analysis function ga, the synthesis function gs, the hyperprior networks ha0, hs0, ha1, hs1, EE, Q, and ED, have functions similar to those of the corresponding modules of FIG. 4 or 5. On FIG. 12, once the input tensor x is transformed to the latent y, the latent y is downscaled to a spatial resolution lower than the spatial resolution of the latent y. Both the latent y and its downscaled version are quantized to produce quantized latents ŷ0 and ŷ1 for levels 0 and 1 respectively. For level 1, in the embodiment of FIG. 12, a hyperprior network is used for learning the side-information ψ1 for encoding the latent ŷ1. A quantized latent ẑ1 representative of the side-information ψ1 is encoded, for example using a fully factorized prior model. For each spatial location (x, y) of the latent ŷ1, context prediction cp11 is applied using causal samples of the latent ŷ1. For instance, in FIG. 12, the location (x, y) is illustrated as the central black sample in the 5x5 kernel, and the samples that are available (here the causal samples) are illustrated by the grey samples. The context output by the context prediction module cp11 is concatenated with the side-information ψ1. The resulting concatenated vector is fed to the entropy parameter network ep1 to output the entropy parameters μ1, σ1 used for entropy encoding the location (x, y) of ŷ1. On the decoder side, the latent ŷ1 is reconstructed by entropy decoding the hyperprior latent ẑ1 and transforming the hyperprior latent ẑ1 to the distribution parameters ψ1. A spatial location (x, y) of the latent ŷ1 is entropy decoded using the context output by the context prediction module cp11, which considers only the causal samples of ŷ1 (that is, the samples that have already been decoded). In a similar manner as on the encoding side, the context is concatenated with the reconstructed side-information ψ1 and fed to the entropy parameter network ep1 to output the entropy parameters μ1, σ1 used for entropy decoding the location (x, y) of ŷ1. For level 0, in the embodiment of FIG. 12, a hyperprior network is used for learning the side-information ψ0 for encoding the latent ŷ0. A quantized latent ẑ0 representative of the side-information ψ0 is encoded, for example using a fully factorized prior model. For each spatial location (x, y) of the latent ŷ0, context prediction cp00 is determined using causal samples of the latent ŷ0 (using the grey samples in the kernel illustrated on FIG. 12). Context prediction cp01 is also determined using full samples of the latent ŷ1 (as illustrated on FIG. 12, all samples of the kernel used are shown in grey, as it is considered that at the decoder side ŷ1 is already reconstructed). The context output by the context prediction module cp01 is upsampled to the spatial resolution of the latent ŷ0 and combined/added with the context output by the context prediction module cp00. The resulting context is concatenated with the side-information ψ0. The resulting concatenated vector is fed to the entropy parameter network ep0 to output the entropy parameters μ0, σ0 used for entropy encoding the location (x, y) of ŷ0. On the decoder side, the latent ŷ0 is reconstructed by entropy decoding the hyperprior latent ẑ0 and transforming the hyperprior latent ẑ0 to the distribution parameters ψ0. A spatial location (x, y) of the latent ŷ0 is entropy decoded using the entropy parameters μ0, σ0 that are determined in a similar manner as on the encoder side. Context prediction cp00 is determined using causal samples of the latent ŷ0. Context prediction cp01 is also determined using full samples of the decoded latent ŷ1. The context output by the context prediction module cp01 is upsampled to the spatial resolution of the latent ŷ0 and combined/added with the context output by the context prediction module cp00. The resulting context is concatenated with the side-information ψ0. The resulting concatenated vector is fed to the entropy parameter network ep0. Once the latent ŷ0 is reconstructed, it can be fed to the synthesis function gs to reconstruct the input tensor x̂, for instance an image, a video, or a part of an image or of a video. FIG. 13 illustrates a block diagram of an embodiment of an auto-encoder with a hyperprior architecture and context prediction at different resolutions, according to another embodiment. The embodiment illustrated on FIG. 13 differs from the embodiment of FIG. 12 in that the side-information ψ1 used for entropy encoding the tensor of the lower level (ŷ1) is not learned but derived from the side information ψ0 learned for the tensor at the input resolution ŷ0. For that, on FIG. 13, the side information ψ1 is obtained by downsampling the side-information ψ0. Then, as in FIG. 12, the context determined by the context prediction module cp11 is concatenated with the derived side-information ψ1 and fed into the entropy parameter network ep1 for entropy encoding the tensor ŷ1. The other parts of the scheme illustrated in FIG. 13 are similar to the corresponding modules of FIG. 12. Thus, in the embodiment of FIG. 13, there is no need to encode the hyperprior latent ẑ1 as, on the decoder side, the side-information ψ1 is derived by downsampling the reconstructed side information ψ0.
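A non-limiting sketch of the FIG. 13 variant is given below; the hyper networks h_a and h_s are assumed to be trained modules, and average pooling stands in for the unspecified downsampling of the side-information:

    # Sketch of deriving the level-1 side-information psi1 from psi0 (FIG. 13):
    # only the level-0 hyperprior latent z0 is quantized and entropy coded, and
    # psi1 is obtained at both encoder and decoder by downsampling psi0.
    import torch
    import torch.nn.functional as F

    def derive_side_information(y0, h_a, h_s):
        z0 = h_a(y0)                  # hyper-analysis of the full-resolution latent
        z0_hat = torch.round(z0)      # quantization; z0_hat is entropy coded
        psi0 = h_s(z0_hat)            # side-information for level 0
        psi1 = F.avg_pool2d(psi0, 2)  # derived side-information for level 1,
        return psi0, psi1             # no extra hyperprior to transmit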
FIG. 14 shows one embodiment of an apparatus 1400 for compressing, encoding or decoding an image or a video using the aforementioned methods. The apparatus comprises a processor 1410 and can be interconnected to a memory 1420 through at least one port. Both the processor 1410 and the memory 1420 can also have one or more additional interconnections to external connections. The processor 1410 is also configured to either insert or receive information in a bitstream, and to either compress, encode, or decode using program code instructions implementing the aforementioned methods when executed by a processor. The program code to be loaded onto processor 1410 to perform the various aspects described in this application may be stored in a storage device and subsequently loaded onto memory 1420 for execution by processor 1410. The memory 1420 can be of any type appropriate to the technical environment and can be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory, and removable memory, as non-limiting examples. The processor 1410 can be of any type appropriate to the technical environment, and can encompass one or more of microprocessors, general purpose computers, special purpose computers, digital signal processors (DSPs), processors based on a single core architecture or on a multi-core architecture, sequential or parallel architectures, specialized circuits such as Field Programmable Gate Arrays (FPGA), application specific circuits (ASIC), signal processing devices and other processing circuitry, as non-limiting examples. According to an example of the present principles, illustrated in FIG. 15, in a transmission context between two remote devices A and B over a communication network NET, the device A comprises a processor in relation with memory RAM and ROM which are configured to implement a method for encoding an image or a video as described using the aforementioned methods, and the device B comprises a processor in relation with memory RAM and ROM which are configured to implement a method for decoding an image or a video as described using the aforementioned methods. In accordance with an example, the network is a broadcast network, adapted to broadcast/transmit the encoded image or video from device A to decoding devices including the device B. A signal, intended to be transmitted by the device A, carries at least one bitstream comprising coded data representative of a tensor encoding an image or a video according to the methods explained above. In some embodiments, the signal also comprises one or more of the following: coded data representative of one or more downscaled versions of the tensor, coded data representative of side information learned for the tensor, or coded data representative of side information derived for the one or more downscaled versions of the tensor. FIG. 16 shows an example of the syntax of such a signal when the coded data is transmitted over a packet-based transmission protocol. Each transmitted packet P comprises a header H and a payload PAYLOAD. According to an embodiment, the payload comprises coded data encoded according to any one of the embodiments described above. Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method.
Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., such as, for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding. Various methods and other aspects described in this application can be used as additional modules or to modify modules of an image or video auto-encoder which is based on neural networks, as shown in FIG. 2-5. Unless indicated otherwise, or technically precluded, the aspects described in this application can be used individually or in combination. Various numeric values are used in the present application. The specific values are for example purposes and the aspects described are not limited to these specific values. Various implementations involve decoding. “Decoding”, as used in this application, can encompass all or part of the processes performed, for example, on a received encoded sequence in order to produce a final output suitable for display. In various embodiments, such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, inverse transformation, and differential decoding. In various embodiments, such processes also, or alternatively, include processes performed by a decoder of various implementations described in this application. As further examples, in one embodiment “decoding” refers only to entropy decoding, in another embodiment “decoding” refers only to differential decoding, and in another embodiment “decoding” refers to a combination of entropy decoding and differential decoding. Whether the phrase “decoding process” is intended to refer specifically to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art. Various implementations involve encoding. In an analogous way to the above discussion about “decoding”, “encoding” as used in this application can encompass all or part of the processes performed, for example, on an input video sequence in order to produce an encoded bitstream. In various embodiments, such processes include one or more of the processes typically performed by an encoder, for example, partitioning, differential encoding, transformation, quantization, and entropy encoding. In various embodiments, such processes also, or alternatively, include processes performed by an encoder of various implementations described in this application. As further examples, in one embodiment “encoding” refers only to entropy encoding, in another embodiment “encoding” refers only to differential encoding, and in another embodiment “encoding” refers to a combination of differential encoding and entropy encoding. Whether the phrase “encoding process” is intended to refer specifically to a subset of operations or generally to the broader encoding process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art.
Note that the syntax elements as used herein are descriptive terms. As such, they do not preclude the use of other syntax element names.

This disclosure has described various pieces of information, such as for example syntax, that can be transmitted or stored. This information can be packaged or arranged in a variety of manners, including for example manners common in video standards, such as putting the information into an SPS, a PPS, a NAL unit, a header (for example, a NAL unit header, or a slice header), or an SEI message. Other manners are also available, including for example manners common for system level or application level standards, such as putting the information into one or more of the following:

a. SDP (Session Description Protocol), a format for describing multimedia communication sessions for the purposes of session announcement and session invitation, for example as described in RFCs and used in conjunction with RTP (Real-time Transport Protocol) transmission.

b. DASH MPD (Media Presentation Description) descriptors, for example as used in DASH and transmitted over HTTP; a descriptor is associated with a Representation or collection of Representations to provide additional characteristics to the content Representation.

c. RTP header extensions, for example as used during RTP streaming.

d. ISO Base Media File Format, for example as used in OMAF, using boxes which are object-oriented building blocks defined by a unique type identifier and length, also known as 'atoms' in some specifications.

e. HLS (HTTP Live Streaming) manifests transmitted over HTTP. A manifest can be associated, for example, with a version or collection of versions of a content to provide characteristics of the version or collection of versions.

When a figure is presented as a flow diagram, it should be understood that it also provides a block diagram of a corresponding apparatus. Similarly, when a figure is presented as a block diagram, it should be understood that it also provides a flow diagram of a corresponding method/process.

The implementations and aspects described herein can be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed can also be implemented in other forms (for example, an apparatus or program). An apparatus can be implemented in, for example, appropriate hardware, software, and firmware. The methods can be implemented in, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.

Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment.
Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.

Additionally, this application may refer to “determining” various pieces of information. Determining the information can include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.

Further, this application may refer to “accessing” various pieces of information. Accessing the information can include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.

Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information can include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.

It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.

Also, as used herein, the word “signal” refers to, among other things, indicating something to a corresponding decoder. In this way, in an embodiment the same parameter is used at both the encoder side and the decoder side. Thus, for example, an encoder can transmit (explicit signaling) a particular parameter to the decoder so that the decoder can use the same particular parameter. Conversely, if the decoder already has the particular parameter as well as others, then signaling can be used without transmitting (implicit signaling) to simply allow the decoder to know and select the particular parameter. By avoiding transmission of any actual functions, a bit savings is realized in various embodiments. It is to be appreciated that signaling can be accomplished in a variety of ways.
For example, one or more syntax elements, flags, and so forth are used to signal information to a corresponding decoder in various embodiments. While the preceding relates to the verb form of the word “signal”, the word “signal” can also be used herein as a noun.

As will be evident to one of ordinary skill in the art, implementations can produce a variety of signals formatted to carry information that can be, for example, stored or transmitted. The information can include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal can be formatted to carry the bitstream of a described embodiment. Such a signal can be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting can include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries can be, for example, analog or digital information. The signal can be transmitted over a variety of different wired or wireless links, as is known. The signal can be stored on a processor-readable medium.

We describe a number of embodiments. Features of these embodiments can be provided alone or in any combination, across various claim categories and types. Further, embodiments can include one or more of the following features, devices, or aspects, alone or in any combination, across various claim categories and types.
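For illustration only, the following is a minimal sketch, in PyTorch-style Python, of one possible reading of the multi-level context model of the embodiments above: a causal context of the first tensor is combined with an upsampled context of a lower-resolution second tensor, concatenated with a hyperprior, and mapped by convolution layers to Gaussian entropy parameters. The module names, channel count, single half-resolution level, additive combination of contexts, and softplus activation are assumptions made for this sketch, not the claimed implementation.

# Hedged sketch of a two-level context model for learned image compression.
# Assumptions (not from this application): PyTorch, 192 channels, one second
# tensor at half resolution, a hyperprior tensor with the same channel count,
# and Gaussian entropy parameters (mu, sigma).
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Type-A masked convolution: each output position sees only causal
    (previously decoded) samples of its input."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        mask = torch.zeros_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, : kh // 2, :] = 1
        mask[:, :, kh // 2, : kw // 2] = 1
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

class TwoLevelContextModel(nn.Module):
    def __init__(self, ch=192):
        super().__init__()
        # Third context: causal samples of the first tensor.
        self.causal_ctx = MaskedConv2d(ch, ch, kernel_size=5, padding=2)
        # Second context: samples of the already-decoded second tensor.
        self.low_ctx = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        # Upsampling of the second context to the first tensor's resolution.
        self.up = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)
        # Entropy-parameter network over the concatenated hyperprior and
        # first context (a network comprising convolution layers).
        self.param_net = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(2 * ch, 2 * ch, kernel_size=1),
        )

    def forward(self, y_hat, y_low_hat, hyperprior):
        ctx3 = self.causal_ctx(y_hat)             # third context (causal)
        ctx2 = self.up(self.low_ctx(y_low_hat))   # upsampled second context
        first_ctx = ctx3 + ctx2                   # combined first context
        params = self.param_net(torch.cat([hyperprior, first_ctx], dim=1))
        mu, sigma = params.chunk(2, dim=1)        # Gaussian entropy parameters
        return mu, nn.functional.softplus(sigma)  # keep the scale positive

Under the same assumptions, the second tensor would be entropy decoded first (for example with its own causal context and/or its own hyperprior), after which the samples of the first tensor can be entropy decoded using the returned parameters; the optional masking of the first tensor before down-sampling could, for instance, be a sigmoid-activated attention map, y * torch.sigmoid(att(y)).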

Claims

CLAIMS

1. A method comprising, for at least one sample of a first tensor representative of at least one part of an image: obtaining a first context from one or more samples of at least one second tensor, the at least one second tensor being obtained from at least one down-sampling of the first tensor; determining at least one entropy parameter based on the first context; and entropy encoding the first tensor using the determined at least one entropy parameter.
2. The method of claim 1, wherein obtaining the first context comprises: determining at least one second context from the one or more samples of the at least one second tensor; and upsampling the at least one second context to a resolution of the first tensor.
3. The method of claim 2, wherein obtaining the first context is further based on a third context determined from one or more causal samples of the first tensor.
4. The method of claim 3, wherein obtaining the first context comprises combining the third context and the upsampled at least one second context.
5. The method of claim 1, wherein determining the at least one entropy parameter is further based on side information derived for the first tensor.
6. The method of claim 5, wherein the side information derived for the first tensor is a first hyperprior learned for the first tensor, the method further comprising encoding the first hyperprior.
7. The method of claim 6, wherein determining the at least one entropy parameter comprises concatenating the first hyperprior and the first context.
8. The method of claim 7, wherein determining the at least one entropy parameter is based on a network comprising at least one convolution layer.
9. The method of claim 1, wherein the method further comprises entropy encoding the at least one second tensor.
10. The method of claim 9, wherein entropy encoding the at least one second tensor is based on at least one entropy parameter determined using a context obtained for at least one sample of the at least one second tensor based on one or more causal samples of the at least one second tensor.
11. The method of claim 9, wherein entropy encoding the at least one second tensor is based on at least one entropy parameter determined from side information derived for the at least one second tensor.
12. The method of claim 11, wherein the side information derived for the at least one second tensor is a second hyperprior derived for the at least one second tensor, the method further comprising encoding the second hyperprior.
13. The method of claim 11, wherein the side information derived for the at least one second tensor is a down-sampled hyperprior derived for the first tensor.
14. The method of claim 1, wherein the at least one entropy parameter is representative of a Gaussian distribution of the at least one sample of the first tensor.
15. The method of claim 1, wherein the at least one second tensor is obtained using a masking operation applied to the first tensor before the at least one down-sampling of the first tensor.
16. The method of claim 15, wherein the masking operation is one of a convolution or a multiplication by a sigmoid-activated attention map.
17. The method of claim 1, wherein the first tensor is obtained from an encoding of the at least one part of the image using a neural network.
18. The method of claim 17, wherein the neural network is an auto-encoder.
19. An apparatus comprising one or more processors configured to entropy encode a first tensor representative of at least one part of an image according to the method of claim 1.
20. A method comprising reconstructing a first tensor representative of at least one part of an image, the method comprising, for at least one sample of the first tensor: obtaining a first context from one or more samples of at least one second tensor, the at least one second tensor being representative of the first tensor at a lower resolution; determining at least one entropy parameter based on the first context; and entropy decoding the first tensor using the determined at least one entropy parameter.
21. The method of claim 20, wherein obtaining the first context comprises: determining at least one second context from the one or more samples of the at least one second tensor; and upsampling the at least one second context to a resolution of the first tensor.
22. The method of claim 21, wherein obtaining the first context is further based on a third context determined from one or more causal samples of the first tensor.
23. The method of claim 22, wherein obtaining the first context comprises combining the third context and the upsampled at least one second context.
24. The method of claim 20, wherein determining the at least one entropy parameter is further based on side information derived for the first tensor.
25. The method of claim 24, wherein the side information derived for the first tensor is a first hyperprior derived for the first tensor, the method further comprising decoding the first hyperprior.
26. The method of claim 25, wherein determining the at least one entropy parameter comprises concatenating the first hyperprior and the first context.
27. The method of claim 26, wherein determining the at least one entropy parameter is based on a network comprising at least one convolution layer.
28. The method of claim 20, wherein the method further comprises entropy decoding the at least one second tensor.
29. The method of claim 28, wherein entropy decoding the at least one second tensor is based on at least one entropy parameter determined using a context obtained for at least one sample of the at least one second tensor based on one or more causal samples of the at least one second tensor.
30. The method of claim 28, wherein entropy decoding the at least one second tensor is based on at least one entropy parameter determined from side information derived for the at least one second tensor.
31. The method of claim 30, wherein the side information derived for the at least one second tensor is a second hyperprior derived for the at least one second tensor, the method further comprising decoding the second hyperprior.
32. The method of claim 30, wherein the side information derived for the at least one second tensor is a down-sampled hyperprior derived for the first tensor.
33. The method of claim 20, wherein the at least one entropy parameter is representative of a Gaussian distribution of the at least one sample of the first tensor.
34. The method of claim 20, wherein the at least one part of the image is reconstructed from the first tensor using a neural network.
35. The method of claim 34, wherein the neural network is a decoder part of an auto-encoder.
36. An apparatus comprising one or more processors configured to reconstruct a first tensor representative of at least one part of an image according to the method of claim 20.
37. A signal comprising first coded data representative of a first tensor encoding at least one part of an image and second coded data representative of at least one second tensor, the at least one second tensor being representative of the first tensor at a lower resolution, the at least one second tensor being used for obtaining a first context used for determining at least one entropy parameter used for entropy decoding the first tensor.
38. The signal of claim 37, further comprising side information derived for the first tensor and used for determining the at least one entropy parameter for entropy decoding the first tensor.
39. The signal of claim 38, wherein the side information is a first hyperprior derived for the first tensor.
40. The signal of claim 37, wherein the signal further comprises side information derived for the at least one second tensor and used for determining at least one entropy parameter used for entropy decoding the at least one second tensor.
41. The signal of claim 40, wherein the side information derived for the at least one second tensor is a second hyperprior derived for the at least one second tensor.
42. A computer readable storage medium having stored thereon the signal according to claim 37.
43. A computer readable storage medium having stored thereon instructions for causing one or more processors to perform the method of claim 1.
44. A computer readable storage medium having stored thereon instructions for causing one or more processors to perform the method of claim 20.
45. A computer program product including instructions which, when the program is executed by one or more processors, cause the one or more processors to carry out the method of claim 1.
46. A computer program product including instructions which, when the program is executed by one or more processors, cause the one or more processors to carry out the method of claim 20.
47. A device comprising: an apparatus according to claim 36; and at least one of (i) an antenna configured to receive a signal according to claim 37, (ii) a band limiter configured to limit the received signal to a band of frequencies that includes the first and second coded data, or (iii) a display configured to display a reconstructed version of the at least one part of an image.
48. A device according to claim 47, comprising a TV, a cell phone, a tablet, or a set-top box.
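Note (editorial context, not language of this application): when the at least one entropy parameter is representative of a Gaussian distribution, as recited in claims 14 and 33, a standard formulation in the learned-compression literature cited below evaluates each quantized sample under a discretized Gaussian. In LaTeX notation:

p(\hat{y}_i \mid \mu_i, \sigma_i) = \Phi\!\left(\frac{\hat{y}_i + \tfrac{1}{2} - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{\hat{y}_i - \tfrac{1}{2} - \mu_i}{\sigma_i}\right), \qquad R = -\sum_i \log_2 p(\hat{y}_i \mid \mu_i, \sigma_i),

where \Phi is the standard normal cumulative distribution function and R approximates the number of bits consumed by the arithmetic coder.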

Priority Applications (2)

- EP24710246.0A, priority date 2023-02-01, filed 2024-01-30: A method and an apparatus for encoding/decoding at least one part of an image using multi-level context model
- CN202480007920.8A, priority date 2023-02-01, filed 2024-01-30: Method and apparatus for encoding/decoding at least a portion of an image using a multi-level context model

Applications Claiming Priority (2)

- US202363442506P, priority date 2023-02-01, filed 2023-02-01
- US63/442,506, priority date 2023-02-01

Publications (1)

- WO2024163481A1 (en)

Family ID: 90362150

Family Applications (1)

- PCT/US2024/013557 (published as WO2024163481A1, en), priority date 2023-02-01, filed 2024-01-30: A method and an apparatus for encoding/decoding at least one part of an image using multi-level context model (status: Ceased)

Country Status (3)

- EP (1): EP4659447A1 (en)
- CN (1): CN120642331A (en)
- WO (1): WO2024163481A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party

- WO2022084702A1 (en) *, priority date 2020-10-23, published 2022-04-28, Deep Render Ltd: Image encoding and decoding, video encoding and decoding: methods, systems and training methods
- WO2022258162A1 (en) *, priority date 2021-06-09, published 2022-12-15, Huawei Technologies Co., Ltd.: Parallelized context modelling using information shared between patches

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party

- C. Ma, Z. Wang, R. Liao, Y. Ye: "A Cross Channel Context Model for Latents in Deep Image Compression", arXiv:2103.02884 [cs, eess], 28 April 2021. Retrieved from the Internet: http://arxiv.org/abs/2103.02884
- D. He, Z. Yang, W. Peng, R. Ma, H. Qin, Y. Wang: "ELIC: Efficient Learned Image Compression with Unevenly Grouped Space-Channel Contextual Adaptive Coding", arXiv, 29 March 2022
- D. Minnen, J. Ballé, G. D. Toderici: "Joint autoregressive and hierarchical priors for learned image compression", Advances in Neural Information Processing Systems, 2018, pages 10771-10780
- J. Ballé, D. Minnen, S. Singh, S. J. Hwang, N. Johnston: "Variational image compression with a scale hyperprior", arXiv:1802.01436 [cs, eess, math], 25 August 2020. Retrieved from the Internet: http://arxiv.org/abs/1802.01436
- Jing Zhou et al.: "Multi-scale and Context-adaptive Entropy Model for Image Compression", arXiv.org, Cornell University Library, Ithaca, NY, 17 October 2019, XP081516959 *
- Liu Ziyi et al.: "Learned Image Compression with Multi-Scale Spatial and Contextual Information Fusion", 2022 IEEE International Conference on Image Processing (ICIP), 16 October 2022, pages 706-710, XP034292704, DOI: 10.1109/ICIP46576.2022.9897285 *
- Tong Chen et al.: "Neural Image Compression via Non-Local Attention Optimization and Improved Context Modeling", arXiv.org, Cornell University Library, Ithaca, NY, 11 October 2019, XP091441073, DOI: 10.1109/TIP.2021.3058615 *

Also Published As

- EP4659447A1 (en), published 2025-12-10
- CN120642331A (en), published 2025-09-12

Similar Documents

Publication Publication Date Title
US20240380929A1 (en) A method and an apparatus for encoding/decoding images and videos using artificial neural network based tools
US20230396801A1 (en) Learned video compression framework for multiple machine tasks
WO2021254855A1 (en) Systems and methods for encoding/decoding a deep neural network
US20250247538A1 (en) Deep-learning-based compression method using frequency decomposition
US20250150626A1 (en) Block-based compression and latent space intra prediction
EP4309367A1 (en) Motion flow coding for deep learning based yuv video compression
KR20250087554A (en) Latent coding for end-to-end image/video compression
WO2024163481A1 (en) A method and an apparatus for encoding/decoding at least one part of an image using multi-level context model
EP4659451A1 (en) A method and an apparatus for encoding/decoding at least one part of an image using one or more multi-resolution transform blocks
CN119999196A (en) Method or apparatus for rescaling a tensor of feature data using an interpolation filter
WO2025056421A1 (en) Dictionary-driven implicit neural representation for image and video compression
WO2024158896A1 (en) Multi-residual autoencoder for image and video compression
EP4605859A1 (en) Method and device for fine-tuning a selected set of parameters in a deep coding system
WO2025168360A1 (en) Multiscale dictionary learning and training of inr network
WO2025011935A1 (en) Approximating implicit neural representation through learnt dictionary atoms
WO2025256970A1 (en) Efficient compression of coding tree unit based implicit neural representation with neural network coding standard
WO2025153372A1 (en) Efficient implementation of dictionary-based implicit neural representation
EP4602810A1 (en) Training method of an end-to-end neural network based compression system
EP4241451A1 (en) Learned video compression and connectors for multiple machine tasks
JP2025517698A (en) Method and apparatus for implementing neural network-based processing with low complexity - Patents.com
WO2025168361A1 (en) Updated dictionary-driven implicit neural representation for image and video compression
WO2025219070A1 (en) Method and device for image enhancement based on residual coding using invertible deep network
EP4584751A1 (en) Methods and apparatuses for encoding and decoding a point cloud

Legal Events

- 121: EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 24710246; Country of ref document: EP; Kind code of ref document: A1
- WWE: WIPO information: entry into national phase. Ref document number: 202480007920.8; Country of ref document: CN
- WWE: WIPO information: entry into national phase. Ref document number: 202517073022; Country of ref document: IN
- NENP: Non-entry into the national phase. Ref country code: DE
- WWP: WIPO information: published in national office. Ref document number: 202517073022; Country of ref document: IN
- WWP: WIPO information: published in national office. Ref document number: 202480007920.8; Country of ref document: CN
- WWP: WIPO information: published in national office. Ref document number: 2024710246; Country of ref document: EP