
End-to-end learned coding via overfitting a latent generator

Info

Publication number
WO2025219940A1
WO2025219940A1 (Application PCT/IB2025/054064)
Authority
WO
WIPO (PCT)
Prior art keywords
generator
latent
input
neural networks
parameters
Prior art date
Legal status
Pending
Application number
PCT/IB2025/054064
Other languages
French (fr)
Inventor
Francesco Cricrì
Honglei Zhang
Nannan ZOU
Current Assignee
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2025219940A1


Classifications

    • H04N19/597: Predictive coding of digital video signals, specially adapted for multi-view video sequence encoding
    • H04N19/30: Coding of digital video signals using hierarchical techniques, e.g. scalability
    • H04N19/463: Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission
    • H04N19/513: Processing of motion vectors (motion estimation or motion compensation for temporal prediction)
    • G06N3/044: Neural network architectures; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/047: Neural network architectures; probabilistic or stochastic networks
    • G06N3/08: Neural network learning methods
    • G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
    • G06N3/088: Learning methods; non-supervised learning, e.g. competitive learning

Definitions

  • FIG.1 illustrates an example of a modified video coding pipeline based on neural networks.
  • FIG.2 shows an example of an end-to-end learned codec.
  • FIG.3 shows a system pipeline for video coding for machine (VCM).
  • FIG.4 illustrates an example of encoder-side operations.
  • FIG.5 illustrates an example of decoder or receiver side operations.
  • FIG.6 shows an example of a decoder.
  • FIG.7 shows an example of an encoder and a decoder.
  • FIG. 8 shows an example where an input to a latent generator (LG) comprises a latent tensor that was generated by the LG at a previous iteration.
  • FIG. 9 shows an example where a generated latent tensor prediction is combined with a latent tensor residual using a combination operation.
  • FIG. 10 shows an example where a latent tensor residual is derived from a signal received from an encoder.
  • FIG.11 shows an example where a generated latent tensor residual that is an output of the LG is combined with a latent tensor prediction using a combination operation.
  • FIG.12 shows an example where a latent tensor prediction is an output of a neural network.
  • FIG.13 shows an example where an input to a visual generator (VG) comprises a generated latent tensor that was output by the LG at a previous iteration, if it is available, and a latent tensor that is output by the LG at a current iteration.
  • FIG. 14 shows an example where at each iteration, an input to the VG comprises an output of the VG at a previous iteration, when it is available, and a generated latent tensor that is output by the LG at a current iteration.
  • FIG. 15 shows an example of a codec where an encoder of the codec determines and encodes one or more spatial coordinates and one or more parameter values based on input data and where a decoder of the codec decodes and uses the one or more spatial coordinates and the one or more parameter values for decoding data.
  • FIG.16 shows an example of determining one or more values of respective one or more parameters of the one or more neural networks of the LG of the decoder.
  • FIG. 17 shows an example of training one or more neural networks in the VG of the decoder.
  • FIG.18 shows an example of using a compression objective during the training of the VG of the decoder.
  • FIG. 19 shows an example of training one or more neural networks (NNs) of the LG jointly with one or more NNs of the VG.
  • FIG.20 shows an example where one or more NNs of the LG are trained after the one or more NNs of the VG have been trained and after a first NN has been trained.
  • FIG. 21 shows schematically a user equipment suitable for employing embodiments of the examples described herein.
  • FIG.22 is a block diagram illustrating a system in accordance with an example.
  • FIG.23 is an example apparatus configured to implement the examples described herein.
  • FIG.24 shows a representation of an example of non-volatile memory media used to store instructions that implement the examples described herein.
  • FIG.25 shows an encoder according to an embodiment.
  • FIG.26 shows a decoder according to an embodiment.
  • FIG.27 is an example method, based on the examples described herein.
  • FIG.28 is an example method, based on the examples described herein.
  • FIG.29 is an example method, based on the examples described herein.
  • FIG.30 is an example method, based on the examples described herein.
  • DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS
  • [0034] Fundamentals of neural networks
  • a neural network (NN) may be described as a computation graph consisting of several layers of computation. Each layer may consist of one or more units, where each unit performs an elementary computation.
  • a unit is connected to one or more other units, and the connection may be associated with a weight.
  • the weight may be used for scaling the signal passing through the associated connection.
  • Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers.
  • In some neural networks, such as convolutional neural networks used for image classification, the initial layers extract semantically low-level features, such as edges and textures in images, and intermediate layers extract more high-level features.
  • After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc.
  • Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc.
  • One property of neural nets (and other machine learning tools) is that they are able to learn properties from input data, e.g., in a supervised way or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal.
  • the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output.
  • the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to.
  • Training usually happens by minimizing or decreasing the output’s error, also referred to as the loss or loss function. Examples of losses are mean squared error, cross-entropy, etc.
  • training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss, by means of a gradient descent technique.
  • gradients of the loss function with respect to one or more weights or parameters of the NN are computed, for example by the backpropagation technique; the computed gradients are then used by an optimization routine, such as Adam or Stochastic Gradient Descent (SGD), to obtain an update to the one or more weights or parameters.
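  • As a minimal illustration of the training iteration described above (this sketch is not part of the patent text; the network architecture, data, and hyper-parameters are hypothetical), the following PyTorch-style code computes a loss, backpropagates gradients, and applies an Adam update:

```python
import torch
import torch.nn as nn

# A small example network; the architecture is arbitrary and only illustrative.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam; SGD would also work
loss_fn = nn.MSELoss()                                     # mean squared error as the loss

def training_step(x, target):
    """One iteration: forward pass, loss, backpropagation, parameter update."""
    optimizer.zero_grad()              # clear gradients from the previous iteration
    prediction = model(x)              # forward pass
    loss = loss_fn(prediction, target)
    loss.backward()                    # backpropagation: gradients of the loss w.r.t. parameters
    optimizer.step()                   # the optimizer uses the gradients to update the parameters
    return loss.item()

# Example usage with random data
x = torch.randn(8, 16)
target = torch.randn(8, 1)
print(training_step(x, target))
```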
  • The terms “model”, “neural network”, “neural net” and “network” are used interchangeably herein, and also the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters.
  • Training a neural network is an optimization process, but the final goal may be different from the typical goal of optimization. In optimization, the only goal is to minimize a function.
  • the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset.
  • the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model.
  • This is usually referred to as generalization.
  • data is usually split into at least two sets, the training set and the validation set.
  • the training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss.
  • the validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model.
  • the errors on the training set and on the validation set are monitored during the training process to understand the following:
  • [0042] Whether the network is learning at all: in this case, the training set error should decrease; otherwise the model is in the regime of underfitting.
  • [0043] Whether the network is learning to generalize: in this case, the validation set error also needs to decrease and not be too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model may be in the regime of overfitting. This means that the model has merely memorized the training set’s properties and performs well only on that set, but performs poorly on a set not used for tuning its parameters.
  • A video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically, the encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at a lower bitrate).
  • Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases.
  • In the first phase, pixel values in a certain picture area are predicted, for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner).
  • In the second phase, the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients.
  • Inter prediction which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy.
  • In inter prediction, the sources of prediction are previously decoded pictures (also known as reference pictures).
  • In temporal inter prediction, the sources of prediction are previously decoded pictures in the same scalable layer.
  • In intra block copy (IBC; also known as intra-block-copy prediction), prediction may be applied similarly to temporal inter prediction, but the reference picture is the current picture and only previously decoded samples can be referred to in the prediction process.
  • Inter-layer or inter-view prediction may be applied similarly to temporal inter prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively.
  • In some cases, inter prediction may refer to temporal inter prediction only, while in other cases inter prediction may refer collectively to temporal inter prediction and any of intra block copy, inter-layer prediction, and inter-view prediction, provided that they are performed with the same or similar process as temporal prediction.
  • Inter prediction, temporal inter prediction, or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction.
  • Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted.
  • Intra prediction is typically exploited in intra coding, where no inter prediction is applied.
  • One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction.
  • the decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame.
  • the decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence.
  • the motion information is indicated with motion vectors associated with each motion compensated image block.
  • Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures.
  • The motion vectors are typically coded differentially with respect to block-specific predicted motion vectors.
  • the predicted motion vectors are created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
  • Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor.
  • the reference index of a previously coded/decoded picture can be predicted.
  • the reference index is typically predicted from adjacent blocks and/or co-located blocks in the temporal reference picture.
  • typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction.
  • predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures, and the used motion field information is signaled among a list of motion field candidates filled with the motion field information of available adjacent/co-located blocks.
  • Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired Macroblock mode and associated motion vectors.
  • This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area:
  • C = D + λR (Equation 1)
  • where C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R is the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors).
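  • As a brief illustration (not part of the patent text; the candidate modes and the distortion, rate, and λ values are made up), the Lagrangian cost of Equation 1 can be evaluated for each candidate coding mode and the minimum selected:

```python
# Hypothetical sketch of Lagrangian mode selection (Equation 1): C = D + lambda * R.
# Candidate modes, distortions (e.g., MSE) and rates (bits) are made-up numbers.
def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """Lagrangian cost tying distortion and rate together with weight lambda."""
    return distortion + lam * rate_bits

candidates = {
    "intra":       {"D": 120.0, "R": 350.0},
    "inter_merge": {"D": 150.0, "R": 180.0},
    "inter_mvd":   {"D": 135.0, "R": 240.0},
}
lam = 0.1  # encoder-chosen trade-off weight
best_mode = min(candidates, key=lambda m: rd_cost(candidates[m]["D"], candidates[m]["R"], lam))
print(best_mode)  # the mode with the lowest Lagrangian cost
```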
  • Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike.
  • Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike.
  • An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation.
  • SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use.
  • the standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance.
  • One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified.
  • Neural networks (NNs) may be used in combination with a non-learned codec, such as a VVC/H.266-compliant codec, where “non-learned” refers to those codecs whose components and their parameters are typically not learned from data by means of machine learning techniques. In such a combination, a NN may be used, for example, as one or more of the following: an in-loop filter (for example a NN that works as an additional in-loop filter with respect to the non-learned loop filters, or a NN that works as the only in-loop filter, thus replacing any other in-loop filter); intra-frame prediction; inter-frame prediction; transform and/or inverse transform; probability model for lossless coding; etc.
  • In another approach, NNs are used as the main components of the image/video codecs.
  • the codec may still comprise components which are not based on machine learning techniques.
  • Option 1: re-use the non-learned video coding pipeline, but replace most or all of the components with NNs, as shown in FIG. 1.
  • FIG. 1 illustrates an example of a modified video coding pipeline based on neural networks.
  • An example of a neural network may include, but is not limited to, a compressed representation of a neural network.
  • FIG. 1 is shown to include the following components: [0063] – A neural transform block or circuit 102: this block or circuit transforms the output of a summation/subtraction operation 103 to a new representation of that data, which may have lower entropy and thus be more compressible.
  • A quantization block or circuit 104: this block or circuit quantizes input data 101 to a smaller set of possible values.
  • Inverse transform and inverse quantization blocks or circuits 106: these blocks or circuits perform the inverse or approximately inverse operation of the transform and the quantization, respectively.
  • An encoder parameter control block or circuit 108: this block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits.
  • An entropy coding block or circuit 110: this block or circuit may perform lossless coding, for example, based on entropy.
  • An encoder 114: may be an encoder block or circuit, such as the neural encoder part of an auto-encoder neural network.
  • A decoder 116: may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network.
  • An intra-coding block or circuit 118: may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization.
  • A deep loop filter block or circuit 120: this block or circuit performs filtering of reconstructed data, in order to enhance it.
  • A decoded picture buffer block or circuit 122: this block or circuit is a memory buffer, keeping decoded frames, for example reconstructed frames 124 and enhanced reference frames 126, to be used for inter prediction.
  • An inter-prediction block or circuit 128: this block or circuit performs inter-frame prediction, for example it predicts from frames, for example frames 132, which are temporally nearby.
  • An ME/MC 130: performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction.
  • ME/MC stands for motion estimation / motion compensation.
  • Option 2, also referred to as end-to-end learned coding: re-design the whole pipeline as a neural network auto-encoder with quantization and lossless coding in the middle part, as follows:
  • [0074] – Encoder NN (also referred to as neural network based encoder, or NN encoder): may perform a non-linear transformation of the input. The output is typically referred to as a latent tensor.
  • [0075] – Quantization and lossless encoding of the encoder NN’s output.
  • Decoder NN (also referred to as neural network based decoder, or NN decoder): may perform a non-linear inverse transformation from the dequantized latent tensor to a reconstructed input.
  • As shown in FIG. 2, a typical neural network-based end-to-end learned coding system 200 contains an encoder 202 and a decoder 204.
  • the encoder 202 comprises an encoder NN 206, a quantizer or quantization operation 208, a probability model 210, and a lossless encoder 212 (for example, an arithmetic encoder).
  • the decoder 204 comprises a lossless decoder 214 (for example, an arithmetic decoder), a probability model 216, a dequantizer or dequantization 218, and a decoder NN 220.
  • the probability model 210 present at encoder side and the probability model 216 present at decoder side may be same or substantially the same. For example, they may be two copies of the same probability model.
  • the lossless encoder 212 and the lossless decoder 214 form a lossless codec 222.
  • a lossless codec such as lossless codec 222 may be an entropy-based lossless codec.
  • An example of a lossless codec is an arithmetic codec, such as context-adaptive binary arithmetic coding (CABAC).
  • the encoder NN 206 and decoder NN 220 are typically two neural networks, or mainly comprise neural network components.
  • the probability model (210, 216) may also be a neural network and/or comprise mainly neural network components, and may be referred to as neural network based probability model or learned probability model.
  • the term lossless codec may refer to a system that also comprises the probability model, in addition to, for example, an arithmetic encoder and an arithmetic decoder.
  • the latent tensor 226 may be a 4D tensor, where the four dimensions of the latent tensor 226 represent a sample dimension (also sometimes referred to as batch dimension, which is the dimension along which different samples of data can be placed), a channel dimension, a vertical dimension (also sometimes referred to as height dimension) and a horizontal dimension (also sometimes referred to as width dimension).
  • the quantized latent tensor 228 is lossless-encoded into a bitstream 230 by the lossless encoder 212, based also on the output 232 of the probability model 210.
  • the probability model takes as input at least part of the quantized latent tensor 228 and outputs 232 an estimate of a probability or an estimate of a probability distribution or an estimate of one or more parameters of a probability distribution for one or more elements of the quantized latent tensor.
  • the bitstream 230 represents an encoded or compressed version of the input 224.
  • the reconstructed input 240 may also be referred to as reconstructed data, or reconstruction, or decoded data, or decoded input, or decoded output, or decoded image, or decoded video, and the like.
  • This is a simplified description of the end-to-end learned video or image coding system 200, and it is to be understood that more sophisticated designs or variations of this design are possible.
  • the encoder 202 may comprise some or all of the components of the decoder 204, even if the some or all of the components of the decoder 204 are not shown as being part of the encoder 202 in FIG.2.
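  • For illustration only, the following sketch mimics the FIG. 2 data flow with toy neural networks (the architectures, sizes, and the use of simple rounding as quantization are assumptions; the probability model and the actual lossless codec are omitted):

```python
import torch
import torch.nn as nn

class ToyEndToEndCodec(nn.Module):
    """Minimal sketch of the FIG. 2 data flow; real systems are far more elaborate."""
    def __init__(self):
        super().__init__()
        # Encoder NN: non-linear transform from image space to a latent tensor (N, C, H, W)
        self.encoder_nn = nn.Sequential(nn.Conv2d(3, 64, 5, stride=4, padding=2), nn.ReLU(),
                                        nn.Conv2d(64, 32, 3, padding=1))
        # Decoder NN: approximate inverse transform back to image space
        self.decoder_nn = nn.Sequential(nn.ConvTranspose2d(32, 64, 4, stride=4), nn.ReLU(),
                                        nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x):
        y = self.encoder_nn(x)          # latent tensor
        y_hat = torch.round(y)          # quantization; a real system would entropy-code y_hat and,
                                        # during training, use a differentiable proxy for rounding
        x_hat = self.decoder_nn(y_hat)  # reconstructed input
        return x_hat, y_hat

codec = ToyEndToEndCodec()
x = torch.rand(1, 3, 64, 64)            # a dummy input picture
x_hat, y_hat = codec(x)
print(x_hat.shape, y_hat.shape)         # reconstruction and quantized latent shapes
```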
  • The end-to-end learned codec may be trained by minimizing a loss function that comprises a distortion loss term D and a rate loss term R, for example D + λR, where λ is a weight that controls the balance between the two losses.
  • the distortion loss term may be referred to also as reconstruction loss term, or simply reconstruction loss.
  • the rate loss term may be referred to simply as rate loss.
  • the distortion loss term measures the quality of the reconstructed or decoded output, and may comprise (but may not be limited to) one or more of the following: Mean square error (MSE); Structure similarity (SSIM); MS-SSIM; Losses derived from the use of a pretrained neural network. For example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm; Losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec.
  • the estimated performance of one or more machine analysis tasks may comprise a distortion computed based at least on a first set of features extracted from an output of the decoder and a second set of features extracted from a respective ground truth data, where the first set of features and the second set of features are output by one or more layers of a pretrained feature-extraction neural network.
  • Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM.
  • the rate loss term is derived from the output of the probability model, and it represents the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the quantized latent tensor;
  • a sparsification loss may also be used, i.e., a loss that encourages the quantized latent tensor to comprise many zeros.
  • Examples are the L0 norm, the L1 norm, and the L1 norm divided by the L2 norm.
  • one or more reconstruction losses may be used, and one or more rate losses may be used.
  • the one or more reconstruction losses and/or one or more rate losses are combined by means of a weighted sum.
  • the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion performance. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy (as measured by a metric that correlates with the reconstruction losses).
  • These weights are usually considered to be hyper-parameters of the training process, and may be set manually by the person designing the training process, or automatically, for example by grid search or by using additional neural networks.
  • [0101] In one case, the training process may be performed jointly with respect to the distortion loss D and the rate loss R.
  • the training process may be performed in two alternating phases, where in a first phase only the distortion loss D may be used, and in a second phase only the rate loss R may be used.
  • In the case of lossless compression, the system would comprise only the probability model, the lossless encoder and the lossless decoder.
  • In this case, the loss function would comprise only the rate loss, since the distortion loss is always zero (i.e., there is no loss of information).
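  • As a hedged illustration of how a rate loss may be derived from the output of a probability model and combined with a distortion loss in a weighted sum (the per-element likelihood tensor, the normalization, and the weight value are assumptions, not taken from the patent):

```python
import torch
import torch.nn.functional as F

def rate_loss_bits(likelihoods):
    """Estimated bits for the quantized latent: sum of -log2(p) over its elements,
    where 'likelihoods' stands for per-element probabilities output by a probability model."""
    return torch.sum(-torch.log2(likelihoods.clamp_min(1e-9)))

def rd_training_loss(x, x_hat, likelihoods, lam=0.01):
    """Weighted sum of a distortion loss (MSE) and a rate loss: D + lambda * R."""
    distortion = F.mse_loss(x_hat, x)
    rate = rate_loss_bits(likelihoods) / x.numel()   # normalize to bits per input element
    return distortion + lam * rate

# Dummy tensors standing in for the input, the decoder output, and the probability model output
x = torch.rand(1, 3, 64, 64)
x_hat = torch.rand(1, 3, 64, 64)
likelihoods = torch.rand(1, 32, 16, 16).clamp_min(0.01)
print(rd_training_loss(x, x_hat, likelihoods).item())
```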
  • The terms inference phase, inference stage, inference time, and test time refer to the phase when a neural network or a codec is used for its purpose, such as encoding and decoding an input image.
  • Video Coding for Machines (VCM)
  • In VCM, the decoded data may be consumed by multiple machines, which may be used in a certain combination that is, for example, determined by an orchestrator sub-system.
  • the multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel.
  • a video may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames.
  • Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc.
  • automatic analysis and processing is increasingly being performed for other types of data, such as audio, speech, text.
  • Compressing (and decompressing) data where the end user comprises machines is commonly referred to as compression or coding for machines.
  • Compressing for machines may differ from compressing for humans for example with respect to the algorithms and technology used in the codec, or the training losses used to train any neural network components of the codec, or the evaluation methodology of codecs.
  • FIG.3 is a general illustration of a pipeline 300 of Video Coding for Machines.
  • a VCM encoder 304 encodes the original video 302 into a bitstream 306.
  • a bitrate 310 may be computed 308 from the bitstream 306, as a measure of the size of the bitstream.
  • a VCM decoder 311 decodes the bitstream 306 that was produced by the VCM encoder 304.
  • the output of the VCM decoder 311 is referred to in the figure as “Decoded data for machines”, which may also be referred to as data 312.
  • This data 312 may be considered as the decoded or reconstructed video.
  • this data 312 may not have the same or similar characteristics as the original video 302 which was input to the VCM encoder 304.
  • this data 312 may not be easily understandable by a human by simply rendering the data onto a screen.
  • the data 312, which is the output of the VCM decoder 311, is then input to one or more task neural networks (task-NNs), for example an object detection task NN 314, an object segmentation task NN 316, an object tracking task NN 318, and a non-specified Task-NN X 320.
  • One goal of VCM may be to obtain a low bitrate while guaranteeing that the task-NNs (e.g., the object detection task NN 314, the object segmentation task NN 316, the object tracking task NN 318, the Task-NN X 320) still perform well in terms of the evaluation metric associated to each task.
  • the VCM decoder 311 may not be present.
  • the machines are run directly on the bitstream 306.
  • the VCM decoder 311 may comprise only a lossless decoding stage, and the lossless decoded data is provided as input to the machines.
  • the VCM decoder 311 may comprise a lossless decoding stage followed by a dequantization operation, and the losslessly decoded and dequantized data is provided as input to the machines.
  • [0113] As shown in FIG. 3, the performance of the object detection task NN 314 is evaluated 322 to generate task performance 332.
  • the performance of the object segmentation task NN 316 is evaluated 324 to generate task performance 334
  • the performance of object tracking task NN 318 is evaluated 326 to generate task performance 336
  • the performance of task-NN X 320 is evaluated 328 to generate task performance 338.
  • When a video encoder, such as a H.266/VVC encoder, is used for coding video intended for machine consumption, one or more of the following approaches may be used to adapt the encoding to be suitable for machine analysis tasks:
  • One or more regions of interest (ROIs) may be detected.
  • An ROI detection method may be used.
  • ROI detection may be performed using a task NN, such as an object detection NN.
  • ROI boundaries of a group of pictures or an intra period may be spatially overlaid and rectangular areas may be formed to cover the ROI boundaries.
  • the detected ROIs (or rectangular areas, likewise) may be used in one or more of the following ways:
  • the quantization parameter (QP) may be adjusted spatially in a manner that ROIs are encoded using finer quantization step size(s) than other regions. For example, QP may be adjusted CTU-wise.
  • the video is preprocessed to contain only the ROIs, while the other areas are replaced by one or more constant values or removed.
  • the video is preprocessed so that the areas outside the ROIs are blurred or filtered.
  • a grid is formed in a manner that a single grid cell covers a ROI. Grid rows or grid columns that contain no ROIs are downsampled as preprocessing to encoding.
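  • The CTU-wise QP adjustment mentioned above could, for example, look like the following sketch (the CTU size, QP values, and ROI rectangles are illustrative assumptions; a real encoder would expose its own mechanism for spatially varying QP):

```python
# Hypothetical sketch of CTU-wise QP adjustment based on detected ROIs:
# CTUs overlapping an ROI get a finer quantization step (lower QP) than the rest.
def ctu_qp_map(frame_w, frame_h, ctu=128, base_qp=37, roi_qp=27, rois=()):
    """Return a 2D list of QP values, one per CTU; 'rois' are (x, y, w, h) rectangles."""
    cols, rows = (frame_w + ctu - 1) // ctu, (frame_h + ctu - 1) // ctu
    qp = [[base_qp] * cols for _ in range(rows)]
    for (rx, ry, rw, rh) in rois:
        for r in range(ry // ctu, min(rows, (ry + rh - 1) // ctu + 1)):
            for c in range(rx // ctu, min(cols, (rx + rw - 1) // ctu + 1)):
                qp[r][c] = roi_qp  # finer quantization inside the ROI
    return qp

print(ctu_qp_map(1920, 1080, rois=[(600, 300, 256, 256)]))
```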
  • --Quantization parameter of the highest temporal sublayer(s) is increased (i.e. coarser quantization is used) when compared to practices for human watchable video.
  • --The original video is temporally downsampled as preprocessing prior to encoding.
  • a frame rate upsampling method may be used as postprocessing subsequent to decoding, if machine analysis at the original frame rate is desired.
  • --A filter is used to preprocess the input to the encoder.
  • the filter may be a machine learning based filter, such as a convolutional neural network.
  • a neural network may be used for filtering or processing input data. Such a neural network may be referred to as a neural network based filter, or simply NN filter.
  • a NN filter may comprise one or more neural networks, and/or one or more components that may not be categorized as neural networks.
  • the purpose of a NN filter may comprise (but may not be limited to) visual enhancement, colorization, upsampling, super-resolution, inpainting, temporal extrapolation, generating content, and the like.
  • a neural network may be used as filter in the encoding and decoding loop (also referred to simply as coding loop), and it may be referred to as neural network loop filter, or neural network in-loop filter.
  • the NN loop filter may replace all other loop filters of an existing video codec, or may represent an additional loop filter with respect to the already present loop filters in an existing video codec.
  • a codec is a modified VVC/H.266 compliant codec (e.g., a VVC/H.266 compliant codec that has been modified and thus may not be compliant with VVC/H.266) that comprises one or more NN loop filters.
  • An input to the one or more NN loop filters may comprise at least a reconstructed block or frame (simply referred to as reconstruction) or data derived from a reconstructed block or frame (e.g., the output of a loop filter).
  • the reconstruction may be obtained based on predicting a block or frame (e.g., by means of intra-frame prediction or inter-frame prediction) and performing residual compensation.
  • the one or more NN loop filters may enhance the quality of at least one of their input, so that a rate-distortion loss is decreased.
  • the rate may indicate a bitrate (estimate or real) of the encoded video.
  • the distortion may indicate a pixel fidelity distortion, such as mean squared error (MSE) or mean absolute error (MAE), or a machine task related metric, such as mean average precision (mAP) computed based on the output of a task NN (such as an object detection NN) when the input is the output of the post-processing NN, or other machine task related metrics for tasks such as object tracking, video activity classification, video anomaly detection, etc.
  • the enhancement may result into a coding gain, which can be expressed for example in terms of BD-rate or BD-PSNR.
  • a neural network filter may be used as post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts.
  • the NN filter is used as a post-processing filter where the input comprises data that is output by or is derived from an output of a non-learned decoder, such as a decoder that is compliant with the VVC/H.266 standard.
  • the NN filter is used as a post-processing filter where the input comprises data that is output by or is derived from an output of a decoder of an end-to-end learned codec.
  • a filter may take as input at least one or more first images to be filtered and may output at least one or more second images, where the one or more second images are the filtered version of the one or more first images.
  • the filter takes as input one image and outputs one image.
  • the filter takes as input more than one image and outputs one image.
  • the filter takes as input more than one image and outputs more than one image.
  • a filter may take as input also other data (also referred to as auxiliary data, or extra data) than the data that is to be filtered, such as data that can aid the filter to perform a better filtering than if no auxiliary data was provided as input.
  • the auxiliary data comprises information about prediction data, and/or information about the picture type, and/or information about the slice type, and/or information about a Quantization Parameter (QP) used for encoding, and/or information about boundary strength, etc.
  • the filter takes as input one image and other data associated to that image, such as information about the quantization parameter (QP) used for quantizing and/or dequantizing that image, and outputs one image.
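  • As an illustrative sketch of a NN filter that takes an image together with auxiliary data such as the QP (the architecture, the normalization of QP by 63, and the residual formulation are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class QPConditionedFilter(nn.Module):
    """Toy NN filter that takes an image and a QP value as auxiliary data and outputs an image."""
    def __init__(self, channels=3):
        super().__init__()
        # +1 input channel for a constant plane that carries the (normalized) QP value
        self.net = nn.Sequential(nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, image, qp):
        n, _, h, w = image.shape
        qp_plane = torch.full((n, 1, h, w), float(qp) / 63.0)  # QP expanded to a full-resolution plane
        return image + self.net(torch.cat([image, qp_plane], dim=1))  # residual filtering

filt = QPConditionedFilter()
decoded = torch.rand(1, 3, 64, 64)   # a dummy decoded picture
filtered = filt(decoded, qp=32)
print(filtered.shape)
```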
  • a NN filter can be adapted at test time based at least on part of the data to be encoded and/or decoded and/or post-processed.
  • similar adaptation may be performed for other coding tools and/or post-processing tools that are based on neural network technology.
  • an adaptation or overfitting may be performed on a neural network based intra-frame prediction, or a neural network based inter- frame prediction, or a latent tensor generator, or a visual data generator, or a media generator, or a data generator, etc.
  • Such operation may be referred to, for example, with one of the following terms, when their meaning is clear from the context: adaptation, content adaptation, overfitting, finetuning, optimization, specialization, and the like.
  • the NN filter that results from the adaptation process may be referred to, for example, with one of the following terms: adapted filter, content-adapted filter, overfitted filter, finetuned filter, optimized filter, specialized filter, and the like.
  • the overfitting process may be performed at encoder side based on a training process. The resulting overfitted filter is then used to derive an overfitting signal, or adaptation signal.
  • the adaptation signal may be compressed and then signaled from the encoder to the decoder, in or along a bitstream that represents encoded data, such as an encoded image or video.
  • FIG. 4 illustrates an example of such encoder-side operations of encoder 400.
  • The input 401 represents an input to the NN filter 402.
  • The output 404 represents an output of the NN filter 402.
  • The ground truth 405 represents ground-truth data associated with the input 401.
  • “Compute loss” 406 computes a training loss 408 in order to overfit the NN filter 402.
  • “Overfit” 410 uses the training loss 408 to overfit the NN filter 402.
  • an overfitted NN filter 414 is obtained, which is used, together with the original NN filter 416, to derive 418 an adaptation signal 420.
  • the adaptation signal 420 is compressed by using compression steps 422, and the compressed adaptation signal 424 is signaled 426 to a decoder or receiver.
  • At the receiver side, the overfitting signal, or a signal derived from the compressed adaptation signal 424 (such as the decompressed adaptation signal 506 that results from decompressing the compressed adaptation signal 424 by using decompression steps 504), is used to update the original NN filter 416 by using the updating process 502.
  • the updated NN filter 510 is then used to filter one or more pictures, or one or more blocks.
  • FIG. 5 thus illustrates an example of such decoder 500 or receiver side operations.
  • the NN filter (e.g., the overfitted NN filter 414) that is obtained from the overfitting process at the encoder 400 may be different from the updated NN filter 510 that is obtained from the updating process 502 at the decoder side. For example, one reason may be that the adaptation signal 420 is compressed in a lossy way by the compression steps 422.
  • the former NN filter (e.g., the overfitted NN filter 414) may be referred to as overfitted filter or adapted filter (or other similar terms, see above), and the latter NN filter may be referred to as updated NN filter 510.
  • Overfitting process performed at encoder side
  • [0142] The adaptation process starts with an initial NN filter (e.g., the original NN filter 416, the NN filter 402). It is to be noted that, before the adaptation or overfitting process has started, the original NN filter 416 and the NN filter 402 may be the same or substantially the same. During the adaptation or overfitting process, the NN filter 402 may be modified, and thus may become different from the original NN filter 416.
  • the initial NN filter (e.g., the original NN filter 416, the NN filter 402) is a pretrained NN filter, which was pretrained during an offline stage on a sufficiently large dataset.
  • the initial NN filter (e.g., the original NN filter 416, the NN filter 402) is a randomly initialized NN filter.
  • one or more parameters of the NN filter 402 may be adapted.
  • Such parameters may include (but may not be limited to) the following: the bias terms of a convolutional neural network; multiplier parameters that multiply one or more tensors produced by the NN filter 402, such as one or more feature tensors that are output by respective one or more layers of the NN filter 402; parameters of the kernels of a convolutional neural network; parameters of an adapter layer; one or more arrays or tensors that are used as input to respective one or more layers of the NN filter 402.
  • [0144] The adaptation may be performed by means of a training process, e.g., by minimizing a loss function until a stopping criterion is met.
  • the data used for this training process may comprise one or more pictures or blocks of input 401 to the NN filter 402 and associated respective one or more pictures or blocks of ground-truth data 405.
  • For example, when the filter is an in-loop filter, the input to the NN filter 402 is reconstruction data (after prediction and residual compensation), and the ground-truth data is the uncompressed data that is given as input to the encoder.
  • As another example, when the filter is used as a post-processing filter, the input to the NN filter 402 is decoded data (e.g., the output of a video decoder), and the ground-truth data 405 is the uncompressed data that is given as input to the encoder.
  • the “overfitting process” is an iterative process where, at each iteration, the NN filter 402 may change; so, at the beginning, NN filter 402 and original NN filter 416 are the same initial filter; then, with each overfitting iteration, NN filter 402 may change, eventually becoming the overfitted NN filter 414.
  • the loss function (used with compute loss 406) used during the training process may comprise one or more distortion loss functions (also referred to as reconstruction loss functions) and zero or more rate loss functions.
  • a rate loss function may measure, for example, the cost in terms of bitrate of signaling any adaptation signal, such as updates to the parameters of the NN filter.
  • a distortion loss function may comprise one of MSE, MS- SSIM, VMAF, etc.
  • the adaptation signal 420 may be derived based on the overfitted NN filter 414 and on the original NN filter 416 (e.g., the NN filter 402 before the overfitting process).
  • the adaptation signal 420 comprises an update to one or more parameters of the NN filter (e.g., the overfitted NN filter 414, the original NN filter 416), e.g., a difference between the values of one or more parameters of the overfitted NN filter 414 and the values of the respective one or more parameters of the original NN filter 416.
  • Such update may also be referred to as a weight update, or parameter update.
  • Such update may be computed, for example, by subtracting the values of the adapted parameters (i.e., the parameters of the overfitted NN filter 414) from the corresponding values of the original parameters (i.e., the parameters of the original NN filter 416).
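  • A minimal sketch of this encoder-side overfitting and weight-update derivation is shown below (the sign convention update = overfitted minus original is an assumption chosen so that the receiver can add the update back; the loss, optimizer, and step count are also assumptions):

```python
import copy
import torch
import torch.nn.functional as F

def overfit_and_derive_update(nn_filter, x, ground_truth, steps=50, lr=1e-4):
    """Overfit the filter on the content at hand, then derive a per-parameter update.
    Sign convention assumed here: update = overfitted - original, so that adding the
    update to the original parameters at the receiver reproduces the overfitted filter."""
    original = copy.deepcopy(nn_filter)          # keep a copy of the original NN filter
    optimizer = torch.optim.Adam(nn_filter.parameters(), lr=lr)
    for _ in range(steps):                       # iterative overfitting (distortion loss only)
        optimizer.zero_grad()
        loss = F.mse_loss(nn_filter(x), ground_truth)
        loss.backward()
        optimizer.step()
    original_params = dict(original.named_parameters())
    update = {name: p.detach() - original_params[name].detach()
              for name, p in nn_filter.named_parameters()}
    return update  # adaptation signal, to be sparsified/quantized/lossless-coded before signaling
```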
  • the adaptation signal 420 comprises the parameters (of the overfitted NN filter 414) that were adapted, also referred to as updated parameters, or adapted parameters, or adapted weights, or overfitted parameters, and the like.
  • the adaptation signal 420 may go through one or more compression steps 422, such as sparsification, quantization and lossless coding.
  • an encoder that compresses the adaptation signal into a bitstream that is compliant with a neural network compression standard, such as MPEG NNC (ISO/IEC 15938-17), may be used.
  • the compressed adaptation signal 424 may be signaled 426 from encoder to decoder in or along a bitstream that represents encoded image or video data.
  • the compressed adaptation signal 424 is signaled 426 in an Adaptation Parameter Set (APS) syntax structure of a video coding bitstream.
  • the compressed adaptation signal 424 is signaled 426 in a Supplemental Enhancement Information (SEI) message of a video coding bitstream.
  • Signaling 426 may comprise also other information which is associated with the compressed adaptation signal 424 and that may be required for correctly parsing and/or decompressing and/or using the compressed adaptation signal 424, such as any quantization parameters.
  • the compressed adaptation signal 424 may be the only signal or bitstream that is sent from an encoder to a decoder and may represent an encoded image or video.
  • the compressed adaptation signal 424 is received and decompressed 504.
  • the decompressed adaptation signal 506 may then be used to update (e.g., by using the updating process 502) the original NN filter 416, resulting in updated NN filter 510.
  • In one example, the adaptation signal (e.g., the compressed adaptation signal 424, the decompressed adaptation signal 506) comprises a weight update, where the weight update comprises one or more updates to respective one or more parameters of the original NN filter 416, and the updating process 502 adds the one or more updates to the one or more parameters.
  • In another example, the adaptation signal (e.g., the compressed adaptation signal 424, the decompressed adaptation signal 506) comprises one or more updated or adapted parameters, and the updating process 502 replaces the respective one or more parameters of the original NN filter 416 with the one or more updated or adapted parameters.
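  • A corresponding receiver-side sketch of the updating process, covering both the additive weight-update case and the parameter-replacement case described above (function and variable names are hypothetical):

```python
import torch

@torch.no_grad()
def apply_adaptation_signal(original_filter, signal, mode="add"):
    """Update the original NN filter using a decompressed adaptation signal.
    mode="add": the signal is a weight update that is added to the parameters.
    mode="replace": the signal contains adapted parameter values that overwrite them."""
    params = dict(original_filter.named_parameters())
    for name, value in signal.items():
        if mode == "add":
            params[name].add_(value)     # add the update to the corresponding parameter
        else:
            params[name].copy_(value)    # replace the parameter with the adapted value
    return original_filter               # the updated NN filter
```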
  • the updated NN filter 510 may be used for its purpose. For example, for filtering an input picture or an input block, or for decoding an image.
  • Terminology
  • [0159] The terms frame, picture and image may be used interchangeably.
  • the input and output to an end-to-end learned codec may be pictures.
  • the input and output of a NN filter may be pictures.
  • a block, when it means a portion of a picture, may be simply referred to as a frame or picture or image.
  • at least some of the embodiments herein, even when described as applied to a picture, may be applicable also to a block, e.g., to a portion of a picture.
  • the examples described herein address the problem of how to design and train a variation of an end-to-end learned codec that is efficient in terms of at least one of the following: decoding complexity, decoding speed, and/or rate-distortion cost.
  • image and video are considered as the data types.
  • image and video data may be collectively referred to as visual data, and it is to be understood that visual data may refer to either image data or video data or both.
  • signal, data and tensor may be used interchangeably to indicate an input or an output.
  • a decoder 602 comprises at least a latent generator (LG) 604 and a visual generator (VG) 606, where the LG 604 is provided with an input 603, where an input to the VG comprises an output 607 of the LG 604 or a signal derived from the output 607 of the LG 604, and where an output 610 of the VG 606 is used to derive a final output, where the final output may be a decoded image or decoded video or decoded visual data. In one example, the final output is the output 610.
  • 603 is an input to the LG 604, 607 is an output of the LG 604 and is input to the VG 606, and the output 610 is an output of the VG 606, such as a decoded picture or decoded video.
  • an encoder may comprise at least some of the components of the decoder, such as the LG and the VG.
  • the LG comprises one or more neural networks, for example a neural network that comprises one or more of (but not limited to) the following types of layers or blocks: convolutional layers, residual layers or blocks, ResNet layers or blocks, transformer layers or blocks, attention layers or blocks, or non-linear layers, such as Rectified Linear Units (ReLU).
  • the LG comprises an auto-encoder architecture that comprises Transformer layers or blocks.
  • the LG comprises a convolutional auto-encoder architecture, such as a U-Net architecture.
  • the VG comprises one or more neural networks, for example a neural network that comprises one or more of (but not limited to) the following types of layers or blocks: convolutional layers, residual layers or blocks, ResNet layers or blocks, transformer layers or blocks, attention layers or blocks, non-linear layers, such as Rectified Linear Units (ReLU).
  • the VG comprises a convolutional decoder-style neural network, such as a neural network that comprises convolutional layers and increases the spatial and/or temporal resolution of its input signal, for example by using transpose convolutional layers or upsampling layers or interpolation layers or pixel-shuffle layers.
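  • As an illustration of such a decoder-style VG (a sketch only; the channel counts, the pixel-shuffle upsampling factor, and the overall architecture are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class ToyVisualGenerator(nn.Module):
    """Toy VG: convolutional decoder-style network that increases spatial resolution,
    here via pixel-shuffle upsampling (transpose convolutions would be an alternative)."""
    def __init__(self, latent_channels=32, out_channels=3, upscale=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),   # rearranges channels into a higher-resolution picture
        )

    def forward(self, latent):
        return self.net(latent)

vg = ToyVisualGenerator()
latent = torch.rand(1, 32, 16, 16)      # a generated latent tensor, e.g. output by the LG
print(vg(latent).shape)                 # torch.Size([1, 3, 64, 64])
```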
  • an input to the LG comprises one or more indicators of respective one or more portions of data to be generated by the LG.
  • an input to the LG comprises one or more indicators of respective one or more portions of data to be decoded by the decoder.
  • an input to the LG comprises a signal that indicates one or more spatial and/or temporal locations of a tensor that is generated by the LG.
  • an input to the LG comprises a signal that indicates one or more spatial and/or temporal locations of a data, such as of an image or video, that is decoded by the decoder.
  • an input to the LG comprises a signal that indicates one or more viewpoints (or viewing directions) of a tensor that is generated by the LG.
  • an input to the LG comprises a signal that indicates one or more viewpoints (or viewing directions) of a data, such as of a picture or video, that is decoded by the decoder.
  • an input to the LG comprises one or more spatial and/or temporal coordinates that indicate respective one or more spatial and/or temporal locations of a tensor that is generated by the LG.
  • an input to the LG comprises one or more spatial and/or temporal coordinates that indicate respective one or more spatial and/or temporal locations of a data, such as of an image or video, that is decoded by the decoder.
  • an input to the LG comprises one or more coordinates of respective one or more pixels to be decoded by the decoder.
  • an input to the LG comprises one or more coordinates or identifiers of respective one or more blocks to be decoded by the decoder, where a block comprises two or more pixels.
  • an input to the LG comprises one or more identifiers of respective one or more pictures, such as frames of a video, to be decoded by the decoder.
  • an input to the LG comprises a noise signal, such as a tensor where the values of the elements of the tensor are randomly sampled based at least on a probability distribution.
  • an input to the LG comprises a tensor that may be referred to as input latent tensor.
  • the input latent tensor may comprise an output of a neural network or a signal that is derived from an output of a neural network, where the neural network may be comprised in an encoder.
  • a neural network outputs a latent tensor, where the latent tensor is then encoded in a lossy and/or lossless way; at decoder side, the encoded latent tensor is decoded and provided as input to the LG; any operations that may be needed to decode the encoded latent tensor may be considered to be part of the encoder. See FIG. 7 for an illustration of this example. [0186] In FIG. 7, ⁇ 702 is an input to an encoder 704 and to an encoder NN 706.
  • An output of an encoder NN 706 is a latent tensor ⁇ 708, which is losslessly encoded 710, optionally after a quantization operation (not shown in FIG. 7), obtaining a bitstream ⁇ 712 which represents an output of the encoder 704.
  • the bitstream ⁇ 712 is input to a decoder 714 and to a lossless decoding process 716, obtaining a decoded latent tensor ⁇ 718, optionally after a dequantization operation (not shown in FIG. 7).
  • the decoded latent tensor ⁇ 718 is input to a LG 720, obtaining a tensor ⁇ 722.
  • the tensor ⁇ 722 is input to a VG 724, obtaining a decoded output ⁇ 726, such as a decoded picture or decoded video, which represents also an output of the decoder 714.
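The following non-normative sketch illustrates the FIG. 7 data flow under simplified assumptions: the encoder NN, LG, and VG are replaced by toy layers, and the lossless (entropy) coding 710/716 is replaced by a trivial byte serialization that merely stands in for a real arithmetic coder; the quantization step size is an arbitrary choice.

```python
# Toy, non-normative walk-through of the FIG. 7 flow; the "lossless codec" is a
# trivial byte serialization used only to make the sketch self-contained.
import numpy as np
import torch
import torch.nn as nn

encoder_nn = nn.Conv2d(3, 2, kernel_size=4, stride=4)                                     # stand-in encoder NN 706
lg = nn.Conv2d(2, 64, kernel_size=3, padding=1)                                           # stand-in LG 720
vg = nn.Sequential(nn.Conv2d(64, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))   # stand-in VG 724


def lossless_encode(t: torch.Tensor) -> bytes:
    # Placeholder for an entropy coder; a real codec would use e.g. arithmetic coding.
    return t.to(torch.int8).numpy().tobytes()


def lossless_decode(b: bytes, shape) -> torch.Tensor:
    return torch.from_numpy(np.frombuffer(b, dtype=np.int8).copy()).float().reshape(shape)


with torch.no_grad():
    x = torch.rand(1, 3, 128, 128)                         # input picture 702
    y = encoder_nn(x)                                      # latent tensor 708
    y_q = torch.clamp(torch.round(y * 16.0), -127, 127)    # quantization (not shown in FIG. 7)
    bitstream = lossless_encode(y_q)                       # bitstream 712 (encoder output)

    y_hat = lossless_decode(bitstream, y_q.shape) / 16.0   # decoded latent tensor 718
    x_hat = vg(lg(y_hat))                                  # LG 720 then VG 724: decoded output 726
```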
  • an input to the LG comprises one or more values of respective one or more parameters of the one or more neural networks comprised in the LG.
  • the decoder may receive, from an encoder, one or more values of respective one or more parameters of the one or more neural networks of the LG, and may assign the received one or more values to the respective one or more parameters of the one or more neural networks of the LG.
  • an input to the LG comprises one or more updates to respective one or more parameters of the one or more neural networks comprised in the LG.
  • the decoder may receive, from an encoder, one or more updates to respective one or more parameters of the one or more neural networks of the LG, and may apply the one or more updates to the respective one or more parameters of the one or more neural networks of the LG.
  • an input to the LG comprises one or more signals that are input to respective one or more layers or blocks of the one or more neural networks comprised in the LG.
  • an input to the LG comprises one or more signals that modulate or modify respective one or more inputs and/or one or more outputs of respective one or more layers or blocks of the one or more neural networks comprised in the LG.
  • an input to the LG comprises a signal that is derived from one or more previous outputs of the LG.
  • the decoder 802 is a video decoder. At each iteration, the decoder 802 decodes an encoded picture or an encoded frame of a video (804) into a decoded picture or a decoded frame of a video (e.g., second decoded frame ⁇ ⁇ 814), respectively.
  • an input to the LG 806 may comprise a first generated latent tensor ⁇ ⁇ 808 that was generated by the LG 806 at a previous iteration, if it is available.
  • the LG 806 outputs a first generated latent tensor ⁇ ⁇ 808, that is provided as input to the VG 810, where the VG 810 outputs a first decoded frame ⁇ ⁇ .
  • an input to the LG 806 comprises the first generated latent ⁇ ⁇ 808, and outputs a second generated latent tensor ⁇ ⁇ 812, that is provided as input to the VG 810, where the VG 810 outputs a second decoded frame ⁇ ⁇ 814.
  • the LG 806 may not take any previous generated latent tensor or may be provided a dummy previous generated latent tensor, such as a tensor with all zero-valued elements.
  • FIG. 8 provides an illustration of this example.
  • Delay tap 816 may comprise, for example, a buffer that stores the previous output (from the i-th iteration) 808 of the LG 806 and provides it as an input to the LG 806 at the current (i+1)-th iteration.
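A non-normative sketch of the FIG. 8 recursion is given below, assuming toy stand-ins for the LG 806 and the VG 810; the delay tap 816 is modelled simply as a Python variable holding the previously generated latent tensor, and the per-frame conditioning input is arbitrary.

```python
# Toy sketch of the FIG. 8 recursion; the delay tap 816 is a plain variable.
import torch
import torch.nn as nn

lg = nn.Conv2d(2 + 64, 64, kernel_size=3, padding=1)        # LG input: conditioning + previous latent
vg = nn.Sequential(nn.Conv2d(64, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))

num_frames, h, w = 3, 32, 32
prev_latent = torch.zeros(1, 64, h, w)                      # dummy previous latent at the first iteration
decoded_frames = []
with torch.no_grad():
    for i in range(num_frames):
        cond = torch.randn(1, 2, h, w)                      # per-frame input to the LG (arbitrary here)
        z_i = lg(torch.cat([cond, prev_latent], dim=1))     # generated latent tensor for frame i
        decoded_frames.append(vg(z_i))                      # decoded frame i
        prev_latent = z_i                                   # delay tap: reused at iteration i + 1
```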
  • one or more locations of data may refer to, for example, one or more coordinates of respective one or more pixels of an image, or one or more indexes of respective one or more frames of a video, or one or more identifiers of respective one or more portions of a data (e.g., one or more identifiers of respective one or more blocks of an image), or one or more indicators of respective one or more portions of a data.
  • an output of the LG is a latent tensor, such as a feature tensor or simply features. This latent tensor may be referred to as generated tensor, generated output, generated latent tensor, or output latent tensor.
  • the generated latent tensor may be input to the VG.
  • an output of the LG may be combined with a combination signal by means of a combination operation, obtaining a combination output.
  • the combination output may be an input to the VG.
  • the combination operation may be an element-wise sum.
  • the combination signal may be, for example, an output of another neural network, or may be derived based on a signal received from an encoder.
  • an output of the LG 902 of decoder 900 comprises a prediction of a latent tensor, also referred to as generated latent tensor prediction 904.
  • the generated latent tensor prediction 904 may be combined with a latent tensor residual ⁇ ⁇ 906 by means of a combination operation 908.
  • the combination operation 908 may comprise a summation.
  • ⁇ ⁇ 904 represents the generated latent tensor prediction
  • ⁇ ⁇ 906 represents the latent tensor residual
  • ⁇ 910 represents a latent tensor that is the result of combining, by means of the combination operation 908 (e.g., an element-wise summation), the generated latent tensor prediction 904 and the latent tensor residual ⁇ ⁇ 906.
  • the latent tensor residual may be an output of a neural network.
  • the latent tensor residual may be derived from a signal received from an encoder.
  • an encoder determines an uncompressed latent tensor residual by subtracting a ground-truth latent tensor from a generated latent tensor prediction that is output by an LG, encodes the uncompressed latent tensor residual in a lossy and/or lossless way; at decoder side, the encoded latent tensor residual is decoded into a decoded latent tensor residual; the decoded latent tensor residual is then used as the latent tensor residual that is combined with the generated latent tensor prediction.
  • FIG. 10 illustrates this example.
  • an uncompressed latent tensor residual ⁇ ⁇ 1004 is lossless-encoded 1006 into a bitstream ⁇ 1008.
  • the bitstream ⁇ 1008 is decoded 1012 into a latent tensor residual ⁇ ⁇ 1014, which is summed 1016 with the generated latent tensor prediction 1018 generated with the LG 1020.
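The following sketch illustrates, under simplified assumptions, the combination of a generated latent tensor prediction with a latent tensor residual by an element-wise sum, as in FIG. 9 and FIG. 10; the residual is produced locally by a toy quantization round trip standing in for a residual decoded from a bitstream sent by the encoder.

```python
# Toy sketch of combining a generated latent tensor prediction with a latent
# tensor residual by element-wise summation (combination operation 908 / 1016).
import torch
import torch.nn as nn

lg = nn.Conv2d(2, 64, kernel_size=3, padding=1)             # stand-in LG
vg = nn.Sequential(nn.Conv2d(64, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))

with torch.no_grad():
    z_pred = lg(torch.randn(1, 2, 32, 32))                  # generated latent tensor prediction
    # Stand-in for a residual decoded from a bitstream sent by the encoder.
    r_hat = torch.round(torch.randn_like(z_pred) * 4.0) / 4.0
    z = z_pred + r_hat                                      # element-wise sum
    x_hat = vg(z)                                           # decoded picture
```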
  • an output of the LG 1102 of a decoder 1101 comprises a residual of a latent tensor, also referred to as generated latent tensor residual 1104.
  • the generated latent tensor residual 1104 may be combined with a latent tensor prediction 1106 by means of a combination operation 1108.
  • the combination operation 1108 may comprise a summation.
  • FIG. 11 provides an illustration of this embodiment.
  • the latent tensor prediction 1202 may be an output of a neural network, such as a NN based predictor 1204.
  • FIG.12 illustrates this embodiment, where in the example shown in FIG. 12, the NN based predictor 1204 is part of decoder 1201.
  • an input to the NN based predictor 1204 may comprise, but may not be limited to, one or more signals that are derived from respective one or more of the following: a previous generated latent tensor; or a previous latent tensor that is the result of combining a previous generated latent tensor residual and a previous latent tensor prediction; or a previous decoded picture.
  • an input to the VG is a signal or tensor that is the result of a combination of an output of the LG and of another signal such as a prediction signal or a residual signal.
  • an input to the VG is a tensor that is obtained as the result of an element-wise sum of a generated latent tensor prediction and a latent tensor residual.
  • an input to the VG comprises a signal that is derived from one or more previous outputs of the LG.
  • the decoder 1301 is a video decoder. At each iteration, the decoder 1301 decodes an encoded picture or an encoded frame of a video (1303) into a decoded picture or a decoded frame (e.g., a second decoded frame ⁇ ⁇ 1312) of a video, respectively.
  • an input to the VG 1310 comprises a generated latent tensor that was output by the LG 1302 at a previous iteration, when it is available, and a latent tensor that is output by the LG 1302 at the current iteration.
  • the LG 1302 outputs a first generated latent tensor ⁇ ⁇ 1306, that is provided as input to the VG 1310, where the VG 1310 outputs a first frame ⁇ ⁇ .
  • the LG 1302 outputs a second generated latent tensor ⁇ ⁇ and an input to the VG 1310 comprises the second generated latent tensor ⁇ ⁇ 1308 and the first generated latent tensor ⁇ ⁇ 1306.
  • an output of the VG 1310 comprises the second decoded frame ⁇ ⁇ 1312.
  • the VG 1310 may not take any previous generated latent tensor or may be provided a dummy previous generated latent tensor, such as a tensor with all zero-valued elements.
  • FIG.13 illustrates this example.
  • an input to the VG comprises a signal that is derived from one or more previous outputs of the VG.
  • the decoder 1401 is a video decoder. At each iteration, the decoder 1401 decodes an encoded picture or an encoded frame of a video (1402) into a decoded picture or a decoded frame (e.g., a second decoded frame ⁇ ⁇ ⁇ 1410) of a video, respectively.
  • an input to the VG 1408 comprises an output of the VG 1408 at a previous iteration, when it is available, and a generated latent tensor that is output by the LG 1404 at the current iteration.
  • the LG 1404 outputs a first generated latent tensor ⁇ ⁇ , that is provided as input to the VG 1408, where the VG 1408 outputs a first decoded frame ⁇ ⁇ 1406.
  • the LG 1404 outputs a second generated latent tensor ⁇ ⁇ 1405.
  • An input to the VG 1408 comprises the second generated latent tensor ⁇ ⁇ ⁇ 1405 and the first decoded frame ⁇ ⁇ ⁇ 1406.
  • An output of the VG 1408 comprises a second decoded frame ⁇ ⁇ 1410.
  • the VG 1408 may not take any previous decoded frame, or may be provided a dummy decoded frame, such as a tensor with all zero-valued elements.
  • FIG. 14 illustrates this example.
  • a delay tap 1412 stores the output (e.g., the first decoded frame ⁇ ⁇ 1406, the second decoded frame ⁇ ⁇ 1410) of the VG 1408 that is provided as input to the VG 1408 at a subsequent iteration.
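The sketch below illustrates, in a non-normative way, the FIG. 14 arrangement in which the VG also receives the previously decoded frame through a delay tap; the downsampling of the previous frame to the latent resolution is an assumption made only so that the toy tensors can be concatenated.

```python
# Toy sketch of FIG. 14: the VG also receives the previously decoded frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

lg = nn.Conv2d(2, 64, kernel_size=3, padding=1)
# VG input: generated latent (64 ch) + previous decoded frame resampled to the latent grid (3 ch).
vg = nn.Sequential(nn.Conv2d(64 + 3, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))

h, w = 32, 32
prev_frame = torch.zeros(1, 3, 4 * h, 4 * w)                # dummy decoded frame at the first iteration
with torch.no_grad():
    for i in range(3):
        z_i = lg(torch.randn(1, 2, h, w))                   # generated latent tensor for frame i
        prev_small = F.interpolate(prev_frame, size=(h, w), mode="bilinear", align_corners=False)
        frame_i = vg(torch.cat([z_i, prev_small], dim=1))   # decoded frame i
        prev_frame = frame_i                                # delay tap 1412
```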
  • an input to the LG may be used as an input to the VG.
  • the one or more spatial and/or temporal locations, or a signal derived from the one or more spatial and/or temporal locations are/is provided as an input to the VG and represent respective spatial and/or temporal locations of a decoded picture or decoded video.
  • an input to the VG comprises one or more values of respective one or more parameters of the one or more neural networks comprised in the VG.
  • an input to the VG comprises one or more updates to respective one or more parameters of the one or more neural networks comprised in the VG.
  • an input to the VG comprises one or more signals that are input to respective one or more layers or blocks of the one or more neural networks comprised in the VG.
  • an input to the VG comprises one or more signals that modulate or modify respective one or more inputs and/or one or more outputs of respective one or more layers or blocks of the one or more neural networks comprised in the VG.
  • an output of the VG comprises a decoded data, such as a decoded image, or decoded video, or decoded audio, or decoded media data.
  • an output of the VG comprises decoded one or more portions of data, and wherein the decoded one or more portions of data correspond to or are indicated by the one or more indicators.
  • the decoder receives from an encoder one or more indicators of respective one or more portions of data to decode, or a signal derived from one or more indicators of respective one or more portions of data to decode such as encoded or compressed one or more indicators, where the one or more indicators may be used as an input to the LG and/or the VG.
  • the decoder may first decode or decompress the encoded or compressed one or more indicators, obtaining decoded or decompressed one or more indicators, and the decoded or decompressed one or more indicators may be used as an input to the LG and/or the VG.
  • the decoder receives from an encoder one or more spatial and/or temporal locations or identifiers, or a signal derived from one or more spatial and/or temporal locations or identifiers, where the one or more spatial and/or temporal locations or identifiers, or the signal derived from the one or more spatial and/or temporal locations or identifiers are used as an input to the LG.
  • an image codec 1500 comprises an encoder 1502 and a decoder 1522, where the decoder 1522 is based at least on some of the embodiments above, thus comprising at least a LG 1532 and a VG 1536, where the LG 1532 and the VG 1536 comprise a neural network.
  • the encoder 1502 encodes a first information 1508 (e.g., by using a block “Spatial coordinates encoding” 1510) about one or more spatial locations of an image to be decoded by the decoder 1522, such as one or more spatial coordinates; this encoded first information may also be referred to as encoded spatial locations ⁇ ⁇ 1514.
  • the spatial coordinates may comprise normalized spatial coordinates, for example normalized to a range between 0 and 1.
  • the encoder 1502 encodes second information (e.g., by using a block “Parameter encoding” 1512) about one or more values ⁇ 1509 of respective one or more parameters of the neural network comprised in the LG 1532; this encoded second information may be referred to also as encoded parameter values ⁇ ⁇ 1516.
  • the encoded spatial locations ⁇ ⁇ 1514 and the encoded parameter values ⁇ ⁇ 1516 are provided to the decoder 1522.
  • the decoder 1522 decodes (e.g., by using a block “Spatial coordinates decoding” 1524, block “Parameter decoding” 1526) the encoded spatial locations ⁇ ⁇ 1514 and the encoded parameter values ⁇ ⁇ 1516, obtaining the one or more spatial locations or coordinates ⁇ 1528 and the decoded one or more parameter values ⁇ 1530.
  • the decoded one or more parameter values ⁇ 1530 are assigned to the respective parameters in the neural network comprised in the LG 1532.
  • the spatial locations or coordinates ⁇ 1528 are input to the LG 1532.
  • the LG 1532 is executed or run, obtaining a generated latent tensor ⁇ 1534.
  • the generated latent tensor ⁇ 1534 is input to the VG 1536.
  • the VG 1536 is executed or run, obtaining a decoded image or one or more decoded pixels ⁇ 1540.
  • the one or more values ⁇ 1509 of respective one or more parameters of the neural network comprised in the LG 1532 are determined at encoder side based at least on a parameter determination process (e.g., by using a block “Determine LG parameters” 1506).
  • the parameter determination process comprises performing a training process, or a finetuning process, or an overfitting process, on the neural network comprised in the LG 1532, based at least on the image(s) to be decoded by the decoder.
  • FIG. 15 thus illustrates this example.
  • the encoder 1502 may comprise also some or all of the components of the decoder 1522, even if they are not illustrated in FIG. 15 to be part of the encoder 1502.
  • an image ⁇ 1503 is input to a block “Determine spatial coordinates” 1504 which determines one or more spatial coordinates, represented in the figure by ⁇ 1508, that identify respective one or more spatial locations of respective one or more pixels to be decoded.
  • the one or more spatial coordinates are encoded by a block “Spatial coordinates encoding” 1510, that may perform, for example, lossless encoding, obtaining a first bitstream comprising the encoded spatial locations ⁇ ⁇ 1514.
  • the first bitstream comprising the encoded spatial locations ⁇ ⁇ 1514 is decoded by a block “Spatial coordinates decoding” 1524, obtaining decoded one or more spatial locations or coordinates ⁇ 1528.
  • the image ⁇ 1503 is input to the block “Determine LG parameters” 1506, which determines one or more values of respective one or more parameters of a neural network in the LG 1532, where these one or more values are represented by ⁇ 1509.
  • the one or more values ⁇ 1509 are encoded by a block “Parameter encoding” 1512, obtaining a second bitstream comprising the encoded parameter values ⁇ ⁇ 1516, where the encoding, by the block “Parameter encoding” 1512, may comprise a lossy encoding operation, such as a quantization operation, and a lossless encoding operation.
  • the second bitstream comprising the encoded parameter values ⁇ ⁇ 1516 is decoded by a block “Parameter decoding” 1526, obtaining decoded one or more parameter values ⁇ 1530, where the decoding by the block “Parameter decoding” 1526 may comprise, for example, a lossless decoding operation and a dequantization operation.
  • the decoded one or more parameter values ⁇ 1530 are used as values for respective one or more parameters of the neural network in the LG 1532.
  • the decoder 1522 assigns the decoded one or more parameter values ⁇ 1530 to the respective one or more parameters of the neural network in the LG 1532.
  • the LG 1532 is executed or run on the input represented by the decoded one or more spatial locations or coordinates ⁇ 1528, obtaining a generated latent tensor ⁇ 1534.
  • the generated latent tensor ⁇ 1534 is input to the VG 1536.
  • the VG 1536 is run or executed, obtaining a decoded image or one or more decoded pixels ⁇ 1540.
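A non-normative decoder-side sketch of the FIG. 15 example follows: decoded parameter values are assigned to the LG, and the LG is run on a grid of normalized spatial coordinates before the (offline-trained) VG produces the decoded pixels. The flat-vector parameter serialization and the helper name assign_parameters are illustrative assumptions, not a specified format.

```python
# Toy decoder-side sketch of FIG. 15; assign_parameters and the flat parameter
# vector are illustrative assumptions, not a specified serialization.
import torch
import torch.nn as nn

lg = nn.Sequential(nn.Conv2d(2, 32, kernel_size=1), nn.ReLU(), nn.Conv2d(32, 64, kernel_size=1))
vg = nn.Sequential(nn.Conv2d(64, 3, kernel_size=3, padding=1))   # stand-in for the offline-trained VG


def assign_parameters(module: nn.Module, values: torch.Tensor) -> None:
    """Assign a flat vector of decoded parameter values to the module parameters."""
    offset = 0
    with torch.no_grad():
        for p in module.parameters():
            n = p.numel()
            p.copy_(values[offset:offset + n].view_as(p))
            offset += n


h, w = 64, 64
# Decoded spatial coordinates 1528: a 2-channel grid of normalized (row, column) positions.
grid_y, grid_x = torch.meshgrid(torch.linspace(0.0, 1.0, h), torch.linspace(0.0, 1.0, w), indexing="ij")
coords = torch.stack([grid_y, grid_x], dim=0).unsqueeze(0)       # shape 1 x 2 x H x W

# Decoded parameter values 1530 (random here; in practice obtained from the parameter bitstream).
num_params = sum(p.numel() for p in lg.parameters())
decoded_values = 0.01 * torch.randn(num_params)

assign_parameters(lg, decoded_values)                            # assign values to the LG 1532
with torch.no_grad():
    x_hat = vg(lg(coords))                                       # decoded image or pixels 1540
```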
  • FIG. 16 illustrates an example of determining one or more values of respective one or more parameters of the one or more neural networks of the LG 1602, i.e., an example of the block “Determine LG parameters” 1506 of FIG. 15.
  • one or more values of respective one or more parameters of the one or more neural networks of the LG 1602 are determined by means of a training or overfitting process 1600 that comprises one or more training or overfitting iterations.
  • an input ⁇ 1601 is provided to the LG 1602, where the input ⁇ 1601 may comprise spatial coordinates (for example, vertical and horizontal spatial coordinates referring to the vertical and horizontal spatial coordinates of the pixels to be decoded by the decoder).
  • the LG 1602 is run or executed, obtaining a generated latent tensor ⁇ 1604, which is input to the VG 1606.
  • the VG 1606 is run or executed, obtaining a decoded image ⁇ 1608.
  • a loss function is computed 1610, based at least on the decoded image ⁇ 1608 and on a ground-truth image ⁇ 1603, where the ground-truth image may be, for example, an uncompressed image that is to be encoded.
  • the block 1612 uses the loss function to update the parameters of the neural networks of the LG 1602, and comprises determining updates and applying those updates to the parameters.
  • the block 1612 uses the loss function to determine one or more updates to respective one or more parameters of the one or more neural networks of the LG, for example, by means of backpropagation.
  • the one or more updates are used to update the respective one or more parameters of the one or more neural networks of the LG 1602, obtaining updated one or more values.
  • the one or more values of the respective one or more parameters of the one or more neural networks of the LG 1602 are the updated one or more values that were determined at one of the iterations of the overfitting process, such as at the last iteration of the overfitting process.
  • the determined one or more values of respective one or more parameters may represent the output or result of the overfitting process, such as one or more values ⁇ 1509 in FIG.15.
  • the encoder may include at least some of the components of the decoder, including the LG 1602 and/or the VG 1606, which components may be used during the determination of the update values 1614 of the one or more parameters of the one or more neural networks.
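The following sketch illustrates one possible form of the FIG. 16 overfitting process at the encoder side, assuming PyTorch: only the LG parameters are optimized, the offline-trained VG is kept frozen, and the loss compares the decoded image with the ground-truth image to be encoded. The optimizer, learning rate, loss, and iteration count are arbitrary choices made for this sketch.

```python
# Toy encoder-side sketch of the FIG. 16 overfitting process; hyperparameters
# and layer choices are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

lg = nn.Sequential(nn.Conv2d(2, 32, kernel_size=1), nn.ReLU(), nn.Conv2d(32, 64, kernel_size=1))
vg = nn.Sequential(nn.Conv2d(64, 3, kernel_size=3, padding=1))   # assumed trained offline
for p in vg.parameters():
    p.requires_grad_(False)                                      # VG kept frozen during overfitting

coords = torch.rand(1, 2, 64, 64)                                # input 1601 (e.g., spatial coordinates)
ground_truth = torch.rand(1, 3, 64, 64)                          # ground-truth image 1603

optimizer = torch.optim.Adam(lg.parameters(), lr=1e-3)
for step in range(200):                                          # overfitting iterations
    optimizer.zero_grad()
    decoded = vg(lg(coords))                                     # run the LG 1602 and the VG 1606
    loss = F.mse_loss(decoded, ground_truth)                     # loss 1610 (distortion term only here)
    loss.backward()                                              # determine updates via backpropagation
    optimizer.step()                                             # apply updates to the LG parameters (1612)

# The overfitted values to be encoded and signalled to the decoder (cf. 1509 in FIG. 15).
overfitted_values = torch.cat([p.detach().flatten() for p in lg.parameters()])
```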
  • the one or more neural networks in the VG are trained during an offline stage 1700, as follows.
  • offline stage indicates a phase or time that occurs prior to the utilization of the encoder or decoder, or in any case prior to the time when the result of the training (e.g., the trained neural networks comprised in the VG) is to be used in the encoder or decoder.
  • the VG comprises one neural network, referred to as a second NN 1704.
  • the second NN 1704 is trained together with a first NN 1702.
  • An input 1701 such as a training image (or a batch of training images) from a training dataset, is provided to the first NN 1702.
  • An output 1703 of the first NN 1702 is provided as input to the second NN 1704.
  • An output 1706 from the second NN 1704 is obtained.
  • a loss function is computed 1708 based on the input 1701 to the first NN 1702 and the output 1706 from the second NN 1704.
  • the loss function is used to perform one iteration of training for the first NN 1702 and the second NN 1704, for example by means of gradient descent and backpropagation techniques.
  • One or more training iterations are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied.
  • a compression objective may be used during the training.
  • the compression objective comprises a loss function computed based at least on the output of the first NN or based at least on data derived from the output of the first NN.
  • the loss function may estimate an entropy or cross- entropy of the output of the first NN.
  • the loss function may estimate a bitrate of a bitstream that represents the output of the first NN.
  • the loss function may be derived from an output of a learned probability model (another neural network that takes as input the output of the first NN or data derived from the output of the first NN). See FIG.18 for an example of this embodiment.
  • a first loss function 1810 is computed based at least on an output ⁇ 1808 of the second NN 1806 and on an input ⁇ 1802 to the first NN 1804;
  • a second loss function 1812 is computed based at least on an output ⁇ 1805 of the first NN 1804.
  • the first loss function 1810 and the second loss function 1812 are used to train 1814 the first NN 1804 via updated parameters 1816 and the second NN 1806 via updated parameters 1818.
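A non-normative sketch of the offline training of FIG. 17/FIG. 18 is given below: a first NN and the second NN (the VG) are trained jointly with a distortion loss and a rate term computed from the output of the first NN. The rate term here is a simple L1 penalty standing in for an entropy estimate from a learned probability model; that substitution is a simplification of this sketch, not a feature of the embodiments.

```python
# Toy sketch of the joint offline training of a first NN and the second NN (VG)
# with a distortion loss and a rate proxy; the L1 rate proxy is a stand-in for
# an entropy estimate from a learned probability model.
import torch
import torch.nn as nn
import torch.nn.functional as F

first_nn = nn.Conv2d(3, 64, kernel_size=4, stride=4)                                            # cf. first NN 1804
second_nn = nn.Sequential(nn.Conv2d(64, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))  # cf. second NN 1806

optimizer = torch.optim.Adam(list(first_nn.parameters()) + list(second_nn.parameters()), lr=1e-4)
lam = 0.01                                                       # rate-distortion trade-off weight

for step in range(100):                                          # offline training iterations
    x = torch.rand(8, 3, 64, 64)                                 # batch of training images
    y = first_nn(x)                                              # output of the first NN
    x_hat = second_nn(y)                                         # output of the second NN
    distortion = F.mse_loss(x_hat, x)                            # cf. first loss function 1810
    rate_proxy = y.abs().mean()                                  # cf. second loss function 1812 (stand-in)
    loss = distortion + lam * rate_proxy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```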
  • the VG is trained offline and the LG is overfitted online (and thus values of at least some of its parameters are determined by an encoder and sent to the decoder), where offline indicates a phase or time that occurs prior to the utilization of the encoder or decoder, or in any case prior to the time when the result of the training (e.g., the trained neural networks comprised in the VG) is used in the encoder or decoder for encoding a data item, and where online refers to a time when the codec is used for encoding/decoding an input data item such as an image, e.g., inference time or test time.
  • a decoder comprises a LG and a VG, the VG is trained offline, the LG is overfitted/trained online, where the encoder determines values of at least some of the parameters of the LG and sends them to the decoder to be assigned to respective parameters of the LG.
  • Training features for the LG
  • the one or more neural networks of the LG are not trained or pretrained in an offline stage. In one example, they are randomly initialized, e.g., the values of the parameters of the one or more neural networks of the LG are initialized according to a (pseudo-)random sampling of a probability distribution.
  • the one or more neural networks of the LG are trained or pretrained in an offline stage.
  • the one or more NNs of the LG are trained jointly with the one or more NNs of the VG.
  • the VG comprises one neural network, referred to as a second NN 1910; also, let us assume that the LG comprises one neural network, referred to as a third NN 1908.
  • the second NN 1910 and the third NN 1908 are trained together with a first NN 1906.
  • An input 1904 such as a training image (or batch of training images) from a training dataset, is provided to the first NN 1906.
  • An output 1907 of the first NN 1906 is provided as input to the third NN 1908.
  • An output 1909 from the third NN 1908 is obtained.
  • the output 1909 of the third NN 1908 is provided as input to the second NN 1910.
  • An output 1912 from the second NN 1910 is obtained.
  • a loss function is computed 1914 based on the input 1904 to the first NN 1906 and the output 1912 from the second NN 1910.
  • the loss function is used to perform one iteration of training 1916 for the first NN 1906, the second NN 1910 and the third NN 1908, for example by means of gradient descent and backpropagation techniques.
  • One or more training iterations are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied.
  • Training 1916 provides an update 1918 to the first NN 1906, an update 1920 to the second NN 1910, and an update 1922 to the third NN 1908.
  • the one or more NNs of the LG are trained after the one or more NNs of the VG have been trained and after a first NN has been trained.
  • the VG comprises one neural network, referred to as a pretrained second NN 2010, and the LG comprises one neural network, referred to as a third NN 2008.
  • since the first NN 2006 and the second NN 2010 have already been trained, they may be referred to as the pretrained first NN 2006 and the pretrained second NN 2010, respectively.
  • An output 2007 of the pretrained first NN 2006 is provided as input to the third NN 2008.
  • An output 2009 from the third NN 2008 is obtained.
  • the output 2009 of the third NN 2008 is provided as input to the pretrained second NN 2010.
  • An output 2012 from the pretrained second NN 2010 is obtained.
  • a loss function 2014 is computed based on the input 2004 to the pretrained first NN 2006 and the output 2012 from the pretrained second NN 2010.
  • the loss function 2014 is used to perform one iteration of training 2016 for the third NN 2008, for example by means of gradient descent and backpropagation techniques.
  • One or more iterations of training 2016 are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied.
  • the one or more iterations of training 2016 are used to provide an update 2022 to the third NN 2008 of the LG.
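The sketch below illustrates the FIG. 20 arrangement under toy assumptions: only the third NN 2008 (in the LG) is trained, while the pretrained first NN 2006 and the pretrained second NN 2010 are kept frozen.

```python
# Toy sketch of FIG. 20: only the third NN (in the LG) is trained; the
# pretrained first NN and second NN are frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

first_nn = nn.Conv2d(3, 64, kernel_size=4, stride=4)                                            # pretrained first NN 2006
third_nn = nn.Conv2d(64, 64, kernel_size=3, padding=1)                                          # third NN 2008 (trained here)
second_nn = nn.Sequential(nn.Conv2d(64, 3 * 16, kernel_size=3, padding=1), nn.PixelShuffle(4))  # pretrained second NN 2010

for p in list(first_nn.parameters()) + list(second_nn.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.Adam(third_nn.parameters(), lr=1e-4)
for step in range(100):
    x = torch.rand(8, 3, 64, 64)                                 # training images (input 2004)
    y = first_nn(x)                                              # output 2007
    z = third_nn(y)                                              # output 2009
    x_hat = second_nn(z)                                         # output 2012
    loss = F.mse_loss(x_hat, x)                                  # loss function 2014
    optimizer.zero_grad()
    loss.backward()                                              # training 2016: update 2022 to the third NN
    optimizer.step()
```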
  • the one or more NNs of the VG are trained, then the one or more NNs of the LG are trained jointly with a finetuning of the one or more NNs of the VG.
  • the one or more neural networks of the LG and, optionally, the one or more neural networks in the VG are trained or pretrained in an offline stage by means of a meta-learning algorithm, such as an algorithm that is based on the Model-Agnostic Meta-Learning (MAML) algorithm.
  • a training iteration comprises one or more inner training iterations and an outer training iteration, wherein an inner training iteration comprises overfitting the NNs in the LG based on an image and computing an overfitting performance score, and wherein the outer training iteration comprises training at least the NNs in the LG and, optionally, the NNs in the VG, based at least on a loss function that is computed based at least on one or more performance scores computed during the respective one or more inner training iterations.
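For illustration only, the following is a heavily simplified sketch of a meta-learning offline stage for the LG. It uses a Reptile-style first-order update (a first-order stand-in chosen for brevity) rather than full second-order MAML: an inner loop overfits a copy of the LG to one image, and the outer loop moves the LG initialization toward the overfitted parameters.

```python
# Heavily simplified, first-order (Reptile-style) sketch of a meta-learning
# offline stage for the LG; full MAML would differentiate through the inner loop.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

lg = nn.Sequential(nn.Conv2d(2, 32, kernel_size=1), nn.ReLU(), nn.Conv2d(32, 64, kernel_size=1))  # LG meta-initialization
vg = nn.Sequential(nn.Conv2d(64, 3, kernel_size=3, padding=1))                                    # assumed pretrained, frozen
for p in vg.parameters():
    p.requires_grad_(False)

meta_lr, inner_lr, inner_steps = 0.1, 1e-3, 5

for meta_step in range(50):                                      # outer training iterations
    image = torch.rand(1, 3, 64, 64)                             # one training image (one "task")
    coords = torch.rand(1, 2, 64, 64)
    lg_task = copy.deepcopy(lg)                                  # task-specific copy to overfit
    inner_opt = torch.optim.Adam(lg_task.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                                 # inner overfitting iterations
        inner_opt.zero_grad()
        inner_loss = F.mse_loss(vg(lg_task(coords)), image)      # overfitting performance score
        inner_loss.backward()
        inner_opt.step()
    with torch.no_grad():                                        # outer update of the meta-initialization
        for p_meta, p_task in zip(lg.parameters(), lg_task.parameters()):
            p_meta.add_(meta_lr * (p_task - p_meta))
```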
  • Finetuning features for LG and VG
  • the one or more NNs in the LG may be finetuned at test time.
  • An encoder determines one or more update values for respective one or more parameters of the one or more neural networks in the LG, encodes them and sends the encoded one or more update values to the decoder.
  • the encoded one or more update values are decoded, obtaining decoded one or more update values.
  • the decoded one or more update values are used to update the respective one or more parameters of the one or more neural networks of the LG.
  • the one or more NNs in the VG may be finetuned at test time.
  • An encoder determines one or more update values for respective one or more parameters of the one or more neural networks in the VG, encodes them and sends the encoded one or more update values to the decoder.
  • the encoded one or more update values are decoded, obtaining decoded one or more update values.
  • the decoded one or more update values are used to update the respective one or more parameters of the one or more neural networks of the VG.
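A non-normative sketch of the test-time finetuning signalling follows, using the VG as an example (the same pattern applies to the LG): the encoder computes update values as differences with respect to the shared pretrained weights and quantizes them, and the decoder applies the dequantized updates to its own copy. The uniform quantizer and the in-memory list standing in for the encoded update values are illustrative placeholders, not a specified bitstream format.

```python
# Toy sketch of signalling test-time finetuning updates (shown for the VG; the
# same pattern applies to the LG). The quantizer and the in-memory "bitstream"
# are placeholders.
import copy
import torch
import torch.nn as nn

shared_vg = nn.Sequential(nn.Conv2d(64, 3, kernel_size=3, padding=1))   # same pretrained VG at encoder and decoder
encoder_vg = copy.deepcopy(shared_vg)
with torch.no_grad():
    for p in encoder_vg.parameters():
        p.add_(0.01 * torch.randn_like(p))                              # stand-in for the finetuning result

step_size = 1.0 / 256.0                                                 # uniform quantization step

# Encoder side: determine and quantize the update values (weight differences).
with torch.no_grad():
    updates = [torch.round((p_new - p_old) / step_size)
               for p_new, p_old in zip(encoder_vg.parameters(), shared_vg.parameters())]

# Decoder side: dequantize the received update values and apply them.
decoder_vg = copy.deepcopy(shared_vg)
with torch.no_grad():
    for p, u in zip(decoder_vg.parameters(), updates):
        p.add_(u * step_size)
```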
  • FIG.21 shows a layout of an apparatus 50 according to an example embodiment.
  • the apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or other lower power device. However, the embodiments of the examples described herein may be implemented within any electronic device or apparatus which may encode or decode multimedia content.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 further may comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analog signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analog audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera capable of recording or capturing images and/or video.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection.
  • Learned coding 60 may implement the examples described herein related to end-to-end learned coding via overfitting a latent generator.
  • FIG. 22 is a block diagram illustrating a system 2200 in accordance with several examples.
  • the encoder 2230 is used to encode an image or video from the scene 2215, and the encoder 2230 is implemented in a transmitting apparatus 2280.
  • the encoder 2230 produces a bitstream 2210 comprising signaling that is received by the receiving apparatus 2282, which implements a decoder 2240.
  • the encoder 2230 sends the bitstream 2210 that comprises the herein described signaling.
  • the decoder 2240 forms the image or video for the scene 2215-1, and the receiving apparatus 2282 would present this to the user, e.g., via a smartphone, television, or projector among many other options.
  • the transmitting apparatus 2280 and the receiving apparatus 2282 are at least partially within a common apparatus, and for example are located within a common housing 2250. In other examples the transmitting apparatus 2280 and the receiving apparatus 2282 are at least partially not within a common apparatus and have at least partially different housings. Therefore in some examples, the encoder 2230 and the decoder 2240 are at least partially within a common apparatus, and for example are located within a common housing 2250.
  • the common apparatus comprising the encoder 2230 and decoder 2240 implements a codec.
  • the encoder 2230 and the decoder 2240 are at least partially not within a common apparatus and have at least partially different housings, but when together still implement a codec.
  • 3D media from the capture (e.g., volumetric capture) of a viewpoint 2212 of the scene 2215, which includes a person 2213, is converted via projection to a series of 2D representations with occupancy, geometry, attributes and/or displacements.
  • Additional atlas information is also included in the bitstream to enable inverse reconstruction.
  • the received bitstream 2210 is separated into its components with atlas information; occupancy, geometry, displacement, and attribute 2D representations.
  • FIG. 23 is an example apparatus 2300, which may be implemented in hardware, configured to implement the examples described herein.
  • the apparatus 2300 comprises at least one processor 2302 (e.g., an FPGA and/or CPU), one or more memories 2304 including computer program code 2305, the computer program code 2305 having instructions to carry out the methods described herein, wherein the one or more memories 2304 and the computer program code 2305 are configured to, with the at least one processor 2302, cause the apparatus 2300 to implement circuitry, a process, component, module, or function (implemented with control module 2306) to implement the examples described herein.
  • Apparatus 2300 may be a smartphone, personal digital device or assistant, smart television, laptop, tablet, head-mounted display (HMD) or other user device or terminal device.
  • the memory 2304 may be a non-transitory memory, a transitory memory, a volatile memory (e.g. RAM), or a non-volatile memory (e.g., ROM).
  • Learned coding 2330 implements the examples described herein related to end-to- end learned coding via overfitting a latent generator.
  • the apparatus 2300 includes a display and/or I/O interface 2308, which includes user interface (UI) circuitry and elements, that may be used to display features or a status of the methods described herein (e.g., as one of the methods is being performed or at a subsequent time), or to receive input from a user such as with using a keypad, camera, touchscreen, touch area, microphone, biometric recognition, one or more sensors, etc.
  • the apparatus 2300 includes one or more communication e.g. network (N/W) interfaces (I/F(s)) 2310.
  • the communication I/F(s) 2310 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique including via one or more links 2324.
  • the communication I/F(s) 2310 may comprise one or more transmitters or one or more receivers.
  • the transceiver 2316 comprises one or more transmitters 2318 and one or more receivers 2320.
  • the transceiver 2316 and/or communication I/F(s) 2310 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitries and one or more antennas, such as antennas 2314 used for communication over wireless link 2326.
  • the control module 2306 of the apparatus 2300 comprises one of or both parts 2306-1 and/or 2306-2, which may be implemented in a number of ways.
  • the control module 2306 may be implemented in hardware as control module 2306-1, such as being implemented as part of the at least one processor 2302.
  • the control module 2306-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array.
  • the control module 2306 may be implemented as control module 2306-2, which is implemented as computer program code (having corresponding instructions) 2305 and is executed by the at least one processor 2302.
  • the one or more memories 2304 store instructions that, when executed by the at least one processor 2302, cause the apparatus 2300 to perform one or more of the operations as described herein.
  • the at least one processor 2302, one or more memories 2304, and example algorithms are means for causing performance of the operations described herein.
  • the apparatus 2300 to implement the functionality of control 2306 may correspond to any of the apparatuses depicted herein.
  • apparatus 2300 and its elements may not correspond to any of the other apparatuses depicted herein, as apparatus 2300 may be part of a self-organizing/optimizing network (SON) node or other node, such as a node in a cloud.
  • the apparatus 2300 may also be distributed throughout the network including within and between apparatus 2300 and any network element (such as a base station and/or terminal device and/or user equipment).
  • Interface 2312 enables data communication and signaling between the various items of apparatus 2300, as shown in FIG.23.
  • the interface 2312 may be one or more buses such as address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like.
  • Computer program code (e.g. instructions) 2305, including control 2306 may comprise object-oriented software configured to pass data or messages between objects within computer program code 2305.
  • FIG. 24 shows a schematic representation of non-volatile memory media 2400a (e.g. computer/compact disc (CD) or digital versatile disc (DVD)) and 2400b (e.g. universal serial bus (USB) memory stick) and 2400c (e.g.
  • FIG. 25 shows an encoder 2500 according to an embodiment.
  • Learned coding 2530 implements the examples described herein.
  • FIG.26 shows a decoder 2600 according to an embodiment.
  • FIG. 26 illustrates a predicted representation of an image block (P′n), a reconstructed prediction error signal (D′n), a preliminary reconstructed image (I′n), a final reconstructed image (R′n), an inverse transform (T⁻¹), an inverse quantization (Q⁻¹), an entropy decoding (E⁻¹), a reference frame memory (RFM), a prediction (either inter or intra) (P), and filtering (F).
  • Learned coding 2630 implements the examples described herein.
  • FIG. 27 is an example method 2700, based on the examples described herein.
  • the method includes executing a latent generator of a decoder using a first input to generate a generated output.
  • the method includes executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • Method 2700 may be performed with decoder 116, decode picture buffer block or circuit 122, decoder 204, VCM decoder 311, decoder 500, decoder 602, decoder 714, decoder 802, decoder 900, decoder 1010, decoder 1101, decoder 1201, decoder 1301, decoder 1401, decoder 1522, apparatus 50, receiving apparatus 2282 with decoder 2240, apparatus 2300, or decoder 2600.
  • FIG. 28 is an example method 2800, based on the examples described herein. At 2810, the method includes encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data.
  • the method includes providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Method 2800 may be performed with encoder 114, encoder parameter control block or circuit 108, encoder 202, VCM encoder 304, encoder 400, encoder 704, encoder 1002, encoder 1502, apparatus 50, transmitting apparatus 2280 with encoder 2230, apparatus 2300, or encoder 2500.
  • FIG.29 is an example method 2900, based on the examples described herein.
  • the method includes receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder.
  • the method includes decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Method 2900 may be performed with decoder 116, decode picture buffer block or circuit 122, VCM decoder 311, decoder 500, decoder 602, decoder 714, decoder 802, decoder 900, decoder 1010, decoder 1101, decoder 1201, decoder 1301, decoder 1401, decoder 1522, apparatus 50, receiving apparatus 2282 with decoder 2240, apparatus 2300, or decoder 2600.
  • FIG.30 is an example method 3000, based on the examples described herein.
  • the method includes determining one or more values of respective one or more parameters of one or more neural networks of a latent generator.
  • the method includes wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator.
  • the method includes encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values.
  • the method includes providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • Method 3000 may be performed with encoder 114, encoder parameter control block or circuit 108, encoder 202, VCM encoder 304, encoder 400, encoder 704, encoder 1002, encoder 1502, apparatus 50, transmitting apparatus 2280 with encoder 2230, apparatus 2300, or encoder 2500.
  • An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of a decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • the instructions when executed by the at least one processor, cause the apparatus at least to: receive one or more indicators of respective one or more portions of data to be decoded by the decoder; execute the latent generator using the first input to obtain the generated output, wherein the first input comprises the one or more indicators of respective one or more portions of data to be decoded by the decoder; and execute the media generator using the second input to obtain decoded one or more portions of data, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the decoded one or more portions of data, and wherein the decoded one or more portions of data correspond to or are indicated by the one or more indicators.
  • the instructions when executed by the at least one processor, cause the apparatus at least to: receive encoded spatial and/or temporal locations comprising information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image; decode the encoded spatial and/or temporal locations to obtain the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; wherein the first input to the latent generator comprises the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; execute the latent generator using the first input to obtain a generated tensor, wherein the generated output comprises the generated tensor; and execute the media generator using the second input to obtain one or more decoded pixels or decoded blocks, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the one or more decoded pixels or decoded blocks, and wherein the one or more decoded pixels or decoded blocks correspond to or are identified by the one or more spatial and/or temporal locations.
  • Example 4 The apparatus of example 3, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes.
  • Example 5 The apparatus of any of examples 1 to 4, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive a bitstream comprising a lossless encoding of an input latent tensor; and perform a lossless decoding of the bitstream to obtain a decoded input latent tensor; wherein the first input to the latent generator comprises the decoded input latent tensor.
  • Example 6.
  • Example 7 The apparatus of example 6, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 8 The apparatus of any of examples 1 to 7, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator, if it is available; wherein the first input to the latent generator comprises the signal derived from the one or more previous outputs of the latent generator, if it is available.
  • Example 9
  • the instructions when executed by the at least one processor, cause the apparatus at least to: execute, at a first iteration, the latent generator using the first input to generate a first generated latent tensor, wherein the generated output comprises the first generated latent tensor; execute, at the first iteration, the media generator using as the second input the first generated latent tensor to generate a first decoded frame, wherein the decoded media data comprise the first decoded frame; execute, at a second iteration subsequent to the first iteration, the latent generator to generate a second generated latent tensor, wherein the first input to the latent generator comprises the first generated latent tensor, and wherein the generated output comprises the second generated latent tensor; and execute, at the second iteration subsequent to the first iteration, the media generator to generate a second decoded frame that is subsequent to the first decoded frame, wherein the second input comprises the second generated latent tensor.
  • Example 10 The apparatus of example 9, wherein at the first iteration, the first input to the latent generator comprises a tensor with zero-valued elements.
  • Example 11 The apparatus of any of examples 9 to 10, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: store, in a buffer, the first generated latent tensor generated with the latent generator at the first iteration; and provide, from the buffer to the latent generator, the first generated latent tensor for the latent generator to use as part of the first input at the second iteration subsequent to the first iteration to generate the second generated latent tensor.
  • Example 12 The apparatus of any of examples 1 to 11, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: combine, using a combination operation, the generated output of the latent generator with a combination signal to generate a combined generated output; and execute the media generator using the second input to obtain the decoded media data, wherein the second input comprises the combined generated output.
  • Example 13 The apparatus of example 12, wherein: the generated output of the latent generator comprises a prediction, and the combination signal comprises a residual.
  • Example 14 The apparatus of example 13, wherein the residual comprises an output of a neural network.
  • Example 15
  • Example 16 The apparatus of any of examples 12 to 15, wherein: the generated output of the latent generator comprises a residual, and the combination signal comprises a prediction.
  • Example 17 The apparatus of any of examples 1 to 16, wherein the first input to the latent generator comprises a signal that indicates one or more viewpoints or viewing directions of an item of data. [0290] Example 18.
  • Example 19 The apparatus of any of examples 1 to 18, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: apply one or more neural networks of the latent generator during the execution of the latent generator using the first input to generate the generated output. [0292] Example 20.
  • Example 21 The apparatus of example 20, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: initialize one or more values of one or more parameters of the one or more neural networks of the latent generator based on a random or pseudo-random sampling of a probability distribution.
  • Example 22.
  • Example 23 The apparatus of any of examples 1 to 21, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if it is available; wherein the media generator is executed using the signal derived from the one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if it is available.
  • Example 24 The apparatus of any of examples 1 to 23, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute the media generator using the decoded media data generated with the media generator, to generate another item of decoded media data.
  • Example 25.
  • Example 26 The apparatus of example 25, wherein a first neural network and the one or more neural networks of the media generator are trained during an offline or development phase with one or more training iterations until a stopping criterion is satisfied, wherein at least one of the one or more training iterations performed during the offline or development phase comprises: executing the first neural network using an input to the first neural network to generate an output of the first neural network; executing the one or more neural networks of the media generator using as input the output of the first neural network, to generate an output of one or more neural networks of the media generator; computing a loss function based on the input to the first neural network and the output of the one or more neural networks of the media generator; and updating at least one value of at least one parameter of the first neural network and at least one value of at least one parameter of the one or more neural networks of the media generator, based at least on the loss function.
  • Example 27 The apparatus of example 26, wherein the one or more training iterations performed during the offline or development phase used to train the first neural network and the one or more neural networks of the media generator are performed before both of: executing the latent generator of the decoder using the first input to generate the generated output, and executing the media generator of the decoder using the second input comprising the generated output or the data derived from the generated output to produce the decoded media data.
  • Example 28 The apparatus of any of examples 1 to 27, wherein the decoded media data comprises decoded visual data or decoded audio data.
  • Example 29
  • the latent generator comprises one or more latent generator neural networks and the media generator comprises one or more media generator neural networks, wherein the one or more media generator neural networks are trained during an offline or development phase, and wherein one or more values or update values of respective one or more parameters of the one or more latent generator neural networks are received by the decoder during an online phase or inference phase and are assigned or applied to the respective one or more parameters of the one or more latent generator neural networks.
  • Example 30 An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Example 31 The apparatus of example 30, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image, based on an input; encode the information about one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain encoded spatial and/or temporal locations; and provide, to a decoder, the encoded spatial and/or temporal locations comprising information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image.
  • Example 32 The apparatus of example 31, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes.
  • Example 33 The apparatus of any of examples 30 to 32, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using the encoder neural network, information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image to generate a latent tensor; wherein the input comprising media data comprises the information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image, and the output comprising the encoded media data comprises the latent tensor; wherein the latent tensor comprises the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; and perform lossless encoding of the latent tensor comprising the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain a bitstream.
  • Example 34 The apparatus of any of examples 30 to 33, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • Example 35 The apparatus of example 34, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based on the computed loss function.
  • Example 36 The apparatus of example 35, wherein the ground-truth media data comprises an input to the encoder comprising media data.
  • Example 37 The apparatus of example 35, wherein the encoder comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder.
  • Example 38 The apparatus of any of examples 30 to 37, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode a latent tensor residual into a bitstream.
  • Example 39.
  • Example 40 An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: receive encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decode the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 41 The apparatus of example 40, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of the decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • Example 42 The apparatus of example 41, wherein the first input to the latent generator comprises the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 43.
  • Example 44 The apparatus of any of examples 40 to 43, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 45 An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • Example 46 The apparatus of example 45, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based on the computed loss function.
  • Example 47 The apparatus of example 46, wherein the apparatus comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder.
  • Example 48 The apparatus of any of examples 45 to 47, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Example 49 The apparatus of example 48, wherein the media data comprises visual data or audio data, and the encoded media data comprises encoded visual data or encoded audio data.
  • Example 50 A method including: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • Example 51 A method including: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Example 52 A method including: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 53 A method including: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • Example 54 An apparatus including: means for executing a latent generator of a decoder using a first input to generate a generated output; and means for executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • Example 55 An apparatus including: means for encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and means for providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Example 56 An apparatus including: means for receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and means for decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 57 An apparatus including: means for determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; means for encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and means for providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • Example 58 A computer readable medium including instructions stored thereon for performing at least the following: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
  • Example 59 A computer readable medium including instructions stored thereon for performing at least the following: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
  • Example 60 A computer readable medium including instructions stored thereon for performing at least the following: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
  • Example 61 A computer readable medium including instructions stored thereon for performing at least the following: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
  • References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures, such as single/multi-processor architectures and sequential/parallel architectures, but also specialized circuits such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), signal processing devices and other processing circuitry.
  • References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, etc.
  • The term non-transitory is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
  • The term circuitry may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry; (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s), or (ii) portions of processor(s)/software including digital signal processor(s), software, and one or more memories that work together to cause an apparatus to perform various functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even when the software or firmware is not physically present.
  • The term circuitry would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware.
  • The term circuitry would also cover, for example and when applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device. Circuitry or circuit may also be used to mean a function or a process used to execute a method.
  • H.263, H.264, H.265, H.266, H.274
  • HEVC: high efficiency video coding
  • HMD: head-mounted display
  • IBC: intra block copy
  • IEC: International Electrotechnical Commission
  • I/F: interface
  • Inv: inverse
  • I/O: input/output
  • ISO: International Organization for Standardization
  • ITU: International Telecommunication Union
  • ITU-T: ITU Telecommunication Standardization Sector
  • L0 norm: number of nonzero elements in the vector
  • L1 norm: sum of the absolute vector values
  • L2 norm: square root of the sum of the squared vector values
  • LG: latent generator
  • MAE: mean absolute error
  • MAML: model-agnostic meta learning
  • mAP: mean average precision
  • MC: motion compensation
  • ME: motion estimation
  • MPEG: moving picture experts group
  • MSE: mean squared error
  • MS-SSIM: multi-scale structural similarity
  • NAL: network abstraction layer
  • NN: neural network
  • NNC: neural network coding
  • N/W: network
  • par: parameter
  • pred: prediction
  • ReLU: rectified linear unit
  • res: residual
  • ROI: region of interest
  • ROM: read only memory
  • SEI: supplemental enhancement information
  • SGD: stochastic gradient descent
  • SON: self-organizing/optimizing network
  • SSIM: structural similarity
  • TV: television
  • UI: user interface
  • USB: universal serial bus
  • VCM: video coding for machines
  • VG: visual generator
  • VMAF: video multimethod assessment fusion
  • VSEI: versatile supplemental enhancement information
  • VVC: versatile video coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An example apparatus includes: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of a decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
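For illustration only, the following is a minimal sketch of the decoder-side flow summarized in the abstract (a latent generator followed by a media generator), together with the encoder-side overfitting of the latent generator suggested by the title. It assumes PyTorch; the module names (LatentGenerator, MediaGenerator), the layer shapes, the learning rate, the iteration count, and the use of a positional/temporal embedding as the first input are illustrative assumptions, not the claimed design.

```python
import torch
import torch.nn as nn

class LatentGenerator(nn.Module):
    """Hypothetical latent generator: maps a positional/temporal input to a latent tensor."""
    def __init__(self, in_dim=16, latent_channels=64, latent_hw=(8, 8)):
        super().__init__()
        self.latent_channels, self.latent_hw = latent_channels, latent_hw
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_channels * latent_hw[0] * latent_hw[1]),
        )

    def forward(self, pos):  # pos: (B, in_dim)
        return self.net(pos).view(-1, self.latent_channels, *self.latent_hw)

class MediaGenerator(nn.Module):
    """Hypothetical media generator (decoder NN): maps a latent tensor to an image."""
    def __init__(self, latent_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, z):
        return self.net(z)

latent_gen, media_gen = LatentGenerator(), MediaGenerator()
pos = torch.randn(1, 16)                  # stands in for spatial/temporal location information
ground_truth = torch.rand(1, 3, 32, 32)   # the media data to be coded

# Encoder-side overfitting (sketch): the media generator is assumed to be trained offline
# and kept frozen; only the small latent generator is finetuned so that the decoded output
# matches the ground-truth media. Its parameter values would then be encoded and signalled.
for p in media_gen.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(latent_gen.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = torch.mean((media_gen(latent_gen(pos)) - ground_truth) ** 2)
    loss.backward()
    opt.step()

# Decoder-side use: execute the latent generator on the first input, then execute the
# media generator on the generated output to obtain the decoded media data.
with torch.no_grad():
    decoded = media_gen(latent_gen(pos))  # (1, 3, 32, 32)
```

In such a scheme, the media generator can be shared in advance between encoder and decoder, so that only the comparatively small set of latent-generator parameter values (or updates to them) would need to be signalled.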

Description

END-TO-END LEARNED CODING VIA OVERFITTING A LATENT GENERATOR TECHNICAL FIELD [0001] The examples and non-limiting embodiments relate generally to multimedia transport and, more particularly, to end-to-end learned coding via overfitting a latent generator. BACKGROUND [0002] It is known to perform data compression and data decompression in a multimedia system. BRIEF DESCRIPTION OF THE DRAWINGS [0003] The foregoing embodiments and other features are explained in the following description, taken in connection with the accompanying drawings, wherein: [0004] FIG.1 illustrates an example of a modified video coding pipeline based on neural networks. [0005] FIG.2 shows an example of an end-to-end learned codec. [0006] FIG.3 shows a system pipeline for video coding for machine (VCM). [0007] FIG.4 illustrates an example of encoder-side operations. [0008] FIG.5 illustrates an example of decoder or receiver side operations. [0009] FIG.6 shows an example of a decoder. [0010] FIG.7 shows an example of an encoder and a decoder. [0011] FIG. 8 shows an example where an input to a latent generator (LG) comprises a latent tensor that was generated by the LG at a previous iteration. [0012] FIG. 9 shows an example where a generated latent tensor prediction is combined with a latent tensor residual using a combination operation. [0013] FIG. 10 shows an example where a latent tensor residual is derived from a signal received from an encoder. [0014] FIG.11 shows an example where a generated latent tensor residual that is an output of the LG is combined with a latent tensor prediction using a combination operation. [0015] FIG.12 shows an example where a latent tensor prediction is an output of a neural network. [0016] FIG.13 shows an example where an input to a visual generator (VG) comprises a generated latent tensor that was output by the LG at a previous iteration, if it is available, and a latent tensor that is output by the LG at a current iteration. [0017] FIG. 14 shows an example where at each iteration, an input to the VG comprises an output of the VG at a previous iteration, when it is available, and a generated latent tensor that is output by the LG at a current iteration. [0018] FIG. 15 shows an example of a codec where an encoder of the codec determines and encodes one or more spatial coordinates and one or more parameter values based on input data and where a decoder of the codec decodes and uses the one or more spatial coordinates and the one or more parameter values for decoding data. [0019] FIG.16 shows an example of determining one or more values of respective one or more parameters of the one or more neural networks of the LG of the decoder. [0020] FIG. 17 shows an example of training one or more neural networks in the VG of the decoder. [0021] FIG.18 shows an example of using a compression objective during the training of the VG of the decoder. [0022] FIG. 19 shows an example of training one or more neural networks (NNs) of the LG jointly with one or more NNs of the VG. [0023] FIG.20 shows an example where one or more NNs of the LG are trained after the one or more NNs of the VG have been trained and after a first NN has been trained. [0024] FIG. 21 shows schematically a user equipment suitable for employing embodiments of the examples described herein. [0025] FIG.22 is a block diagram illustrating a system in accordance with an example. [0026] FIG.23 is an example apparatus configured to implement the examples described herein. 
[0027] FIG.24 shows a representation of an example of non-volatile memory media used to store instructions that implement the examples described herein. [0028] FIG.25 shows an encoder according to an embodiment. [0029] FIG.26 shows a decoder according to an embodiment. [0030] FIG.27 is an example method, based on the examples described herein. [0031] FIG.28 is an example method, based on the examples described herein. [0032] FIG.29 is an example method, based on the examples described herein. [0033] FIG.30 is an example method, based on the examples described herein. DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS [0034] Fundamentals of neural networks [0035] A neural network (NN) may be described as a computation graph consisting of several layers of computation. Each layer may consist of one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may be associated with a weight. The weight may be used for scaling the signal passing through the associated connection. Weights are learnable parameters, i.e., values which can be learned from training data. There may be other learnable parameters, such as those of batch-normalization layers. [0036] In some neural networks, such as convolutional neural networks for image classification, initial layers (those close to the input data) extract semantically low-level features such as edges and textures in images, whereas intermediate layers extract more high-level features. After the feature extraction layers there may be one or more layers performing a certain task, such as classification, semantic segmentation, object detection, denoising, style transfer, super-resolution, etc. [0037] Neural networks are being utilized in an ever-increasing number of applications for many different types of devices, such as mobile phones. Examples include image and video analysis and processing, social media data analysis, device usage data analysis, etc. [0038] One property of neural nets (and other machine learning tools) is that they are able to learn properties from input data, e.g., in supervised way or in unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing the training signal. [0039] In general, the training algorithm consists of changing some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Training usually happens by minimizing or decreasing the output’s error, also referred to as the loss or loss function. Examples of losses are mean squared error, cross-entropy, etc. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network’s output, i.e., to gradually decrease the loss, by means of gradient descent technique. In one example, at each training iteration, gradients of the loss function with respect to one or more weights or parameters of the NN are computed, for example by backpropagation technique; the computed gradients are then used by an optimization routine, such as Adam or Stochastic Gradient Descent (SGD) to obtain an update to the one or more weights or parameters. 
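As a concrete illustration of the training iteration described above (forward pass, loss computation, backpropagation of gradients, and a weight update by an optimizer such as SGD or Adam), here is a minimal sketch assuming PyTorch; the toy model, the random data, the learning rate and the iteration count are arbitrary placeholders, and the held-out validation set anticipates the generalization discussion that follows.

```python
import torch
import torch.nn as nn

# Toy network and random data; sizes are illustrative only.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # or torch.optim.Adam(...)
loss_fn = nn.MSELoss()

x_train, y_train = torch.randn(64, 8), torch.randn(64, 1)
x_val, y_val = torch.randn(64, 8), torch.randn(64, 1)

for iteration in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)  # how far the output is from the desired output
    loss.backward()                          # gradients of the loss w.r.t. the weights (backpropagation)
    optimizer.step()                         # gradient-descent style update of the weights

with torch.no_grad():
    val_loss = loss_fn(model(x_val), y_val)  # monitored to detect underfitting or overfitting
```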
[0040] The terms “model”, “neural network”, “neural net” and “network” are used interchangeably herein, and also the weights of neural networks are sometimes referred to as learnable parameters or simply as parameters. [0041] Training a neural network is an optimization process, but the final goal may be different from the typical goal of optimization. In optimization, the only goal is to minimize a function. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to learn to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization. In practice, data is usually split into at least two sets, the training set and the validation set. The training set is used for training the network, i.e., to modify its learnable parameters in order to minimize the loss. The validation set is used for checking the performance of the network on data which was not used to minimize the loss, as an indication of the final performance of the model. In particular, the errors on the training set and on the validation set are monitored during the training process to understand the following things: [0042] --If the network is learning at all – in this case, the training set error should decrease, otherwise the model is in the regime of underfitting. [0043] --If the network is learning to generalize – in this case, also the validation set error needs to decrease and to be not too much higher than the training set error. If the training set error is low, but the validation set error is much higher than the training set error, or it does not decrease, or it even increases, the model may be in the regime of overfitting. This means that the model has just memorized the training set’s properties and performs well only on that set, but performs poorly on a set not used for tuning its parameters. [0044] Fundamentals of video/image coding [0045] Video codec consists of an encoder that transforms the input video into a compressed representation suited for storage/transmission and a decoder that can decompress the compressed video representation back into a viewable form. Typically encoder discards some information in the original video sequence in order to represent the video in a more compact form (that is, at lower bitrate). [0046] Typical hybrid video codecs, for example ITU-T H.263 and H.264, encode the video information in two phases. Firstly pixel values in a certain picture area (or “block”) are predicted for example by motion compensation means (finding and indicating an area in one of the previously coded video frames that corresponds closely to the block being coded) or by spatial means (using the pixel values around the block to be coded in a specified manner). Secondly the prediction error, i.e. the difference between the predicted block of pixels and the original block of pixels, is coded. This is typically done by transforming the difference in pixel values using a specified transform (e.g. Discrete Cosine Transform (DCT) or a variant of it), quantizing the coefficients and entropy coding the quantized coefficients. 
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (picture quality) and size of the resulting coded video representation (file size or transmission bitrate). [0047] Inter prediction, which may also be referred to as temporal prediction, motion compensation, or motion-compensated prediction, exploits temporal redundancy. In inter prediction the sources of prediction are previously decoded pictures (also known as reference pictures). [0048] In temporal inter prediction, the sources of prediction are previously decoded pictures in the same scalable layer. In intra block copy (IBC; also known as intra-block- copy prediction), prediction may be applied similarly to temporal inter prediction but the reference picture is the current picture and only previously decoded samples can be referred in the prediction process. Inter-layer or inter-view prediction may be applied similarly to temporal inter prediction, but the reference picture is a decoded picture from another scalable layer or from another view, respectively. In some cases, inter prediction may refer to temporal inter prediction only, while in other cases inter prediction may refer collectively to temporal inter prediction and any of intra block copy, inter-layer prediction, and inter- view prediction provided that they are performed with the same or similar process than temporal prediction. Inter prediction, temporal inter prediction, or temporal prediction may sometimes be referred to as motion compensation or motion-compensated prediction. [0049] Intra prediction utilizes the fact that adjacent pixels within the same picture are likely to be correlated. Intra prediction can be performed in spatial or transform domain, i.e., either sample values or transform coefficients can be predicted. Intra prediction is typically exploited in intra coding, where no inter prediction is applied. [0050] One outcome of the coding procedure is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are predicted first from spatially or temporally neighboring parameters. For example, a motion vector may be predicted from spatially adjacent motion vectors and only the difference relative to the motion vector predictor may be coded. Prediction of coding parameters and intra prediction may be collectively referred to as in-picture prediction. [0051] The decoder reconstructs the output video by applying prediction means similar to the encoder to form a predicted representation of the pixel blocks (using the motion or spatial information created by the encoder and stored in the compressed representation) and prediction error decoding (inverse operation of the prediction error coding recovering the quantized prediction error signal in spatial pixel domain). After applying prediction and prediction error decoding means the decoder sums up the prediction and prediction error signals (pixel values) to form the output video frame. The decoder (and encoder) can also apply additional filtering means to improve the quality of the output video before passing it for display and/or storing it as prediction reference for the forthcoming frames in the video sequence. [0052] In typical video codecs the motion information is indicated with motion vectors associated with each motion compensated image block. 
Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder side) or decoded (in the decoder side) and the prediction source block in one of the previously coded or decoded pictures. In order to represent motion vectors efficiently those are typically coded differentially with respect to block specific predicted motion vectors. In typical video codecs the predicted motion vectors are created in a predefined way, for example calculating the median of the encoded or decoded motion vectors of the adjacent blocks. Another way to create motion vector predictions is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures and signaling the chosen candidate as the motion vector predictor. In addition to predicting the motion vector values, the reference index of previously coded/decoded picture can be predicted. The reference index is typically predicted from adjacent blocks and/or or co- located blocks in temporal reference picture. Moreover, typical high efficiency video codecs employ an additional motion information coding/decoding mechanism, often called merging/merge mode, where all the motion field information, which includes motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without any modification/correction. Similarly, predicting the motion field information is carried out using the motion field information of adjacent blocks and/or co-located blocks in temporal reference pictures and the used motion field information is signaled among a list of motion field candidate list filled with motion field information of available adjacent/co-located blocks. [0053] In typical video codecs the prediction residual after motion compensation is first transformed with a transform kernel (like DCT) and then coded. The reason for this is that often there still exists some correlation among the residual and transform can in many cases help reduce this correlation and provide more efficient coding. [0054] Typical video encoders utilize Lagrangian cost functions to find optimal coding modes, e.g. the desired Macroblock mode and associated motion vectors. This kind of cost function uses a weighting factor λ to tie together the (exact or estimated) image distortion due to lossy coding methods and the (exact or estimated) amount of information that is required to represent the pixel values in an image area: C = D + λR (Equation 1) [0055] In the above equation, C is the Lagrangian cost to be minimized, D is the image distortion (e.g. Mean Squared Error) with the mode and motion vectors considered, and R the number of bits needed to represent the required data to reconstruct the image block in the decoder (including the amount of data to represent the candidate motion vectors). [0056] Video coding specifications may enable the use of supplemental enhancement information (SEI) messages or alike. Some video coding specifications include SEI NAL units, and some video coding specifications contain both prefix SEI NAL units and suffix SEI NAL units, where the former type can start a picture unit or alike and the latter type can end a picture unit or alike. 
An SEI NAL unit contains one or more SEI messages, which are not required for the decoding of output pictures but may assist in related processes, such as picture output timing, post-processing of decoded pictures, rendering, error detection, error concealment, and resource reservation. Several SEI messages are specified in H.264/AVC, H.265/HEVC, H.266/VVC, and H.274/VSEI standards, and the user data SEI messages enable organizations and companies to specify SEI messages for their own use. The standards may contain the syntax and semantics for the specified SEI messages but a process for handling the messages in the recipient might not be defined. Consequently, encoders may be required to follow the standard specifying a SEI message when they create SEI message(s), and decoders might not be required to process SEI messages for output order conformance. One of the reasons to include the syntax and semantics of SEI messages in standards is to allow different system specifications to interpret the supplemental information identically and hence interoperate. It is intended that system specifications can require the use of particular SEI messages both in the encoding end and in the decoding end, and additionally the process for handling particular SEI messages in the recipient can be specified. [0057] Information on neural network based image/video coding [0058] Recently, neural networks (NNs) have been used in the context of image and video compression, by following mainly two approaches. [0059] In a first approach, NNs are used to replace one or more of the components of a non-learned codec such as a VVC/H.266-compliant codec. Here, “non-learned” means those codecs whose components and their parameters are typically not learned from data by means of machine learning techniques. Examples of components that may be implemented as neural networks are: An in-loop filter, for example a NN that works as an additional in-loop filter with respect to the non-learned loop filters, or a NN that works as the only additional in-loop filter, thus replacing any other in-loop filter; Intra-frame prediction; Inter-frame prediction; Transform and/or inverse transform; Probability model for lossless coding; Etc. [0060] In a second approach, commonly referred to as “end-to-end learned compression” (or end-to-end learned codec), NNs are used as the main components of the image/video codecs. However, the codec may still comprise components which are not based on machine learning techniques. In this second approach, two design options are as follows: [0061] Option 1: re-use the non-learned video coding pipeline, but replace most or all the components with NNs, as shown in FIG.1. [0062] Referring to FIG. 1, it illustrates an example of modified video coding pipeline based on neural networks. An example of neural network may include, but is not limited, a compressed representation of a neural network. FIG. 1 is shown to include following components: [0063] – A neural transform block or circuit 102: this block or circuit transforms the output of a summation/subtraction operation 103 to a new representation of that data, which may have lower entropy and thus be more compressible. [0064] – A quantization block or circuit 104: this block or circuit quantizes an input data 101 to a smaller set of possible values. [0065] – An inverse transform and inverse quantization blocks or circuits 106. These blocks or circuits perform the inverse or approximately inverse operation of the transform and the quantization, respectively. 
[0066] – An encoder parameter control block or circuit 108. This block or circuit may control and optimize some or all the parameters of the encoding process, such as parameters of one or more of the encoding blocks or circuits. [0067] – An entropy coding block or circuit 110. This block or circuit may perform lossless coding, for example, based on entropy. One popular entropy coding technique is arithmetic coding. [0068] – A neural intra-codec block or circuit 112. This block or circuit may be an image compression and decompression block or circuit, which may be used to encode and decode an intra frame. An encoder 114 may be an encoder block or circuit, such as the neural encoder part of an auto-encoder neural network. A decoder 116 may be a decoder block or circuit, such as the neural decoder part of an auto-encoder neural network. An intra-coding block or circuit 118 may be a block or circuit performing some intermediate steps between encoder and decoder, such as quantization, entropy encoding, entropy decoding, and/or inverse quantization. [0069] – A deep loop filter block or circuit 120. This block or circuit performs filtering of reconstructed data, in order to enhance it. [0070] – A decode picture buffer block or circuit 122. This block or circuit is a memory buffer, keeping the decoded frame, for example, reconstructed frames 124 and enhanced reference frames 126 to be used for inter prediction. [0071] – An inter-prediction block or circuit 128. This block or circuit performs inter- frame prediction, for example, predicts from frames, for example, frames 132, which are temporally nearby. An ME/MC 130 performs motion estimation and/or motion compensation, which are two key operations to be performed when performing inter-frame prediction. ME/MC stands for motion estimation / motion compensation. [0072] In this example (Option 1), the forward and inverse transforms were replaced with two neural networks. Also, the loop filter is a neural network. [0073] Option 2 (also referred to as end-to-end learned coding): re-design the whole pipeline as a neural network auto-encoder with a quantization and lossless coding in the middle part, as follows: [0074] – Encoder NN (also referred to as neural network based encoder, or NN encoder): may perform a non-linear transformation of the input. The output is typically referred to as latent tensor. [0075] – Quantization and lossless encoding of the encoder NN’s output. [0076] – Lossless decoding and dequantization. [0077] – Decoder NN (also referred to as neural network based decoder, or NN decoder): may perform a non-linear inverse transformation from dequantized latent tensor to a reconstructed input. [0078] It is to be understood that even in end-to-end learned approaches, there may be components which are not learned from data, such as the arithmetic codec. [0079] More information on option 2 is provided in the following description. [0080] Further information on neural network-based end-to-end learned video coding [0081] FIG. 2 illustrates an example neural network-based end-to-end learned coding, such as an end-to-end learned video or image coding system 200 . Even though some examples are provided with respect to coding images or videos, it is to be understood that other types of data may be coded in a similar way, such as audio, speech, text, features, etc. As shown in FIG. 2, a typical neural network-based end-to-end learned coding system contains an encoder 202 and a decoder 204. 
[0082] The encoder 202 comprises an encoder NN 206, a quantizer or quantization operation 208, a probability model 210, and a lossless encoder 212 (for example, an arithmetic encoder). The decoder 204 comprises a lossless decoder 214 (for example, an arithmetic decoder), a probability model 216, a dequantizer or dequantization 218, and a decoder NN 220. [0083] It is to be noted that the probability model 210 present at the encoder side and the probability model 216 present at the decoder side may be the same or substantially the same. For example, they may be two copies of the same probability model. [0084] The lossless encoder 212 and the lossless decoder 214 form a lossless codec 222. A lossless codec such as lossless codec 222 may be an entropy-based lossless codec. An example of a lossless codec is an arithmetic codec, such as context-adaptive binary arithmetic coding (CABAC). [0085] The encoder NN 206 and decoder NN 220 are typically two neural networks, or mainly comprise neural network components. [0086] The probability model (210, 216) may also be a neural network and/or comprise mainly neural network components, and may be referred to as a neural network based probability model or learned probability model. [0087] Sometimes, the term lossless codec may refer to a system that comprises also the probability model, in addition to, for example, an arithmetic encoder and an arithmetic decoder. [0088] The quantizer or quantization operation 208, the dequantizer or dequantization 218 and the lossless codec 222 are typically not based on neural network components, but they may potentially also comprise neural network components. [0089] The encoder NN 206 takes an input 224, which may comprise, for example, an image to be compressed. The encoder NN 206 outputs a latent tensor 226. In one example, the latent tensor 226 may be a 3D tensor, where the three dimensions of the latent tensor 226 represent a channel dimension, a vertical dimension (also sometimes referred to as the height dimension) and a horizontal dimension (also sometimes referred to as the width dimension). In another example, the latent tensor 226 may be a 4D tensor, where the four dimensions of the latent tensor 226 represent a sample dimension (also sometimes referred to as the batch dimension, which is the dimension along which different samples of data can be placed), a channel dimension, a vertical dimension (also sometimes referred to as the height dimension) and a horizontal dimension (also sometimes referred to as the width dimension). The latent tensor 226 is input to the quantization operation 208, obtaining a quantized latent tensor 228. The quantized latent tensor 228 is lossless-encoded into a bitstream 230 by the lossless encoder 212, based also on the output 232 of the probability model 210. In particular, the probability model takes as input at least part of the quantized latent tensor 228 and outputs 232 an estimate of a probability, or an estimate of a probability distribution, or an estimate of one or more parameters of a probability distribution, for one or more elements of the quantized latent tensor. The bitstream 230 represents an encoded or compressed version of the input 224. [0090] The bitstream 230 is lossless-decoded by the lossless decoder 214, also based on the output 234 of the probability model 216 present at the decoder side, obtaining a quantized latent tensor 236. The quantized latent tensor 236 is dequantized using the dequantizer or dequantization 218, obtaining a reconstructed latent tensor 238.
The reconstructed latent tensor 238 is input to the decoder NN 220, obtaining a reconstructed input 240, i.e., a reconstructed version of the input 224. The reconstructed input 240 may also be referred to as reconstructed data, or reconstruction, or decoded data, or decoded input, or decoded output, or decoded image, or decoded video, and the like. [0091] This is a simplified description of the end-to-end learned video or image coding system 200, and it is to be understood that more sophisticated designs or variations of this design are possible. [0092] It is to be understood that, in some implementations or in some embodiments described herein, the encoder 202 may comprise some or all of the components of the decoder 204, even if the some or all of the components of the decoder 204 are not shown as being part of the encoder 202 in FIG.2. [0093] The neural network components, or a subset of the neural network components, of an end-to-end learned codec (such as the end-to-end learned video or image coding system 200) may be trained by minimizing a rate-distortion loss function: L = D + λR (Equation 2) [0094] In Equation 2, D is a distortion loss term, R is a rate loss term, and λ is a weight that controls the balance between the two losses. The distortion loss term may also be referred to as the reconstruction loss term, or simply reconstruction loss. The rate loss term may be referred to simply as rate loss. [0095] The distortion loss term measures the quality of the reconstructed or decoded output, and may comprise (but may not be limited to) one or more of the following: mean square error (MSE); structural similarity (SSIM); MS-SSIM; losses derived from the use of a pretrained neural network, for example, error(f1, f2), where f1 and f2 are the features extracted by a pretrained neural network for the input data and the decoded data, respectively, and error() is an error or distance function, such as L1 norm or L2 norm; losses derived from the use of a neural network that is trained simultaneously with the end-to-end learned codec, for example an adversarial loss, which is the loss provided by a discriminator neural network that is trained adversarially with respect to the codec, following the settings in the context of Generative Adversarial Networks (GANs) and their variants; a loss that is related to a performance of one or more machine analysis tasks or to an estimated performance of one or more machine analysis tasks, where the one or more machine analysis tasks may comprise classification, object detection, image segmentation, instance segmentation, etc. In one example, the estimated performance of one or more machine analysis tasks may comprise a distortion computed based at least on a first set of features extracted from an output of the decoder and a second set of features extracted from respective ground truth data, where the first set of features and the second set of features are output by one or more layers of a pretrained feature-extraction neural network. [0096] Multiple distortion losses may be used and integrated into D, such as a weighted sum of MSE and SSIM. [0097] The rate loss term may be used to train the encoder NN to output a low-entropy latent tensor, or a latent tensor such that the quantized latent tensor has low entropy, or a latent tensor such that the probability distribution of the quantized latent tensor can be better estimated or predicted by the probability model.
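As a minimal sketch of how the rate-distortion loss of Equation 2 could be computed, assuming PyTorch: the decoded output, the probability model's likelihood estimates for the quantized latent tensor, and the value of λ below are placeholders; the distortion is plain MSE, and the rate is the estimated number of bits derived from the likelihoods.

```python
import torch

x = torch.rand(1, 3, 32, 32)      # original input
x_hat = torch.rand(1, 3, 32, 32)  # decoded output (placeholder)
# Per-element likelihoods of the quantized latent tensor, as estimated by the
# probability model (placeholder values, clamped away from zero for stability).
likelihoods = torch.rand(1, 64, 8, 8).clamp(min=1e-9)

distortion = torch.mean((x - x_hat) ** 2)   # D: here, mean square error
rate = -torch.sum(torch.log2(likelihoods))  # R: estimated number of bits for the quantized latent
lam = 0.01                                  # λ: trades off rate against distortion
rd_loss = distortion + lam * rate           # L = D + λR (Equation 2)
```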
[0098] The rate loss term may be used to train the probability model to better estimate or predict the probability distribution of the quantized latent tensor. [0099] Examples of the rate loss terms are the following: In one example, the rate loss term is derived from the output of the probability model, and it represents the estimated entropy of the quantized latent representation, which indicates the number of bits necessary to represent the quantized latent tensor; A sparsification loss, i.e., a loss that encourages the quantized latent tensor to comprise many zeros. Examples are L0 norm, L1 norm, L1 norm divided by L2 norm. [0100] In order to train the neural network components, or a subset of the neural network components, of an end-to-end learned codec, one or more reconstruction losses may be used, and one or more rate losses may be used. In one example the one or more reconstruction losses and/or one or more rate losses are combined by means of a weighted sum. Typically, the different loss terms are weighted using different weights, and these weights determine how the final system performs in terms of rate-distortion performance. For example, if more weight is given to the reconstruction losses with respect to the rate losses, the system may learn to compress less but to reconstruct with higher accuracy (as measured by a metric that correlates with the reconstruction losses). These weights are usually considered to be hyper- parameters of the training process, and may be set manually by the person designing the training process, or automatically for example by grid search or by using additional neural networks. [0101] In one case, the training process may be performed jointly with respect to the distortion loss D and the rate loss R. In another case, the training process may be performed in two alternating phases, where in a first phase only the distortion loss D may be used, and in a second phase only the rate loss R may be used. [0102] For lossless video/image compression, the system would comprise only the probability model and lossless encoder and lossless decoder. The loss function would comprise only the rate loss, since the distortion loss is always zero (i.e., no loss of information). [0103] As used herein, the inference phase, or inference stage, or inference time, or test time, are referred to as the phase when a neural network or a codec is used for its purpose, such as encoding and decoding an input image. [0104] Information on Video Coding for Machines (VCM) [0105] Reducing the distortion in image and video compression is often intended to increase human perceptual quality, as humans are considered to be the end users, i.e. consuming/watching the decoded images or videos. Recently, with the advent of machine learning, especially deep learning, there is a rising number of machines (i.e., autonomous agents) that analyze data independently from humans and that may even take decisions based on the analysis results without human intervention. Examples of such analysis are object detection, scene classification, semantic segmentation, video event detection, anomaly detection, pedestrian tracking, etc. For example, such analysis tasks may be performed by neural networks. [0106] It is likely that the device where the analysis takes place has multiple “machines” or neural networks (NNs). These multiple machines may be used in a certain combination which is for example determined by an orchestrator sub-system. 
The multiple machines may be used for example in succession, based on the output of the previously used machine, and/or in parallel. For example, a video may be analyzed by one machine (NN) for detecting pedestrians, by another machine (another NN) for detecting cars, and by another machine (another NN) for estimating the depth of all the pixels in the frames. [0107] Example use cases and applications are self-driving cars, video surveillance cameras and public safety, smart sensor networks, smart TV and smart advertisement, person re-identification, smart traffic monitoring, drones, etc. In addition to image and video data, automatic analysis and processing is increasingly being performed for other types of data, such as audio, speech, text. [0108] Compressing (and decompressing) data where the end user comprises machines (e.g., neural networks) is commonly referred to as compression or coding for machines. In the case of video data, it is referred to as video compression or coding for machines (VCM). [0109] Compressing for machines may differ from compressing for humans for example with respect to the algorithms and technology used in the codec, or the training losses used to train any neural network components of the codec, or the evaluation methodology of codecs. [0110] It is to be understood that, when considering the case of coding for machines, the terms “receiver-side” or “decoder-side” are used to refer to the physical or entity or device which contains one or more machines, and runs these one or more machines on some encoded and eventually decoded video representation which is encoded by another physical or entity or device, the “encoder-side device”. [0111] FIG.3 is a general illustration of a pipeline 300 of Video Coding for Machines. A VCM encoder 304 encodes the original video 302 into a bitstream 306. A bitrate 310 may be computed 308 from the bitstream 306, as a measure of the size of the bitstream. A VCM decoder 311 decodes the bitstream 306 that was produced by the VCM encoder 304. The output of the VCM decoder 311 is referred in the figure as “Decoded data for machines”, which may also be referred to as data 312. This data 312 may be considered as the decoded or reconstructed video. However, in some implementations of this pipeline 300, this data 312 may not have same or similar characteristics as the original video 302 which was input to the VCM encoder 304. For example, this data 312 may not be easily understandable by a human by simply rendering the data onto a screen. The data 312, which is the output of VCM decoder 311, is then input to one or more task neural networks. In the figure, for the sake of illustrating that there may be any number of task-NNs, there are three example task- NNs (object detection task NN 314, object segmentation task NN 316, object tracking task NN 318), and a non-specified one (Task-NN X 320). One goal of VCM may be to obtain a low bitrate while guaranteeing that the task-NNs (e.g., the object detection task NN 314, the object segmentation task NN 316, the object tracking task NN 318, the Task-NN X 320) still perform well in terms of the evaluation metric associated to each task. [0112] It is to be understood that, in some cases, the VCM decoder 311 may not be present. In one example, the machines are run directly on the bitstream 306. In some other cases, the VCM decoder 311 may comprise only a lossless decoding stage, and the lossless decoded data is provided as input to the machines. 
In yet some other cases, the VCM decoder 311 may comprise a lossless decoding stage followed by a dequantization operation, and the losslessly decoded and dequantized data is provided as input to the machines.
[0113] As shown in FIG. 3, the performance of the object detection task NN 314 is evaluated 322 to generate task performance 332, the performance of the object segmentation task NN 316 is evaluated 324 to generate task performance 334, the performance of object tracking task NN 318 is evaluated 326 to generate task performance 336, and the performance of task-NN X 320 is evaluated 328 to generate task performance 338.
[0114] When a video encoder, such as a H.266/VVC encoder, is used as a VCM encoder, one or more of the following approaches may be used to adapt the encoding to be suitable for machine analysis tasks:
[0115] --One or more regions of interest (ROIs) may be detected. An ROI detection method may be used. For example, ROI detection may be performed using a task NN, such as an object detection NN. In some cases, ROI boundaries of a group of pictures or an intra period may be spatially overlaid and rectangular areas may be formed to cover the ROI boundaries. The detected ROIs (or rectangular areas, likewise) may be used in one or more of the following ways:
o The quantization parameter (QP) may be adjusted spatially in a manner that ROIs are encoded using finer quantization step size(s) than other regions. For example, QP may be adjusted CTU-wise.
o The video is preprocessed to contain only the ROIs, while the other areas are replaced by one or more constant values or removed.
o The video is preprocessed so that the areas outside the ROIs are blurred or filtered.
o A grid is formed in a manner that a single grid cell covers an ROI. Grid rows or grid columns that contain no ROIs are downsampled as preprocessing prior to encoding.
[0116] --The quantization parameter of the highest temporal sublayer(s) is increased (i.e., coarser quantization is used) when compared to practices for human-watchable video.
[0117] --The original video is temporally downsampled as preprocessing prior to encoding. A frame rate upsampling method may be used as postprocessing subsequent to decoding, if machine analysis at the original frame rate is desired.
[0118] --A filter is used to preprocess the input to the encoder. The filter may be a machine learning based filter, such as a convolutional neural network.
[0119] It is to be understood that, in the context of video coding for machines, the terms “machine vision”, “machine vision task”, “machine task”, “machine analysis”, “machine analysis task”, “computer vision”, “computer vision task”, “task network” and “task” may be used interchangeably.
[0120] Also, it is to be understood that, in the context of video coding for machines, the terms “machine consumption” and “machine analysis” may be used interchangeably.
[0121] Neural network based filtering
[0122] A neural network may be used for filtering or processing input data. Such a neural network may be referred to as a neural network based filter, or simply NN filter. A NN filter may comprise one or more neural networks, and/or one or more components that may not be categorized as neural networks.
[0123] The purpose of a NN filter may comprise (but may not be limited to) visual enhancement, colorization, upsampling, super-resolution, inpainting, temporal extrapolation, generating content, and the like.
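By way of a non-limiting illustration, the following is a minimal sketch of one possible NN filter of the kind described above, assuming a PyTorch-style implementation; the class name SimpleNNFilter, the layer count, the channel widths and the residual connection are illustrative assumptions rather than features of any particular embodiment. Auxiliary data, such as a QP plane, could be concatenated to the input as additional channels.

```python
# A minimal sketch of a convolutional NN filter (illustrative, not mandated by any embodiment).
import torch
import torch.nn as nn

class SimpleNNFilter(nn.Module):
    """Residual convolutional filter: output = input + learned correction."""
    def __init__(self, channels: int = 3, features: int = 32, num_layers: int = 4):
        super().__init__()
        layers = [nn.Conv2d(channels, features, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(features, features, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(features, channels, kernel_size=3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: reconstructed or decoded picture, shape (batch, channels, height, width)
        return x + self.body(x)  # predict a correction and add it to the input
```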
[0124] In some video codecs, a neural network may be used as filter in the encoding and decoding loop (also referred to simply as coding loop), and it may be referred to as neural network loop filter, or neural network in-loop filter. The NN loop filter may replace all other loop filters of an existing video codec, or may represent an additional loop filter with respect to the already present loop filters in an existing video codec. [0125] A neural network filter may be used as post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts. [0126] In one example, a codec is a modified VVC/H.266 compliant codec (e.g., a VVC/H.266 compliant codec that has been modified and thus it may not be compliant to the VVC/H.266) that comprises one or more NN loop filters. An input to the one or more NN loop filters may comprise at least a reconstructed block or frames (simply referred to as reconstruction) or data derived from a reconstructed block or frame (e.g., the output of a loop filter). The reconstruction may be obtained based on predicting a block or frame (e.g., by means of intra-frame prediction or inter-frame prediction) and performing residual compensation. The one or more NN loop filters may enhance the quality of at least one of their input, so that a rate-distortion loss is decreased. The rate may indicate a bitrate (estimate or real) of the encoded video. The distortion may indicate a pixel fidelity distortion such as the following: Mean-squared error (MSE); Mean absolute error (MAE); Mean Average Precision (mAP) computed based on the output of a task NN (such as an object detection NN) when the input is the output of the post-processing NN; Other machine task- related metric, for tasks such as object tracking, video activity classification, video anomaly detection, etc. [0127] The enhancement may result into a coding gain, which can be expressed for example in terms of BD-rate or BD-PSNR. [0128] A neural network filter may be used as post-processing filter for a codec, e.g., may be applied to an output of an image or video decoder in order to remove or reduce coding artifacts. In one example, the NN filter is used as a post-processing filter where the input comprises data that is output by or is derived from an output of a non-learned decoder, such as a decoder that is compliant with the VVC/H.266 standard. In another example, the NN filter is used as a post-processing filter where the input comprises data that is output by or is derived from an output of a decoder of an end-to-end learned decoder. [0129] Input to a NN filter [0130] In the case of filtering images, a filter may take as input at least one or more first images to be filtered and may output at least one or more second images, where the one or more second images are the filtered version of the one or more first images. In one example, the filter takes as input one image and outputs one image. In another example, the filter takes as input more than one image and outputs one image. In another example, the filter takes as input more than one image and outputs more than one image. [0131] It is to be understood that a filter may take as input also other data (also referred to as auxiliary data, or extra data) than the data that is to be filtered, such as data that can aid the filter to perform a better filtering than if no auxiliary data was provided as input. 
In one example, the auxiliary data comprises information about prediction data, and/or information about the picture type, and/or information about the slice type, and/or information about a Quantization Parameter (QP) used for encoding, and/or information about boundary strength, etc. In one example, the filter takes as input one image and other data associated to that image, such as information about the quantization parameter (QP) used for quantizing and/or dequantizing that image, and outputs one image. [0132] Information on overfitting a neural network filter [0133] A NN filter can be adapted at test time based at least on part of the data to be encoded and/or decoded and/or post-processed. [0134] Although, for simplicity, the case of a NN filter is being considered herein, similar adaptation may be performed for other coding tools and/or post-processing tools that are based on neural network technology. In one example, an adaptation or overfitting may be performed on a neural network based intra-frame prediction, or a neural network based inter- frame prediction, or a latent tensor generator, or a visual data generator, or a media generator, or a data generator, etc. [0135] Such operation may be referred to, for example, with one of the following terms, when their meaning is clear from the context: adaptation, content adaptation, overfitting, finetuning, optimization, specialization, and the like. [0136] The NN filter that results from the adaptation process may be referred to, for example, with one of the following terms: adapted filter, content-adapted filter, overfitted filter, finetuned filter, optimized filter, specialized filter, and the like. [0137] The overfitting process may be performed at encoder side based on a training process. The resulting overfitted filter is then used to derive an overfitting signal, or adaptation signal. The adaptation signal may be compressed and then signaled from the encoder to the decoder, in or along a bitstream that represents encoded data, such as an encoded image or video. FIG. 4 illustrates an example of such encoder-side operations of encoder 400. [0138] In this figure, ^^ 401 represents an input to the NN filter 402, ^^ 404 represents an output of the NN filter 402, ^ 405 represents a ground-truth data associated with ^^ 401, “Compute loss” 406 computes a training loss ^ 408 in order to overfit the NN filter 402, “Overfit” 410 uses the training loss ^ 408 to overfit the NN filter 402. As a result of the overfitting process 412, an overfitted NN filter 414 is obtained, which is used, together with the original NN filter 416, to derive 418 an adaptation signal 420. The adaptation signal 420 is compressed by using compression steps 422, and the compressed adaptation signal 424 is signaled 426 to a decoder or receiver. [0139] Referring to FIG. 5, at the decoder side including decoder 500, the overfitting signal, or a signal derived from the compressed adaptation signal 424 such as decompressed adaptation signal 506 resulting from decompression, by using decompression steps 504 of the compressed adaptation signal 424, is used to update by using the updating process 502 the original NN filter 416. The updated NN filter 510 is then used to filter one or more pictures, or one or more blocks. FIG. 5 thus illustrates an example of such decoder 500 or receiver side operations. 
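The following sketch illustrates, under stated assumptions, how the encoder-side overfitting of FIG. 4 and the decoder-side update of FIG. 5 might be realized. It assumes a PyTorch-style filter module; restricting the overfitting to the bias terms, the use of MSE as the training loss, the uniform quantization step and the function names are illustrative choices only, and the lossless coding of the quantized updates (e.g., with an MPEG NNC encoder) is omitted.

```python
# A hedged sketch of overfitting an NN filter at the encoder, deriving a weight-update
# adaptation signal, and applying it at the decoder. Names and hyperparameters are illustrative.
import copy
import torch

def overfit_filter(nn_filter, x, y, num_iters: int = 100, lr: float = 1e-3):
    """Overfit a copy of `nn_filter` on (x, y): x is the data to be filtered,
    y the associated ground-truth data. Returns the overfitted copy."""
    overfitted = copy.deepcopy(nn_filter)
    params = [p for n, p in overfitted.named_parameters() if n.endswith("bias")]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(num_iters):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(overfitted(x), y)  # distortion loss
        loss.backward()
        opt.step()
    return overfitted

def derive_adaptation_signal(original, overfitted, step: float = 0.01):
    """Weight update = overfitted minus original, uniformly quantized to integers."""
    deltas = {}
    orig_state = original.state_dict()
    for name, p in overfitted.state_dict().items():
        q = torch.round((p - orig_state[name]) / step)
        if torch.any(q != 0):          # keep only parameters that actually changed
            deltas[name] = q
    return deltas                      # to be losslessly coded and signaled to the decoder

def apply_adaptation_signal(original, deltas, step: float = 0.01):
    """Decoder side: add the dequantized updates to the original parameters."""
    state = original.state_dict()
    for name, q in deltas.items():
        state[name] = state[name] + q * step
    original.load_state_dict(state)
    return original                    # the updated NN filter
```

The sign convention in this sketch (update equal to the overfitted parameters minus the original parameters, applied by addition at the decoder) is one consistent choice; other conventions are possible as long as encoder and decoder agree.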
[0140] The NN filter (e.g., the overfitted NN filter 414) that is obtained from the overfitting process at encoder 400 may be different from the updated NN filter 510 that is obtained from the updating process 502 at decoder side. For example, one reason may be that the adaptation signal 420 may be compressed by using compression steps 422 in a lossy way. Thus, the former NN filter (e.g., the overfitted NN filter 414) may be referred to as overfitted filter or adapted filter (or other similar terms, see above), and the latter NN filter may be referred to as updated NN filter 510. [0141] Overfitting process performed at encoder side [0142] The adaptation process starts with an initial NN filter (e.g., the original NN filter 416, the NN filter 402). It is to be noted that, before the adaptation or overfitting process has started, the original NN filter 416 and the NN filter 402 may be same or substantially same. During the adaptation or overfitting process, the NN filter 402 may be modified, thus may become different from the original NN filter 416. In one example, the initial NN filter (e.g., the original NN filter 416, the NN filter 402) is a pretrained NN filter, which was pretrained during an offline stage on a sufficiently large dataset. In another example, the initial NN filter (e.g., the original NN filter 416, the NN filter 402) is a randomly initialized NN filter. [0143] In the adaptation, one or more parameters of the NN filter 402 may be adapted. Examples of such parameters may include (but may not be limited to) the following: The bias terms of a convolutional neural network; Multiplier parameters, that multiply one or more tensors produced by the NN filter 402, such as one or more feature tensors that are output by respective one or more layers of the NN filter 402; Parameters of the kernels of a convolutional neural network; Parameters of an adapter layer; One or more arrays or tensors that are used as input to respective one or more layers of the NN filter 402. [0144] The adaptation may be performed by means of a training process, e.g., by minimizing a loss function until a stopping criterion is met. The data used for this training process may comprise one or more pictures or blocks of input 401 to the NN filter 402 and associated respective one or more pictures or blocks of ground-truth data 405. In one example where the filter is an in-loop filter, the input to the NN filter 402 is reconstruction data, after prediction and residual compensation; the ground-truth data is the uncompressed data that is given as input to the encoder. In one example where the filter is a post-processing filter, the input to the NN filter 402 is decoded data (e.g., the output of a video decoder); the ground-truth data 405 is the uncompressed data that is given as input to the encoder. [0145] Thus, the “overfitting process” is an iterative process where, at each iteration, the NN filter 402 may change; so, at the beginning, NN filter 402 and original NN filter 416 are the same initial filter; then, with each overfitting iteration, NN filter 402 may change, eventually becoming the overfitted NN filter 414. [0146] The loss function (used with compute loss 406) used during the training process may comprise one or more distortion loss functions (also referred to as reconstruction loss functions) and zero or more rate loss functions. A rate loss function may measure, for example, the cost in terms of bitrate of signaling any adaptation signal, such as updates to the parameters of the NN filter. 
A distortion loss function may comprise one of MSE, MS-SSIM, VMAF, etc.
[0147] Deriving the adaptation signal
[0148] The adaptation signal 420 may be derived based on the overfitted NN filter 414 and on the original NN filter 416 (e.g., the NN filter 402 before the overfitting process).
[0149] In one example, the adaptation signal 420 comprises an update to one or more parameters of the NN filter (e.g., the overfitted NN filter 414, the original NN filter 416), e.g., a difference between the values of one or more parameters of the overfitted NN filter 414 and the values of respective one or more parameters of the original NN filter 416. Such an update may also be referred to as a weight update, or parameter update. Such an update may be computed, for example, by subtracting the values of the original parameters (i.e., the parameters of the original NN filter 416) from the corresponding values of the adapted parameters (i.e., the parameters of the overfitted NN filter 414), so that adding the update to the original parameters recovers the adapted values.
[0150] In another example, the adaptation signal 420 comprises the parameters (of the overfitted NN filter 414) that were adapted, also referred to as updated parameters, or adapted parameters, or adapted weights, or overfitted parameters, and the like.
[0151] Compression of adaptation signal
[0152] In order to keep the size of the adaptation signal low, the adaptation signal 420 may go through one or more compression steps 422, such as sparsification, quantization and lossless coding.
[0153] In one example, an encoder that compresses the adaptation signal into a bitstream that is compliant with a neural network compression standard, such as MPEG NNC (ISO/IEC 15938-17), may be used.
[0154] Signaling
[0155] The compressed adaptation signal 424 may be signaled 426 from encoder to decoder in or along a bitstream that represents encoded image or video data. In one example, the compressed adaptation signal 424 is signaled 426 in an Adaptation Parameter Set (APS) syntax structure of a video coding bitstream. In another example, the compressed adaptation signal 424 is signaled 426 in a Supplemental Enhancement Information (SEI) message of a video coding bitstream. Signaling 426 may also comprise other information which is associated with the compressed adaptation signal 424 and that may be required for correctly parsing and/or decompressing and/or using the compressed adaptation signal 424, such as any quantization parameters. It is to be understood that, in some embodiments or use cases, the compressed adaptation signal 424 may be the only signal or bitstream that is sent from an encoder to a decoder and may represent an encoded image or video.
[0156] Decoder or receiver side operations
[0157] At decoder side, the compressed adaptation signal 424 is received and decompressed 504. The decompressed adaptation signal 506 may then be used to update (e.g., by using the updating process 502) the original NN filter 416, resulting in the updated NN filter 510. In one example, where the adaptation signal (e.g., the compressed adaptation signal 424, the decompressed adaptation signal 506) comprises a weight update, where the weight update comprises one or more updates to respective one or more parameters of the original NN filter 416, the updating process 502 adds the one or more updates to the one or more parameters.
In another example, where the adaptation signal (e.g., the compressed adaptation signal 424, the decompressed adaptation signal 506) comprises one or more updated or adapted parameters, the updating process 502 replaces respective one or more parameters of the original NN filter 416 with the one or more updated or adapted parameters. Once the original NN filter 416 has been updated by using the updating process 502 based on the adaptation signal (e.g., the compressed adaptation signal 424, the decompressed adaptation signal 506), the updated NN filter 510 may be used for its purpose. For example, for filtering an input picture or an input block, or for decoding an image. [0158] Terminology [0159] The terms frame, picture and image may be used interchangeably. For example, the input and output to an end-to-end learned codec may be pictures. The input and output of a NN filter may be pictures. It is to be understood that also the term block, when it means a portion of a picture, may be simply referred to as frame or picture or image. In other words, at least some of the embodiments herein, even when described as applied to a picture, may be applicable also to a block, e.g., to a portion of a picture. [0160] The examples described herein address the problem of how to design and train a variation of an end-to-end learned codec that is efficient in terms of at least one of the following: decoding complexity, decoding speed, and/or rate-distortion cost. [0161] General information [0162] For the sake of simplicity, in at least some of the embodiments, image and video are considered as the data types. However, the herein described embodiments can be extended to other types of data such as audio. For simplicity, image and video data may be collectively referred to as visual data, and it is to be understood that visual data may refer to either image data or video data or both. In some embodiments, the term signal, data and tensor may be used interchangeably to indicate an input or an output. [0163] It is to be understood that, while some of the embodiments describe operations performed at an encoder side and some other embodiments describe operations performed at a decoder side, some or all of the decoder-side operations or components may be performed or available also at the encoder side. [0164] Embodiments on general architecture of decoder [0165] In one embodiment, referring to FIG. 6, a decoder 602 comprises at least a latent generator (LG) 604 and a visual generator (VG) 606, where the LG 604 is provided with an input 603, where an input to the VG comprises an output 607 of the LG 604 or a signal derived from an output 607 of the LG 604, and where an output ^^ 610 of the VG 606 is used to derive a final output, where the final output may be a decoded image or decoded video or decoded visual data. In one example, the final output is the output ^^ 610. [0166] In FIG.6, ^ 603 is an input to the LG 604, ^ 607 is an output of the LG 604 and is input to the VG 606, the output ^^ 610 is an output of the VG 606, such as a decoded picture or decoded video. [0167] It is to be understood that an encoder may comprise at least some of the components of the decoder, such as the LG and the VG. 
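A minimal sketch of such a decoder, assuming PyTorch-style modules, is given below; the concrete layer choices (a small convolutional LG and an upsampling convolutional VG) and the class names are illustrative assumptions, since the embodiments allow many alternative architectures (transformer blocks, U-Net style auto-encoders, and so on). The LG input is assumed to be at a reduced (latent) resolution.

```python
# A hedged sketch of the decoder structure of FIG. 6: a latent generator (LG) followed
# by a visual generator (VG). Architectures and names are illustrative assumptions.
import torch
import torch.nn as nn

class LatentGenerator(nn.Module):
    """LG: maps an input signal (e.g., coordinates, noise, or a decoded latent)
    to a generated latent tensor."""
    def __init__(self, in_channels: int = 2, latent_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 3, padding=1),
        )

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        return self.net(c)

class VisualGenerator(nn.Module):
    """VG: maps a (generated) latent tensor to decoded visual data,
    increasing spatial resolution with transpose convolutions."""
    def __init__(self, latent_channels: int = 64, out_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, padding=1),
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.net(y)

class Decoder(nn.Module):
    """Decoder comprising at least an LG and a VG: the LG output (or a signal derived
    from it) is the VG input; the VG output is the decoded picture."""
    def __init__(self):
        super().__init__()
        self.lg = LatentGenerator()
        self.vg = VisualGenerator()

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        y = self.lg(c)       # generated latent tensor
        return self.vg(y)    # decoded image / visual data
```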
[0168] In one embodiment, the LG comprises one or more neural networks, for example a neural network that comprises one or more (but not limited to) the following types of layers or blocks: convolutional layers, residual layers or blocks, ResNet layers or blocks, transformer layers or blocks, attention layers or blocks, or non-linear layers, such as Rectified Linear Units (ReLU). [0169] In one example, the LG comprises an auto-encoder architecture that comprises Transformer layers or blocks. In another example, the LG comprises a convolutional auto- encoder architecture, such as a U-Net architecture. In one embodiment, the VG comprises one or more neural networks, for example a neural network that comprises one or more (but not limited to) the following types of layers or blocks: convolutional layers, residual layers or blocks, ResNet layers or blocks, transformer layers or blocks, attention layers or blocks, non-linear layers, such as Rectified Linear Units (ReLU). [0170] In one example, the VG comprises a convolutional decoder-style neural network, such as a neural network that comprises convolutional layers and it increases the spatial and/or temporal resolution of its input signal, for example by using transpose convolutional layers or upsampling layers or interpolation layers or pixel-shuffle layers. [0171] Embodiments on input to LG [0172] In one embodiment, an input to the LG comprises one or more indicators of respective one or more portions of data to be generated by the LG. [0173] In one embodiment, an input to the LG comprises one or more indicators of respective one or more portions of data to be decoded by the decoder. [0174] In one embodiment, an input to the LG comprises a signal that indicates one or more spatial and/or temporal locations of a tensor that is generated by the LG. [0175] In one embodiment, an input to the LG comprises a signal that indicates one or more spatial and/or temporal locations of a data, such as of an image or video, that is decoded by the decoder. [0176] In one embodiment, an input to the LG comprises a signal that indicates one or more viewpoints (or viewing directions) of a tensor that is generated by the LG. [0177] In one embodiment, an input to the LG comprises a signal that indicates one or more viewpoints (or viewing directions) of a data, such as of a picture or video, that is decoded by the decoder. [0178] In one embodiment, an input to the LG comprises one or more spatial and/or temporal coordinates that indicate respective one or more spatial and/or temporal locations of a tensor that is generated by the LG. [0179] In one embodiment, an input to the LG comprises one or more spatial and/or temporal coordinates that indicate respective one or more spatial and/or temporal locations of a data, such as of an image or video, that is decoded by the decoder. [0180] In one example, an input to the LG comprises one or more coordinates of respective one or more pixels to be decoded by the decoder. [0181] In another example, an input to the LG comprises one or more coordinates or identifiers of respective one or more blocks to be decoded by the decoder, where a block comprises two or more pixels. [0182] In yet another example, an input to the LG comprises one or more identifiers of respective one or more pictures, such as frames of a video, to be decoded by the decoder. 
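The following sketch shows one possible way to construct such a coordinate-based input for the LG, assuming a PyTorch-style tensor layout; the channel ordering (vertical, horizontal, optional frame index) and the normalization of the coordinates to the range between 0 and 1 are illustrative assumptions.

```python
# A hedged sketch of building an LG input made of normalized spatial (and optionally
# temporal) coordinates for the samples to be generated. Names are illustrative.
import torch

def make_coordinate_input(height, width, frame_index=None, num_frames: int = 1):
    """Return a (1, C, height, width) tensor of coordinates for the LG, where C is 2
    (spatial only) or 3 (spatial plus a temporal coordinate)."""
    ys = torch.linspace(0.0, 1.0, height)             # normalized vertical positions
    xs = torch.linspace(0.0, 1.0, width)              # normalized horizontal positions
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    channels = [grid_y, grid_x]
    if frame_index is not None:                       # temporal location of the picture
        t = torch.full((height, width), frame_index / max(num_frames - 1, 1))
        channels.append(t)
    return torch.stack(channels, dim=0).unsqueeze(0)  # shape (1, C, H, W)

# Example: coordinates for frame 3 of a 30-frame video, at a latent resolution of 64x96.
coords = make_coordinate_input(64, 96, frame_index=3, num_frames=30)
```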
[0183] In one embodiment, an input to the LG comprises a noise signal, such as a tensor where the values of the elements of the tensor are randomly sampled based at least on a probability distribution.
[0184] In one embodiment, an input to the LG comprises a tensor that may be referred to as an input latent tensor. The input latent tensor may comprise an output of a neural network or a signal that is derived from an output of a neural network, where the neural network may be comprised in an encoder.
[0185] In one example, at an encoder side, a neural network outputs a latent tensor, where the latent tensor is then encoded in a lossy and/or lossless way; at decoder side, the encoded latent tensor is decoded and provided as input to the LG; any operations that may be needed to decode the encoded latent tensor may be considered to be part of the encoder. See FIG. 7 for an illustration of this example.
[0186] In FIG. 7, an input 702 is provided to an encoder 704 and to an encoder NN 706. An output of the encoder NN 706 is a latent tensor 708, which is lossless-encoded 710, optionally after a quantization operation (not shown in FIG. 7), obtaining a bitstream 712 which represents an output of the encoder 704. The bitstream 712 is input to a decoder 714 and to a lossless decoding process 716, obtaining a decoded latent tensor 718, optionally after a dequantization operation (not shown in FIG. 7). The decoded latent tensor 718 is input to a LG 720, obtaining a tensor 722. The tensor 722 is input to a VG 724, obtaining a decoded output 726, such as a decoded picture or decoded video, which also represents an output of the decoder 714.
[0187] In one embodiment, an input to the LG comprises one or more values of respective one or more parameters of the one or more neural networks comprised in the LG.
[0188] In one embodiment, the decoder may receive, from an encoder, one or more values of respective one or more parameters of the one or more neural networks of the LG, and may assign the received one or more values to the respective one or more parameters of the one or more neural networks of the LG.
[0189] In one embodiment, an input to the LG comprises one or more updates to respective one or more parameters of the one or more neural networks comprised in the LG.
[0190] In one embodiment, the decoder may receive, from an encoder, one or more updates to respective one or more parameters of the one or more neural networks of the LG, and may apply the one or more updates to the respective one or more parameters of the one or more neural networks of the LG.
[0191] In one embodiment, an input to the LG comprises one or more signals that are input to respective one or more layers or blocks of the one or more neural networks comprised in the LG.
[0192] In one embodiment, an input to the LG comprises one or more signals that modulate or modify respective one or more inputs and/or one or more outputs of respective one or more layers or blocks of the one or more neural networks comprised in the LG.
[0193] In one embodiment, an input to the LG comprises a signal that is derived from one or more previous outputs of the LG.
[0194] In one example, referring to FIG. 8, the decoder 802 is a video decoder. At each iteration, the decoder 802 decodes an encoded picture or an encoded frame of a video (804) into a decoded picture or a decoded frame of a video (e.g., the second decoded frame 814), respectively.
At each iteration, an input to the LG 806 may comprise a first generated latent tensor 808 that was generated by the LG 806 at a previous iteration, if it is available. At the i-th iteration, the LG 806 outputs a first generated latent tensor 808, which is provided as input to the VG 810, where the VG 810 outputs a first decoded frame. At the (i+1)-th iteration, an input to the LG 806 comprises the first generated latent tensor 808, and the LG 806 outputs a second generated latent tensor 812, which is provided as input to the VG 810, where the VG 810 outputs a second decoded frame 814. For the 0-th iteration (e.g., the first iteration), when there is no previous generated latent tensor available, the LG 806 may not take any previous generated latent tensor or may be provided a dummy previous generated latent tensor, such as a tensor with all zero-valued elements. FIG. 8 thus provides an illustration of this example.
[0195] In FIG. 8, “Delay tap” 816 may comprise, for example, a buffer that stores the previous output (from the i-th iteration) 808 of the LG 806 and provides it as an input to the LG 806 at the current (i+1)-th iteration.
[0196] It is to be understood that the phrase “one or more locations of data” or its variations, such as “one or more spatial and/or temporal locations of a tensor”, or “one or more spatial and/or temporal locations of a data”, may refer to, for example, one or more coordinates of respective one or more pixels of an image, or one or more indexes of respective one or more frames of a video, or one or more identifiers of respective one or more portions of a data (e.g., one or more identifiers of respective one or more blocks of an image), or one or more indicators of respective one or more portions of a data.
[0197] Embodiments on output of the LG
[0198] In one embodiment, an output of the LG is a latent tensor, such as a feature tensor or simply features. This latent tensor may be referred to as a generated tensor, generated output, generated latent tensor, or output latent tensor. The generated latent tensor may be input to the VG.
[0199] In one embodiment, an output of the LG may be combined with a combination signal by means of a combination operation, obtaining a combination output. The combination output may be an input to the VG. For example, the combination operation may be an element-wise sum. The combination signal may be, for example, an output of another neural network, or may be derived based on a signal received from an encoder.
[0200] In an alternative embodiment, referring to FIG. 9, an output of the LG 902 of the decoder 900 comprises a prediction of a latent tensor, also referred to as a generated latent tensor prediction 904. In an additional embodiment, when the output of the LG 902 comprises a generated latent tensor prediction 904, the generated latent tensor prediction 904 may be combined with a latent tensor residual 906 by means of a combination operation 908. For example, the combination operation 908 may comprise a summation.
[0201] In FIG. 9, 904 represents the generated latent tensor prediction, 906 represents the latent tensor residual, and 910 represents a latent tensor that is the result of combining the generated latent tensor prediction 904 and the latent tensor residual 906 by means of the combination operation 908, for example an element-wise summation.
[0202] In an additional embodiment, the latent tensor residual may be an output of a neural network.
In addition or alternatively, the latent tensor residual may be derived from a signal received from an encoder. In one example, an encoder determines an uncompressed latent tensor residual by subtracting a ground-truth latent tensor from a generated latent tensor prediction that is output by an LG, encodes the uncompressed latent tensor residual in a lossy and/or lossless way; at decoder side, the encoded latent tensor residual is decoded into a decoded latent tensor residual; the decoded latent tensor residual is then used as the latent tensor residual that is combined with the generated latent tensor prediction. FIG. 10 illustrates this example. Please note that some of the steps or operations performed by the encoder, such as determining the uncompressed latent tensor residual, are not illustrated in FIG.10 for simplicity. [0203] In FIG. 10, at encoder side, an uncompressed latent tensor residual ^^^^ 1004 is lossless-encoded 1006 into a bitstream ^ 1008. At decoder side, the bitstream ^ 1008 is decoded 1012 into a latent tensor residual ^^^^ 1014 which is summed 1016 to the generated latent tensor prediction 1018 generated with LG 1020. [0204] In a yet alternative embodiment, referring to FIG.11, an output of the LG 1102 of a decoder 1101 comprises a residual of a latent tensor, also referred to as generated latent tensor residual 1104. In an additional embodiment, when an output of the LG 1102 comprises a generated latent tensor residual 1104, the generated latent tensor residual 1104 may be combined with a latent tensor prediction 1106 by means of a combination operation 1108. For example, the combination operation 1108 may comprise a summation. FIG. 11 provides an illustration of this embodiment. [0205] In FIG. 11, ^^^^^ 1106 represents a latent tensor prediction, ^^^^ 1104 represents a generated latent tensor residual, ^ 1110 represents a latent tensor that is the result of combining the latent tensor prediction 1106 and the generated latent tensor residual 1104 by means of an element-wise summation (e.g., the combination operation 1108). [0206] In an additional embodiment, referring to FIG.12, the latent tensor prediction 1202 may be an output of a neural network, such as a NN based predictor 1204. FIG.12 illustrates this embodiment, where in the example shown in FIG. 12, the NN based predictor 1204 is part of decoder 1201. [0207] In one embodiment, an input to the NN based predictor 1204 may comprise, but may not be limited to, one or more signals that are derived from respective one or more of the following: a previous generated latent tensor; or a previous latent tensor that is the result of combining a previous generated latent tensor residual and a previous latent tensor prediction; or a previous decoded picture. [0208] Embodiments on input to the VG [0209] In one embodiment, an input to the VG is a signal or tensor that is the result of a combination of an output of the LG and of another signal such as a prediction signal or a residual signal. For example, an input to the VG is a tensor that is obtained as the result of an element-wise sum of a generated latent tensor prediction and a latent tensor residual. [0210] In one embodiment, an input to the VG comprises a signal that is derived from one or more previous outputs of the LG. [0211] In one example, referring to FIG.13, the decoder 1301 is a video decoder. 
At each iteration, the decoder 1301 decodes an encoded picture or an encoded frame of a video (1303) into a decoded picture or a decoded frame (e.g., a second decoded frame ^^^^^ 1312) of a video, respectively. At each iteration, an input to the VG 1310 comprises a generated latent tensor that was output by the LG 1302 at a previous iteration, when it is available, and a latent tensor that is output by the LG 1302 at the current iteration. For example, at the i-th iteration, the LG 1302 outputs a first generated latent tensor ^^^ 1306, that is provided as input to the VG 1310, where the VG 1310 outputs a first frame ^^^. At the (i+1)th iteration, the LG 1302 outputs a second generated latent tensor ^^^^^ and an input to the VG 1310 comprises the second generated latent tensor ^^^^^ 1308 and the first generated latent tensor ^^^ 1306. At (i+1)th iteration, an output of the VG 1310 comprises the second decoded frame ^^^^^ 1312. For the 0-th iteration (e.g., the first iteration), when there is no previous generated latent tensor available, the VG 1310 may not take any previous generated latent tensor or may be provided a dummy previous generated latent tensor, such as a tensor with all zero-valued elements. FIG.13 illustrates this example. [0212] In one embodiment, an input to the VG comprises a signal that is derived from one or more previous outputs of the VG. [0213] In one example, referring to FIG.14, the decoder 1401 is a video decoder. At each iteration, the decoder 1401 decodes an encoded picture or an encoded frame of a video (1402) into a decoded picture or a decoded frame (e.g., a second decoded frame ^^ ^^^1410) of a video, respectively. At each iteration, an input to the VG 1408 comprises an output of the VG 1408 at a previous iteration, when it is available, and a generated latent tensor that is output by the LG 1404 at the current iteration. For example, at the i-th iteration, the LG 1404 outputs a first generated latent tensor ^^^, that is provided as input to the VG 1408, where the VG 1408 outputs a first decoded frame ^^^ 1406. At the (i+1)th iteration, the LG 1404 outputs a second generated latent tensor ^^^^^ 1405. An input to the VG 1408 comprises the second generated latent tensor ^^ ^^^ 1405 and the first decoded frame ^^ ^ 1406. An output of the VG 1408 comprises a second decoded frame ^^^^^ 1410. For the 0-th iteration (e.g., the first iteration), when there is no previous decoded frame available, the VG 1408 may not take any previous decoded frame, or may be provided a dummy decoded frame, such as a tensor with all zero-valued elements. FIG. 14 illustrates this example. A delay tap 1412 stores the output (e.g., the first decoded frame ^^^1406, the second decoded frame ^^^^^1410) of the VG 1408 that is provided as input to the VG 1408 at a subsequent iteration. [0214] In an additional embodiment, an input to the LG may be used as an input to the VG. In one example, where an input to the LG comprises one or more spatial and/or temporal locations, the one or more spatial and/or temporal locations, or a signal derived from the one or more spatial and/or temporal locations, are/is provided as an input to the VG and represent respective spatial and/or temporal locations of a decoded picture or decoded video. [0215] In one embodiment, an input to the VG comprises one or more values of respective one or more parameters of the one or more neural networks comprised in the VG. 
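The following sketch illustrates the recurrent decoding pattern of FIG. 14 described above, under stated assumptions: the VG is assumed to accept the previous decoded frame as a second input (for example, concatenated internally with the generated latent), and a zero-valued tensor is used as the dummy previous frame at the first iteration; the function and argument names are illustrative.

```python
# A hedged sketch of recurrent video decoding with an LG and a VG, where the VG also
# receives the previous decoded frame (the "delay tap" of FIG. 14). Names are illustrative.
import torch

def decode_video(lg, vg, lg_inputs, frame_shape):
    """lg_inputs: a list of per-frame LG inputs (e.g., coordinate tensors or decoded
    latents); frame_shape: shape of a decoded frame, e.g. (1, 3, H, W)."""
    decoded_frames = []
    prev_frame = torch.zeros(frame_shape)   # dummy previous frame for the 0-th iteration
    for lg_in in lg_inputs:
        latent = lg(lg_in)                  # generated latent tensor for this frame
        frame = vg(latent, prev_frame)      # VG consumes the latent and the previous output
        decoded_frames.append(frame)
        prev_frame = frame.detach()         # delay tap: store for the next iteration
    return decoded_frames
```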
[0216] In one embodiment, an input to the VG comprises one or more updates to respective one or more parameters of the one or more neural networks comprised in the VG. [0217] In one embodiment, an input to the VG comprises one or more signals that are input to respective one or more layers or blocks of the one or more neural networks comprised in the VG. [0218] In one embodiment, an input to the VG comprises one or more signals that modulate or modify respective one or more inputs and/or one or more outputs of respective one or more layers or blocks of the one or more neural networks comprised in the VG. [0219] Embodiments on output of the VG [0220] In one embodiment, an output of the VG comprises a decoded data, such as a decoded image, or decoded video, or decoded audio, or decoded media data. [0221] In one embodiment, when an input to the LG and/or to the VG comprises one or more indicators of respective one or more portions of data to decode, an output of the VG comprises decoded one or more portions of data, and wherein the decoded one or more portions of data correspond to or are indicated by the one or more indicators. [0222] Additional embodiments on signaling features [0223] In an embodiment, the decoder receives from an encoder one or more indicators of respective one or more portions of data to decode, or a signal derived from one or more indicators of respective one or more portions of data to decode such as encoded or compressed one or more indicators, where the one or more indicators may be used as an input to the LG and/or the VG. When the decoder receives encoded or compressed one or more indicators, the decoder may first decode or decompress the encoded or compressed one or more indicators, obtaining decoded or decompressed one or more indicators, and the decoded or decompressed one or more indicators may be used as an input to the LG and/or the VG. [0224] In an embodiment, the decoder receives from an encoder one or more spatial and/or temporal locations or identifiers, or a signal derived from one or more spatial and/or temporal locations or identifiers, where the one or more spatial and/or temporal locations or identifiers, or the signal derived from the one or more spatial and/or temporal locations or identifiers are used as an input to the LG. [0225] Examples [0226] In one example, referring to FIG.15, an image codec 1500 comprises an encoder 1502 and a decoder 1522, where the decoder 1522 is based at least on some of the embodiments above, thus comprising at least a LG 1532 and a VG 1536, where the LG 1532 and the VG 1536 comprise a neural network. The encoder 1502 encodes a first information 1508 (e.g., by using a block “Spatial coordinates encoding” 1510) about one or more spatial locations of an image to be decoded by the decoder 1522, such as one or more spatial coordinates; this encoded first information may also be referred to as encoded spatial locations ^^^^ 1514. It is to be noted that the spatial coordinates may comprise normalized spatial coordinates, for example normalized to a range between 0 and 1. Also, the encoder 1502 encodes second information (e.g., by using a block “Parameter encoding” 1512)about one or more values ^ 1509 of respective one or more parameters of the neural network comprised in the LG 1532; this encoded second information may be referred to also as encoded parameter values ^^^^1516. The encoded spatial locations ^^^^ 1514 and the encoded parameter values ^^^^ 1516 are provided to the decoder 1522. 
The decoder 1522 decodes (e.g., by using a block “Spatial coordinates decoding” 1524 and a block “Parameter decoding” 1526) the encoded spatial locations 1514 and the encoded parameter values 1516, obtaining the one or more spatial locations or coordinates 1528 and the decoded one or more parameter values 1530. The decoded one or more parameter values 1530 are assigned to the respective parameters in the neural network comprised in the LG 1532. The spatial locations or coordinates 1528 are input to the LG 1532. The LG 1532 is executed or run, obtaining a generated latent tensor 1534. The generated latent tensor 1534 is input to the VG 1536. The VG 1536 is executed or run, obtaining a decoded image or one or more decoded pixels 1540.
[0227] The one or more values 1509 of respective one or more parameters of the neural network comprised in the LG 1532 are determined at encoder side based at least on a parameter determination process (e.g., by using a block “Determine LG parameters” 1506). In one example, the parameter determination process comprises performing a training process, or a finetuning process, or an overfitting process, on the neural network comprised in the LG 1532, based at least on the image(s) to be decoded by the decoder. FIG. 15 thus illustrates this example. It is to be understood that the encoder 1502 may also comprise some or all of the components of the decoder 1522, even if they are not illustrated in FIG. 15 to be part of the encoder 1502.
[0228] In FIG. 15, an image 1503 is input to a block “Determine spatial coordinates” 1504, which determines one or more spatial coordinates, represented in the figure by 1508, that identify respective one or more spatial locations of respective one or more pixels to be decoded. The one or more spatial coordinates are encoded by a block “Spatial coordinates encoding” 1510, which may perform, for example, lossless encoding, obtaining a first bitstream comprising the encoded spatial locations 1514. The first bitstream comprising the encoded spatial locations 1514 is decoded by a block “Spatial coordinates decoding” 1524, obtaining the decoded one or more spatial locations or coordinates 1528. The image 1503 is input to the block “Determine LG parameters” 1506, which determines one or more values of respective one or more parameters of a neural network in the LG 1532, where these one or more values are represented by 1509. The one or more values 1509 are encoded by a block “Parameter encoding” 1512, obtaining a second bitstream comprising the encoded parameter values 1516, where the encoding, by the block “Parameter encoding” 1512, may comprise a lossy encoding operation, such as a quantization operation, and a lossless encoding operation. The second bitstream comprising the encoded parameter values 1516 is decoded by a block “Parameter decoding” 1526, obtaining the decoded one or more parameter values 1530, where the decoding by the block “Parameter decoding” 1526 may comprise, for example, a lossless decoding operation and a dequantization operation. The decoded one or more parameter values 1530 are used as values for respective one or more parameters of the neural network in the LG 1532. In particular, the decoder 1522 assigns the decoded one or more parameter values 1530 to the respective one or more parameters of the neural network in the LG 1532.
The LG 1532 is executed or run on the input represented by the decoded one or more spatial locations or coordinates ^̂ 1528, obtaining a generated latent tensor ^ 1534, The generated latent tensor ^ 1534 is input to the VG 1536. The VG 1536 is run or executed, obtaining a decoded image or one or more decoded pixels ^^ 1540. [0229] It is to be noted that the block determine spatial coordinates 1504 may be considered either as part of the encoder 1502 or as a separate component with respect to the encoder 1502. It is to be noted that the block determine LG parameters 1506 of the LG parameters may be considered either as part of the encoder 1502 or as a separate component with respect to the encoder 1502. [0230] FIG.16 illustrates an example of determining one or more values of respective one or more parameters of the one or more neural networks of the LG 1602. I.e., FIG. 16 illustrates an example of block determine LG parameters 1506 of FIG.15. [0231] In FIG.16, one or more values of respective one or more parameters of the one or more neural networks of the LG 1602 are determined by means of a training or overfitting process 1600 that comprises one or more training or overfitting iterations. At each overfitting iteration, an input ^ 1601 is provided to the LG 1602, where the input ^ 1601 may comprise spatial coordinates (for example, vertical and horizontal spatial coordinates referring to the vertical and horizontal spatial coordinates of the pixels to be decoded by the decoder). The LG 1602 is run or executed, obtaining a generated latent tensor ^ 1604, which is input to the VG 1606. The VG 1606 is run or executed, obtaining a decoded image ^^ 1608. A loss function is computed 1610, based at least on the decoded image ^^ 1608 and on a ground-truth image ^ 1603 , where the ground-truth image may be, for example an uncompressed image that is to be encoded. The block 1612 uses the loss function to update the parameters of the neural networks of the LG 1602, and comprises determining updates and applying those updates to the parameters. The block 1612 uses the loss function to determine one or more updates to respective one or more parameters of the one or more neural networks of the LG, for example, by means of backpropagation. The one or more updates are used to update the respective one or more parameters of the one or more neural networks of the LG 1602, obtaining updated one or more values. After the overfitting process has ended, the one or more values of the respective one or more parameters of the one or more neural networks of the LG 1602 are the updated one or more values that were determined at one of the iterations of the overfitting process, such as at the last iteration of the overfitting process. The determined one or more values of respective one or more parameters may represent the output or result of the overfitting process, such as one or more values ^ 1509 in FIG.15. The encoder may include at least some of the components of the decoder, including the LG 1602 and/or the VG 1606, which components may be used during the determination of the update values 1614 of the one or more parameters of the one or more neural networks. [0232] Training features for the VG [0233] In one embodiment, the one or more neural networks in the VG are trained during an offline stage 1700, as follows. 
Herein, “offline stage” indicates a phase or time that occurs prior to the utilization of the encoder or decoder, or anyway prior to the time when the result of the training (e.g., the trained neural networks comprised in the VG) is to be used in the encoder or decoder. For simplicity, let us assume that the VG comprises one neural network, referred to as a second NN 1704 or second NN 1704. The second NN 1704 is trained together with a first NN 1702, also referred to as first NN 1702. An input 1701, such as a training image (or a batch of training images) from a training dataset, is provided to the first NN 1702. An output 1703 of the first NN 1702 is provided as input to the second NN 1704. An output 1706 from the second NN 1704 is obtained. A loss function is computed 1708 based on the input 1701 to the first NN 1702 and the output 1706 from the second NN 1704. The loss function is used to perform one iteration of training for the first NN 1702 and the second NN 1704, for example by means of gradient descent and backpropagation techniques. One or more training iterations are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied. Training provides an update 1710 to the first NN 1702 and an update 1712 to the second NN 1704. [0234] In an additional embodiment, a compression objective may be used during the training. In one example, the compression objective comprises a loss function computed based at least on the output of the first NN or based at least on data derived from the output of the first NN. In a further example, the loss function may estimate an entropy or cross- entropy of the output of the first NN. In a yet further example, the loss function may estimate a bitrate of a bitstream that represents the output of the first NN. In a yet further example, the loss function may be derived from an output of a learned probability model (another neural network that takes as input the output of the first NN or data derived from the output of the first NN). See FIG.18 for an example of this embodiment. [0235] In FIG. 18, a first loss function 1810 is computed based at least on an output ^^ 1808 of the second NN 1806 and on an input ^ 1802 to the first NN 1804; a second loss function 1812 is computed based at least on an output ^ 1805 of the first NN 1804. The first loss function 1810 and the second loss function 1812 are used to train 1814 the first NN 1804 via updated parameters 1816 and the second NN 1806 via updated parameters 1818. [0236] Training features of the LG and VG [0237] In one embodiment the VG is trained offline and the LG is overfitted online (and thus values of at least some of its parameters are determined by an encoder and sent to the decoder), where offline indicates a phase or time that occurs prior to the utilization of the encoder or decoder, or anyway prior to the time when the result of the training (e.g., the trained neural networks comprised in the VG) is used in the encoder or decoder for encoding a data item, and where online refers to a time when the codec is used for encoding/decoding an input data item such as an image, e.g., inference time or test time. A decoder comprises a LG and a VG, the VG is trained offline, the LG is overfitted/trained online, where the encoder determines values of at least some of the parameters of the LG and sends them to the decoder to be assigned to respective parameters of the LG. 
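The following sketch illustrates, under stated assumptions, the online overfitting of the LG against a frozen, offline-trained VG as described above and in FIG. 16; the optimizer, the iteration count, the MSE distortion loss and the function name overfit_lg are illustrative choices, and the subsequent quantization and lossless coding of the returned parameter values are omitted.

```python
# A hedged sketch of encoder-side overfitting of the LG with a frozen pretrained VG,
# so that VG(LG(coords)) approximates the uncompressed image to be encoded.
import torch

def overfit_lg(lg, vg, coords, ground_truth, num_iters: int = 2000, lr: float = 1e-3):
    """Return the overfitted LG parameter values to be encoded and signaled."""
    for p in vg.parameters():
        p.requires_grad_(False)                      # VG was trained offline; keep it fixed
    opt = torch.optim.Adam(lg.parameters(), lr=lr)
    for _ in range(num_iters):
        opt.zero_grad()
        decoded = vg(lg(coords))                     # run LG then VG, as at the decoder
        loss = torch.nn.functional.mse_loss(decoded, ground_truth)
        loss.backward()                              # backpropagation through VG into LG
        opt.step()
    # The values below would be quantized, losslessly coded and sent to the decoder,
    # which assigns them to the corresponding LG parameters before decoding.
    return {name: p.detach().clone() for name, p in lg.named_parameters()}
```

In this arrangement only the overfitted LG parameters are signaled per content item, while the VG is shared between encoder and decoder as the outcome of the offline training.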
[0238] Training features for the LG [0239] In one embodiment, the one or more neural networks of the LG are not trained or pretrained in an offline stage. In one example, they are randomly initialized, e.g., the values of the parameters of the one or more neural networks of the LG are initialized according to a (pseudo-)random sampling of a probability distribution. [0240] In an alternative embodiment, the one or more neural networks of the LG are trained or pretrained in an offline stage. [0241] In one example, referring to FIG. 19, the one or more NNs of the LG are trained jointly with the one or more NNs of the VG. In this example, for simplicity, let us assume that the VG comprises one neural network, referred to as a second NN 1910 or second NN 1910; also, let us assume that the LG comprises one neural network, referred to as third NN 1908. The second NN 1910 and third NN 1908 are trained together with a first NN 1906, also referred to as first NN 1906. An input 1904, such as a training image (or batch of training images) from a training dataset, is provided to the first NN 1906. An output 1907 of the first NN 1906 is provided as input to the third NN 1908. An output 1909 from the third NN 1908 is obtained. The output 1909 of the third NN 1908 is provided as input to the second NN 1910. An output 1912 from the second NN 1910 is obtained. A loss function is computed 1914 based on the input 1904 to the first NN 1906 and the output 1912 from the second NN 1910. The loss function is used to perform one iteration of training 1916 for the first NN 1906, the second NN 1910 and the third NN 1908, for example by means of gradient descent and backpropagation techniques. One or more training iterations are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied. Training 1916 provides an update 1918 to the first NN 1906, an update 1920 to the second NN 1910, and an update 1922 to the third NN 1908. [0242] In another example, referring to FIG.20, the one or more NNs of the LG are trained after the one or more NNs of the VG have been trained and after a first NN has been trained. In this example, for simplicity, let us assume that the VG comprises one neural network, referred to as a pretrained second NN 2010 and the LG comprises one neural network, referred to as third NN 2008. As the pretrained first NN 2006 and the pretrained second NN 2010 have been trained, they may be referred to as pretrained first NN 2006 and pretrained second NN 2010, respectively. An input 2004, such as a training image (or batch of training images) from a training dataset, is provided to the pretrained first NN 2006. An output 2007 of the pretrained first NN 2006 is provided as input to the third NN 2008. An output 2009 from the third NN 2008 is obtained. The output 2009 of the third NN 2008 is provided as input to the pretrained second NN 2010. An output 2012 from the pretrained second NN 2010 is obtained. A loss function 2014 is computed based on the input 2004 to the pretrained first NN 2006 and the output 2012 from the pretrained second NN 2010. The loss function 2014 is used to perform one iteration of training 2016 for the third NN 2008, for example by means of gradient descent and backpropagation techniques. One or more iterations of training 2016 are performed by using one or more training images (or batches of training images) in the training dataset, until a stopping criterion is satisfied. 
The one or more iterations of training 2016 is used to provide an update 2022 to third NN 2008 of the LG. [0243] In yet another example, first the one or more NNs of the VG are trained, then the one or more NNs of the LG are trained jointly with a finetuning of the one or more NNs of the VG. [0244] In one embodiment, the one or more neural networks of the LG and, optionally, the one or more neural networks in the VG, are trained or pretrained in an offline stage by means of a meta-learning algorithm, such as an algorithm that is based on the Model- Agnostic Meta Learning (MAML) algorithm. In one example, a training iteration comprises one or more inner training iterations and an outer training iteration, wherein an inner training iteration comprises overfitting the NNs in the LG based on an image and computing an overfitting performance score, and wherein the outer training iteration comprises training at least the NNs in the LG and, optionally, the NNs in the VG, based at least on a loss function that is computed based at least on one or more performance scores computed during the respective one or more inner training iterations. [0245] Finetuning features (for LG and VG) [0246] In an embodiment, where the one or more NNs in the LG have been trained or pretrained in an offline stage, the one or more NNs in the LG may be finetuned at test time. An encoder determines one or more update values for respective one or more parameters of the one or more neural networks in the LG, encodes them and sends the encoded one or more update values to the decoder. At decoder side, the encoded one or more update values are decoded, obtaining decoded one or more update values. The decoded one or more update values are used to update the respective one or more parameters of the one or more neural networks of the LG. [0247] In an embodiment, the one or more NNs in the VG may be finetuned at test time. An encoder determines one or more update values for respective one or more parameters of the one or more neural networks in the VG, encodes them and sends the encoded one or more update values to the decoder. At decoder side, the encoded one or more update values are decoded, obtaining decoded one or more update values. The decoded one or more update values are used to update the respective one or more parameters of the one or more neural networks of the VG. [0248] FIG.21 shows a layout of an apparatus 50 according to an example embodiment. The apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or other lower power device. However, the embodiments of the examples described herein may be implemented within any electronic device or apparatus which may encode or decode multimedia content. [0249] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the examples described herein the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the examples described herein any suitable data or user interface mechanism may be employed. For example the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. [0250] The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analog signal input. 
[0248] FIG. 21 shows a layout of an apparatus 50 according to an example embodiment. The apparatus 50 may for example be a mobile terminal or user equipment of a wireless communication system, a sensor device, a tag, or other low-power device. However, the embodiments of the examples described herein may be implemented within any electronic device or apparatus which may encode or decode multimedia content. [0249] The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the examples described herein the display may be any display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the examples described herein any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. [0250] The apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analog signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the examples described herein may be any one of: an earpiece 38, speaker, or an analog audio or digital audio output connection. The apparatus 50 may also comprise a battery (or in other embodiments of the examples described herein the device may be powered by any suitable mobile energy device such as a solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera capable of recording or capturing images and/or video. The apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection. [0251] As shown in FIG. 21, learned coding 60 may implement the examples described herein related to end-to-end learned coding via overfitting a latent generator. [0252] FIG. 22 is a block diagram illustrating a system 2200 in accordance with several examples. In an example, the encoder 2230 is used to encode an image or video from the scene 2215, and the encoder 2230 is implemented in a transmitting apparatus 2280. The encoder 2230 produces a bitstream 2210 comprising signaling that is received by the receiving apparatus 2282, which implements a decoder 2240. The encoder 2230 sends the bitstream 2210 that comprises the herein described signaling. The decoder 2240 forms the image or video for the scene 2215-1, and the receiving apparatus 2282 would present this to the user, e.g., via a smartphone, television, or projector among many other options. [0253] In some examples, the transmitting apparatus 2280 and the receiving apparatus 2282 are at least partially within a common apparatus, and for example are located within a common housing 2250. In other examples the transmitting apparatus 2280 and the receiving apparatus 2282 are at least partially not within a common apparatus and have at least partially different housings. Therefore in some examples, the encoder 2230 and the decoder 2240 are at least partially within a common apparatus, and for example are located within a common housing 2250. For example, the common apparatus comprising the encoder 2230 and the decoder 2240 implements a codec. In other examples the encoder 2230 and the decoder 2240 are at least partially not within a common apparatus and have at least partially different housings, but when together still implement a codec. [0254] In some examples, 3D media from the capture (e.g., volumetric capture) at a viewpoint 2212 of the scene 2215 (which includes a person 2213) is converted via projection to a series of 2D representations with occupancy, geometry, attributes and/or displacements. Additional atlas information is also included in the bitstream to enable inverse reconstruction. For decoding, the received bitstream 2210 is separated into its components with atlas information; occupancy, geometry, displacement, and attribute 2D representations. A 3D reconstruction is performed to reconstruct the scene 2215-1 created looking at the viewpoint 2212-1 with a “reconstructed” person 2213-1. The “-1” suffixes are used to indicate that these are reconstructions of the original. As indicated at 2220, the decoder 2240 performs an action or actions based on the received signaling. [0255] Encoding 2290 performs encoder-side learning coding based on the examples described herein.
Decoding 2292 performs decoder-side learning coding, based on the examples described herein. [0256] FIG. 23 is an example apparatus 2300, which may be implemented in hardware, configured to implement the examples described herein. The apparatus 2300 comprises at least one processor 2302 (e.g., an FPGA and/or CPU), one or more memories 2304 including computer program code 2305, the computer program code 2305 having instructions to carry out the methods described herein, wherein the one or more memories 2304 and the computer program code 2305 are configured to, with the at least one processor 2302, cause the apparatus 2300 to implement circuitry, a process, component, module, or function (implemented with control module 2306) to implement the examples described herein. [0257] Apparatus 2300 may be a smartphone, personal digital device or assistant, smart television, laptop, tablet, head-mounted display (HMD) or other user device or terminal device. The memory 2304 may be a non-transitory memory, a transitory memory, a volatile memory (e.g. RAM), or a non-volatile memory (e.g., ROM). [0258] Learned coding 2330 implements the examples described herein related to end-to- end learned coding via overfitting a latent generator. [0259] The apparatus 2300 includes a display and/or I/O interface 2308, which includes user interface (UI) circuitry and elements, that may be used to display features or a status of the methods described herein (e.g., as one of the methods is being performed or at a subsequent time), or to receive input from a user such as with using a keypad, camera, touchscreen, touch area, microphone, biometric recognition, one or more sensors, etc. The apparatus 2300 includes one or more communication e.g. network (N/W) interfaces (I/F(s)) 2310. The communication I/F(s) 2310 may be wired and/or wireless and communicate over the Internet/other network(s) via any communication technique including via one or more links 2324. The communication I/F(s) 2310 may comprise one or more transmitters or one or more receivers. [0260] The transceiver 2316 comprises one or more transmitters 2318 and one or more receivers 2320. The transceiver 2316 and/or communication I/F(s) 2310 may comprise standard well-known components such as an amplifier, filter, frequency-converter, (de)modulator, and encoder/decoder circuitries and one or more antennas, such as antennas 2314 used for communication over wireless link 2326. [0261] The control module 2306 of the apparatus 2300 comprises one of or both parts 2306-1 and/or 2306-2, which may be implemented in a number of ways. The control module 2306 may be implemented in hardware as control module 2306-1, such as being implemented as part of the at least one processor 2302. The control module 2306-1 may be implemented also as an integrated circuit or through other hardware such as a programmable gate array. In another example, the control module 2306 may be implemented as control module 2306-2, which is implemented as computer program code (having corresponding instructions) 2305 and is executed by the at least one processor 2302. For instance, the one or more memories 2304 store instructions that, when executed by the at least one processor 2302, cause the apparatus 2300 to perform one or more of the operations as described herein. 
Furthermore, the at least one processor 2302, one or more memories 2304, and example algorithms (e.g., as flowcharts and/or signaling diagrams), encoded as instructions, programs, or code, are means for causing performance of the operations described herein. [0262] The apparatus 2300 to implement the functionality of control 2306 may correspond to any of the apparatuses depicted herein. Alternatively, apparatus 2300 and its elements may not correspond to any of the other apparatuses depicted herein, as apparatus 2300 may be part of a self-organizing/optimizing network (SON) node or other node, such as a node in a cloud. [0263] The apparatus 2300 may also be distributed throughout the network including within and between apparatus 2300 and any network element (such as a base station and/or terminal device and/or user equipment). [0264] Interface 2312 enables data communication and signaling between the various items of apparatus 2300, as shown in FIG.23. For example, the interface 2312 may be one or more buses such as address, data, or control buses, and may include any interconnection mechanism, such as a series of lines on a motherboard or integrated circuit, fiber optics or other optical communication equipment, and the like. Computer program code (e.g. instructions) 2305, including control 2306 may comprise object-oriented software configured to pass data or messages between objects within computer program code 2305. Computer program code (e.g. instructions) 2305, including control 2306 may comprise procedural, functional, or scripting code. The apparatus 2300 need not comprise each of the features mentioned, or may comprise other features as well. The various components of apparatus 2300 may at least partially reside in a common housing 2328, or a subset of the various components of apparatus 2300 may at least partially be located in different housings, which different housings may include common housing 2328. [0265] FIG. 24 shows a schematic representation of non-volatile memory media 2400a (e.g. computer/compact disc (CD) or digital versatile disc (DVD)) and 2400b (e.g. universal serial bus (USB) memory stick) and 2400c (e.g. cloud storage for downloading instructions and/or parameters 2402 or receiving emailed instructions and/or parameters 2402) storing instructions and/or parameters 2402 which when executed by a processor allows the processor to perform one or more of the operations of the methods described herein. Instructions and/or parameters 2402 may represent or correspond to a non-transitory computer readable medium. [0266] FIG. 25 shows an encoder 2500 according to an embodiment. FIG. 25 illustrates an image to be encoded (In), a predicted representation of an image block (P′n), a prediction error signal (Dn), a reconstructed prediction error signal (D′n), a preliminary reconstructed image (I′n), a final reconstructed image (R′n), a transform (T) and inverse transform (T−1), a quantization (Q) and inverse quantization (Q−1), entropy encoding (E), a reference frame memory (RFM), inter prediction (Pinter), intra prediction (Pintra), mode selection (MS) and filtering (F). Learned coding 2530 implements the examples described herein. [0267] FIG.26 shows a decoder 2600 according to an embodiment. 
FIG.26 illustrates a predicted representation of an image block (P′n), a reconstructed prediction error signal (D′n), a preliminary reconstructed image (I′n), a final reconstructed image (R′n), an inverse transform (T−1), an inverse quantization (Q−1), an entropy decoding (E1), a reference frame memory (RFM), a prediction (either inter or intra) (P), and filtering (F). Learned coding 2630 implements the examples described herein. [0268] FIG. 27 is an example method 2700, based on the examples described herein. At 2710, the method includes executing a latent generator of a decoder using a first input to generate a generated output. At 2720, the method includes executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. Method 2700 may be performed with decoder 116, decode picture buffer block or circuit 122, decoder 204, VCM decoder 311, decoder 500, decoder 602, decoder 714, decoder 802, decoder 900, decoder 1010, decoder 1101, decoder 1201, decoder 1301, decoder 1401, decoder 1522, apparatus 50, receiving apparatus 2282 with decoder 2240, apparatus 2300, or decoder 2600. [0269] FIG. 28 is an example method 2800, based on the examples described herein. At 2810, the method includes encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data. At 2820, the method includes providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. Method 2800 may be performed with encoder 114, encoder parameter control block or circuit 108, encoder 202, VCM encoder 304, encoder 400, encoder 704, encoder 1002, encoder 1502, apparatus 50, transmitting apparatus 2280 with encoder 2230, apparatus 2300, or encoder 2500. [0270] FIG.29 is an example method 2900, based on the examples described herein. At 2910, the method includes receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder. At 2920, the method includes decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. Method 2900 may be performed with decoder 116, decode picture buffer block or circuit 122, VCM decoder 311, decoder 500, decoder 602, decoder 714, decoder 802, decoder 900, decoder 1010, decoder 1101, decoder 1201, decoder 1301, decoder 1401, decoder 1522, apparatus 50, receiving apparatus 2282 with decoder 2240, apparatus 2300, or decoder 2600. [0271] FIG.30 is an example method 3000, based on the examples described herein. At 3010, the method includes determining one or more values of respective one or more parameters of one or more neural networks of a latent generator. At 3020, the method includes wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator. 
At 3030, the method includes encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values. At 3040, the method includes providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. Method 3000 may be performed with encoder 114, encoder parameter control block or circuit 108, encoder 202, VCM encoder 304, encoder 400, encoder 704, encoder 1002, encoder 1502, apparatus 50, transmitting apparatus 2280 with encoder 2230, apparatus 2300, or encoder 2500. [0272] The following examples are provided and described herein. [0273] Example 1. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of a decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. [0274] Example 2. The apparatus of example 1, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive one or more indicators of respective one or more portions of data to be decoded by the decoder; execute the latent generator using the first input to obtain the generated output, wherein the first input comprises the one or more indicators of respective one or more portions of data to be decoded by the decoder; and execute the media generator using the second input to obtain decoded one or more portions of data, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the decoded one or more portions of data, and wherein the decoded one or more portions of data correspond to or are indicated by the one or more indicators. [0275] Example 3. The apparatus of example 2, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive encoded spatial and/or temporal locations comprising information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image; decode the encoded spatial and/or temporal locations to obtain the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; wherein the first input to the latent generator comprises the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; execute the latent generator using the first input to obtain a generated tensor, wherein the generated output comprises the generated tensor; and execute the media generator using the second input to obtain one or more decoded pixels or decoded blocks, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the one or more decoded pixels or decoded blocks, and wherein the one or more decoded pixels or decoded blocks correspond to or are identified by the one or more spatial and/or temporal locations. [0276] Example 4. 
The apparatus of example 3, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes. [0277] Example 5. The apparatus of any of examples 1 to 4, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive a bitstream comprising a lossless encoding of an input latent tensor; and perform a lossless decoding of the bitstream to obtain a decoded input latent tensor; wherein the first input to the latent generator comprises the decoded input latent tensor. [0278] Example 6. The apparatus of any of examples 1 to 5, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of the latent generator of the decoder; decode the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder; wherein the first input to the latent generator comprises the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0279] Example 7. The apparatus of example 6, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0280] Example 8. The apparatus of any of examples 1 to 7, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator, if it is available; wherein the first input to the latent generator comprises the signal derived from the one or more previous outputs of the latent generator, if it is available. [0281] Example 9. The apparatus of example 8, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute, at a first iteration, the latent generator using the first input to generate a first generated latent tensor, wherein the generated output comprises the first generated latent tensor; execute, at the first iteration, the media generator using as the second input the first generated latent tensor to generate a first decoded frame, wherein the decoded media data comprise the first decoded frame; execute, at a second iteration subsequent to the first iteration, the latent generator to generate a second generated latent tensor, wherein the first input to the latent generator comprises the first generated latent tensor, wherein the generated output comprise the second generated latent tensor; and execute, at the second iteration subsequent to the first iteration, the media generator to generate a second decoded frame that is subsequent to the first decoded frame, wherein the second input comprises the second generated latent tensor, wherein the decoded media data comprises the second decoded frame. [0282] Example 10. The apparatus of example 9, wherein at the first iteration, the first input to the latent generator comprises a tensor with zero-valued elements. [0283] Example 11. 
The apparatus of any of examples 9 to 10, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: store, in a buffer, the first generated latent tensor generated with the latent generator at the first iteration; and provide, from the buffer to the latent generator, the first generated latent tensor for the latent generator to use as part of the first input at the second iteration subsequent to the first iteration to generate the second generated latent tensor. [0284] Example 12. The apparatus of any of examples 1 to 11, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: combine, using a combination operation, the generated output of the latent generator with a combination signal to generate a combined generated output; and execute the media generator using the second input to obtain the decoded media data, wherein the second input comprises the combined generated output. [0285] Example 13. The apparatus of example 12, wherein: the generated output of the latent generator comprises a prediction, and the combination signal comprises a residual. [0286] Example 14. The apparatus of example 13, wherein the residual comprises an output of a neural network. [0287] Example 15. The apparatus of any of examples 13 to 14, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive a signal from an encoder comprising an encoded residual; and derive the residual from the signal received from the encoder comprising the encoded residual. [0288] Example 16. The apparatus of any of examples 12 to 15, wherein: the generated output of the latent generator comprises a residual, and the combination signal comprises a prediction. [0289] Example 17. The apparatus of any of examples 1 to 16, wherein the first input to the latent generator comprises a signal that indicates one or more viewpoints or viewing directions of an item of data. [0290] Example 18. The apparatus of any of examples 1 to 17, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: update one or more neural networks of the latent generator using one or more updates to respective one or more parameters of the one or more neural networks of the latent generator; wherein an input to the latent generator comprises the one or more updates to the respective one or more parameters of the one or more neural networks of the latent generator. [0291] Example 19. The apparatus of any of examples 1 to 18, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: apply one or more neural networks of the latent generator during the execution of the latent generator using the first input to generate the generated output. [0292] Example 20. The apparatus of example 19, wherein the one or more neural networks of the latent generator are not trained or pretrained in an offline stage. [0293] Example 21. The apparatus of example 20, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: initialize one or more values of one or more parameters of the one or more neural networks of the latent generator based on a random or pseudo-random sampling of a probability distribution. [0294] Example 22. 
The apparatus of any of examples 1 to 21, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if it is available; wherein the media generator is executed using the signal derived from the one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if it is available. [0295] Example 23. The apparatus of any of examples 1 to 22, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the media generator generated with the media generator prior to generating the decoded media data, if it is available; wherein the media generator is executed using the signal derived from the one or more previous outputs of the media generator generated with the media generator prior to generating the decoded media data, if it is available. [0296] Example 24. The apparatus of any of examples 1 to 23, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute the media generator using the decoded media data generated with the media generator, to generate another item of decoded media data. [0297] Example 25. The apparatus of any of examples 1 to 24, wherein the media generator comprises one or more neural networks. [0298] Example 26. The apparatus of example 25, wherein a first neural network and the one or more neural networks of the media generator are trained during an offline or development phase with one or more training iterations until a stopping criterion is satisfied, wherein at least one of the one or more training iterations performed during the offline or development phase comprises: executing the first neural network using an input to the first neural network to generate an output of the first neural network; executing the one or more neural networks of the media generator using as input the output of the first neural network, to generate an output of one or more neural networks of the media generator; computing a loss function based on the input to the first neural network and the output of the one or more neural networks of the media generator; and updating at least one value of at least one parameter of the first neural network and at least one value of at least one parameter of the one or more neural networks of the media generator, based at least on the loss function. [0299] Example 27. The apparatus of example 26, wherein the one or more training iterations performed during the offline or development phase used to train the first neural network and the one or more neural networks of the media generator are performed before both of: executing the latent generator of the decoder using the first input to generate the generated output, and executing the media generator of the decoder using the second input comprising the generated output or the data derived from the generated output to produce the decoded media data. [0300] Example 28. The apparatus of any of examples 1 to 27, wherein the decoded media data comprises decoded visual data or decoded audio data. [0301] Example 29. 
The apparatus of any of examples 1 to 28, wherein the latent generator comprises one or more latent generator neural networks and the media generator comprises one or more media generator neural networks, wherein the one or more media generator neural networks are trained during an offline or development phase, and wherein one or more values or update values of respective one or more parameters of the one or more latent generator neural networks are received by the decoder during an online phase or inference phase and are assigned or applied to the respective one or more parameters of the one or more latent generator neural networks. [0302] Example 30. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. [0303] Example 31. The apparatus of example 30, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image, based on an input; encode the information about one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain encoded spatial and/or temporal locations; and provide, to a decoder, the encoded spatial and/or temporal locations comprising information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image. [0304] Example 32. The apparatus of example 31, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes. [0305] Example 33. The apparatus of any of examples 30 to 32, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using the encoder neural network, information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image to generate a latent tensor; wherein the input comprising media data comprises the information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image, and the output comprising the encoded media data comprises the latent tensor; wherein the latent tensor comprises the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; and perform lossless encoding of the latent tensor comprising the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain a bitstream. [0306] Example 34. 
The apparatus of any of examples 30 to 33, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0307] Example 35. The apparatus of example 34, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based at least on the loss function; and updating the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, using the updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0308] Example 36. The apparatus of example 35, wherein the ground-truth media data comprises an input to the encoder comprising media data. [0309] Example 37. The apparatus of example 35, wherein the encoder comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder. [0310] Example 38. The apparatus of any of examples 30 to 37, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode a latent tensor residual into a bitstream. [0311] Example 39. The apparatus of any of examples 30 to 38, wherein the media data comprises visual data or audio data, and the encoded media data comprises encoded visual data or encoded audio data. [0312] Example 40. 
An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: receive encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decode the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0313] Example 41. The apparatus of example 40, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of the decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. [0314] Example 42. The apparatus of example 41, wherein the first input to the latent generator comprises the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0315] Example 43. The apparatus of any of examples 41 to 42, wherein the decoded media data comprises decoded visual data or decoded audio data, and the latent generator is trained or finetuned or overfitted online. [0316] Example 44. The apparatus of any of examples 40 to 43, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0317] Example 45. An apparatus including: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0318] Example 46. 
The apparatus of example 45, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based at least on the loss function; and updating the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, using the updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0319] Example 47. The apparatus of example 46, wherein the apparatus comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder. [0320] Example 48. The apparatus of any of examples 45 to 47, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. [0321] Example 49. The apparatus of example 48, wherein the media data comprises visual data or audio data, and the encoded media data comprises encoded visual data or encoded audio data. [0322] Example 50. A method including: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. [0323] Example 51. A method including: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. [0324] Example 52. 
A method including: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0325] Example 53. A method including: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0326] Example 54. An apparatus including: means for executing a latent generator of a decoder using a first input to generate a generated output; and means for executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. [0327] Example 55. An apparatus including: means for encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and means for providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. [0328] Example 56. An apparatus including: means for receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and means for decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0329] Example 57. An apparatus including: means for determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; means for encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and means for providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0330] Example 58. 
A computer readable medium including instructions stored thereon for performing at least the following: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output. [0331] Example 59. A computer readable medium including instructions stored thereon for performing at least the following: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder. [0332] Example 60. A computer readable medium including instructions stored thereon for performing at least the following: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder. [0333] Example 61. A computer readable medium including instructions stored thereon for performing at least the following: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator. [0334] References to a ‘computer’, ‘processor’, etc. should be understood to encompass not only computers having different architectures such as single/multi-processor architectures and sequential /parallel architectures but also specialized circuits such as field- programmable gate arrays (FPGAs), application specific circuits (ASICs), signal processing devices and other processing circuitry. References to computer program, instructions, code etc. should be understood to encompass software for a programmable processor or firmware such as, for example, the programmable content of a hardware device such as instructions for a processor, or configuration settings for a fixed-function device, gate array or programmable logic device, etc. [0335] The term “non-transitory,” as used herein, is a limitation of the medium itself (i.e., tangible, not a signal ) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM). 
[0336] As used herein, the terms ‘circuitry’, ‘circuit’ and variants may refer to any of the following: (a) hardware circuit implementations, such as implementations in analog and/or digital circuitry, and (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software including digital signal processor(s), software, and one or more memories that work together to cause an apparatus to perform various functions, and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even when the software or firmware is not physically present. As a further example, as used herein, the term ‘circuitry’ would also cover an implementation of merely a processor (or multiple processors) or a portion of a processor and its (or their) accompanying software and/or firmware. The term ‘circuitry’ would also cover, for example and when applicable to the particular element, a baseband integrated circuit or applications processor integrated circuit for a mobile phone or a similar integrated circuit in a server, a cellular network device, or another network device. Circuitry or circuit may also be used to mean a function or a process used to execute a method. [0337] It should be understood that the foregoing description is only illustrative. Various alternatives and modifications may be devised by those skilled in the art. For example, features recited in the various dependent claims could be combined with each other in any suitable combination(s). In addition, features from different embodiments described above could be selectively combined into a new embodiment. Accordingly, the description is intended to embrace all such alternatives, modifications and variances which fall within the scope of the appended claims. [0338] The following acronyms and abbreviations that may be found in the specification and/or the drawing figures are defined as follows (the abbreviations may be appended with each other or with other characters using e.g. a hyphen, dash (-), or number (or abbreviations having a character may be the same with a character removed), and may be case insensitive):
2D  two-dimensional
3D  three-dimensional
4D  four-dimensional
APS  adaptation parameter set
ASIC  application specific integrated circuit
AVC  advanced video coding
BD  bit distortion
BD-PSNR  bit distortion peak signal-to-noise ratio
CABAC  context-adaptive binary arithmetic coding
coo  coordinates
CPU  central processing unit
CTU  coding tree unit
DCT  discrete cosine transform
FPGA  field programmable gate array
GAN  generative adversarial network
H.2xx  family of video coding standards in the domain of the ITU-T (e.g. H.263, H.264, H.265, H.266, H.274)
HEVC  high efficiency video coding
HMD  head-mounted display
IBC  intra block copy
IEC  International Electrotechnical Commission
I/F  interface
Inv  inverse
I/O  input/output
ISO  International Organization for Standardization
ITU  International Telecommunication Union
ITU-T  ITU Telecommunication Standardization Sector
L0 norm  number of nonzero elements in the vector
L1 norm  sum of the absolute vector values
L2 norm  square root of the sum of the squared vector values
LG  latent generator
MAE  mean absolute error
MAML  model-agnostic meta-learning
mAP  mean average precision
MC  motion compensation
ME  motion estimation
MPEG  moving picture experts group
MSE  mean squared error
MS-SSIM  multi-scale structural similarity
NAL  network abstraction layer
NN  neural network
NNC  neural network coding
N/W  network
par  parameter
pred  prediction
QP  quantization parameter
RAM  random access memory
ReLU  rectified linear unit
res  residual
ROI  region of interest
ROM  read only memory
SEI  supplemental enhancement information
SGD  stochastic gradient descent
SON  self-organizing/optimizing network
SSIM  structural similarity
TV  television
UI  user interface
USB  universal serial bus
VCM  video coding for machines
VG  visual generator
VMAF  video multimethod assessment fusion
VSEI  versatile supplemental enhancement information
VVC  versatile video coding


CLAIMS

What is claimed is:

1. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of a decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
2. The apparatus of claim 1, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive one or more indicators of respective one or more portions of data to be decoded by the decoder; execute the latent generator using the first input to obtain the generated output, wherein the first input comprises the one or more indicators of respective one or more portions of data to be decoded by the decoder; and execute the media generator using the second input to obtain decoded one or more portions of data, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the decoded one or more portions of data, and wherein the decoded one or more portions of data correspond to or are indicated by the one or more indicators.
3. The apparatus of claim 2, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive encoded spatial and/or temporal locations comprising information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image; decode the encoded spatial and/or temporal locations to obtain the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; wherein the first input to the latent generator comprises the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; execute the latent generator using the first input to obtain a generated tensor, wherein the generated output comprises the generated tensor; and execute the media generator using the second input to obtain one or more decoded pixels or decoded blocks, wherein the second input to the media generator comprises the generated output, and wherein the decoded media data comprises the one or more decoded pixels or decoded blocks, and wherein the one or more decoded pixels or decoded blocks correspond to or are identified by the one or more spatial and/or temporal locations.
4. The apparatus of claim 3, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes.
5. The apparatus of any of claims 1 to 4, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive a bitstream comprising a lossless encoding of an input latent tensor; and perform a lossless decoding of the bitstream to obtain a decoded input latent tensor; wherein the first input to the latent generator comprises the decoded input latent tensor.
6. The apparatus of any of claims 1 to 5, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of the latent generator of the decoder; decode the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder; wherein the first input to the latent generator comprises the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
7. The apparatus of claim 6, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
8. The apparatus of any of claims 1 to 7, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator, if it is available; wherein the first input to the latent generator comprises the signal derived from the one or more previous outputs of the latent generator, if it is available.
9. The apparatus of claim 8, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute, at a first iteration, the latent generator using the first input to generate a first generated latent tensor, wherein the generated output comprises the first generated latent tensor; execute, at the first iteration, the media generator using as the second input the first generated latent tensor to generate a first decoded frame, wherein the decoded media data comprises the first decoded frame; execute, at a second iteration subsequent to the first iteration, the latent generator to generate a second generated latent tensor, wherein the first input to the latent generator comprises the first generated latent tensor, wherein the generated output comprises the second generated latent tensor; and execute, at the second iteration subsequent to the first iteration, the media generator to generate a second decoded frame that is subsequent to the first decoded frame, wherein the second input comprises the second generated latent tensor, wherein the decoded media data comprises the second decoded frame.
10. The apparatus of claim 9, wherein at the first iteration, the first input to the latent generator comprises a tensor with zero-valued elements.
11. The apparatus of any of claims 9 to 10, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: store, in a buffer, the first generated latent tensor generated with the latent generator at the first iteration; and provide, from the buffer to the latent generator, the first generated latent tensor for the latent generator to use as part of the first input at the second iteration subsequent to the first iteration to generate the second generated latent tensor.
12. The apparatus of any of claims 1 to 11, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: combine, using a combination operation, the generated output of the latent generator with a combination signal to generate a combined generated output; and execute the media generator using the second input to obtain the decoded media data, wherein the second input comprises the combined generated output.
13. The apparatus of claim 12, wherein: the generated output of the latent generator comprises a prediction, and the combination signal comprises a residual.
14. The apparatus of claim 13, wherein the residual comprises an output of a neural network.
15. The apparatus of any of claims 13 to 14, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: receive a signal from an encoder comprising an encoded residual; and derive the residual from the signal received from the encoder comprising the encoded residual.
16. The apparatus of any of claims 12 to 15, wherein: the generated output of the latent generator comprises a residual, and the combination signal comprises a prediction.
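As an illustrative assumption, the combination operation of claims 12 to 16 can be realized as an element-wise addition of a prediction and a residual; the sketch below shows this single step and is not the only combination the claims cover.

```python
def combine(generated_output, combination_signal):
    """One possible combination operation (claim 12): element-wise addition.
    In claims 13 to 15 the generated output is a prediction and the combination
    signal a residual (possibly decoded from an encoder-provided bitstream);
    in claim 16 the roles are reversed. Addition covers both cases."""
    return generated_output + combination_signal

# The combined result forms the second input to the media generator (claim 12):
# decoded_media = media_gen(combine(latent_gen_output, residual))
```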
17. The apparatus of any of claims 1 to 16, wherein the first input to the latent generator comprises a signal that indicates one or more viewpoints or viewing directions of an item of data.
18. The apparatus of any of claims 1 to 17, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: update one or more neural networks of the latent generator using one or more updates to respective one or more parameters of the one or more neural networks of the latent generator; wherein an input to the latent generator comprises the one or more updates to the respective one or more parameters of the one or more neural networks of the latent generator.
19. The apparatus of any of claims 1 to 18, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: apply one or more neural networks of the latent generator during the execution of the latent generator using the first input to generate the generated output.
20. The apparatus of claim 19, wherein the one or more neural networks of the latent generator are not trained or pretrained in an offline stage.
21. The apparatus of claim 20, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: initialize one or more values of one or more parameters of the one or more neural networks of the latent generator based on a random or pseudo-random sampling of a probability distribution.
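Claims 20 and 21 let the latent generator start from parameters that were never trained or pretrained offline; a minimal sketch of such an initialization is shown below, where the choice of a zero-mean Gaussian, the standard deviation, and the optional seed are assumptions.

```python
import torch
import torch.nn as nn

def random_init(latent_gen: nn.Module, std: float = 0.02, seed=None):
    """Initialize every parameter of the latent generator by random or
    pseudo-random sampling of a probability distribution (claim 21),
    with no offline training or pretraining (claim 20)."""
    if seed is not None:
        torch.manual_seed(seed)           # pseudo-random, reproducible sampling
    with torch.no_grad():
        for p in latent_gen.parameters():
            p.normal_(mean=0.0, std=std)  # zero-mean Gaussian (illustrative choice)
```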
22. The apparatus of any of claims 1 to 21, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if available; wherein the media generator is executed using the signal derived from the one or more previous outputs of the latent generator generated with the latent generator prior to generating the generated output, if available.
23. The apparatus of any of claims 1 to 22, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: derive a signal from one or more previous outputs of the media generator generated with the media generator prior to generating the decoded media data, if available; wherein the media generator is executed using the signal derived from the one or more previous outputs of the media generator generated with the media generator prior to generating the decoded media data, if available.
24. The apparatus of any of claims 1 to 23, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute the media generator using the decoded media data generated with the media generator, to generate another item of decoded media data.
25. The apparatus of any of claims 1 to 24, wherein the media generator comprises one or more neural networks.
26. The apparatus of claim 25, wherein a first neural network and the one or more neural networks of the media generator are trained during an offline or development phase with one or more training iterations until a stopping criterion is satisfied, wherein at least one of the one or more training iterations performed during the offline or development phase comprises: executing the first neural network using an input to the first neural network to generate an output of the first neural network; executing the one or more neural networks of the media generator using as input the output of the first neural network, to generate an output of the one or more neural networks of the media generator; computing a loss function based on the input to the first neural network and the output of the one or more neural networks of the media generator; and updating at least one value of at least one parameter of the first neural network and at least one value of at least one parameter of the one or more neural networks of the media generator, based at least on the loss function.
27. The apparatus of claim 26, wherein the one or more training iterations performed during the offline or development phase used to train the first neural network and the one or more neural networks of the media generator are performed before both of: executing the latent generator of the decoder using the first input to generate the generated output, and executing the media generator of the decoder using the second input comprising the generated output or the data derived from the generated output to generate the decoded media data.
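One offline training iteration in the spirit of claims 26 and 27, sketched as a joint autoencoder-style update of the first neural network and the media generator; the mean-squared-error loss and the Adam optimizer are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

def offline_training_step(first_nn, media_gen, optimizer, media_batch):
    """One iteration of claim 26: execute the first neural network, feed its
    output to the media generator, compute a loss between the input and the
    reconstruction, and update parameters of both networks from that loss."""
    optimizer.zero_grad()
    latent = first_nn(media_batch)                  # output of the first neural network
    reconstruction = media_gen(latent)              # output of the media generator
    loss = F.mse_loss(reconstruction, media_batch)  # loss on input vs. output
    loss.backward()
    optimizer.step()                                # updates both sets of parameters
    return loss.item()

# Example optimizer covering both networks (the joint update of claim 26):
# optimizer = torch.optim.Adam(
#     list(first_nn.parameters()) + list(media_gen.parameters()), lr=1e-4)
```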
28. The apparatus of any of claims 1 to 27, wherein the decoded media data comprises decoded visual data or decoded audio data.
29. The apparatus of any of claims 1 to 28, wherein the latent generator comprises one or more latent generator neural networks and the media generator comprises one or more media generator neural networks, wherein the one or more media generator neural networks are trained during an offline or development phase, and wherein one or more values or update values of respective one or more parameters of the one or more latent generator neural networks are received by the decoder during an online phase or inference phase and are assigned or applied to the respective one or more parameters of the one or more latent generator neural networks.
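Claims 6, 7, 18, 29, and 44 apply received parameter values, or updates to them, to the decoder-side latent generator; the sketch below assumes the values arrive as one flat tensor, which is an illustrative serialization choice, not something specified by the claims.

```python
import torch

def apply_received_parameters(latent_gen, received_values, as_update=False):
    """Assign decoded values to the latent generator's parameters (claims 7, 44)
    or add them as updates to the current parameter values (claim 18)."""
    with torch.no_grad():
        offset = 0
        for p in latent_gen.parameters():
            n = p.numel()
            chunk = received_values[offset:offset + n].view_as(p)
            if as_update:
                p.add_(chunk)    # apply as a parameter update
            else:
                p.copy_(chunk)   # direct assignment of the decoded values
            offset += n
```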
30. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
31. The apparatus of claim 30, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image, based on an input; encode the information about one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain encoded spatial and/or temporal locations; and provide, to a decoder, the encoded spatial and/or temporal locations comprising information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image.
32. The apparatus of claim 31, wherein the information about the one or more temporal locations of the respective one or more pixels or blocks of the image comprises one or more frame indexes.
33. The apparatus of any of claims 30 to 32, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using the encoder neural network, information about one or more spatial and/or temporal locations of respective one or more pixels or blocks of an image to generate a latent tensor; wherein the input comprising media data comprises the information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image, and the output comprising the encoded media data comprises the latent tensor; wherein the latent tensor comprises the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image; and perform lossless encoding of the latent tensor comprising the encoded information about the one or more spatial and/or temporal locations of the respective one or more pixels or blocks of the image to obtain a bitstream.
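Claims 31 to 33 use information about spatial and/or temporal locations of pixels or blocks (for example pixel coordinates and frame indexes, see claim 32) as an encoder input. One common way to materialize such information, assumed here purely for illustration, is a normalized coordinate grid:

```python
import torch

def position_tensor(height, width, frame_index, num_frames):
    """Build a 3-channel tensor of normalized spatial and temporal locations:
    channel 0 = x coordinate, channel 1 = y coordinate, channel 2 = frame index."""
    ys, xs = torch.meshgrid(
        torch.linspace(0.0, 1.0, height),
        torch.linspace(0.0, 1.0, width),
        indexing="ij",
    )
    t = torch.full((height, width), frame_index / max(num_frames - 1, 1))
    return torch.stack([xs, ys, t], dim=0)   # shape: (3, H, W)
```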
34. The apparatus of any of claims 30 to 33, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
35. The apparatus of claim 34, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based at least on the loss function; and updating the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, using the updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
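An encoder-side overfitting loop in the spirit of claims 35 and 46, where only the latent generator is optimized against the ground-truth media while the (offline-trained) media generator stays frozen; the loss function, optimizer, learning rate, and step count below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def overfit_latent_generator(latent_gen, media_gen, position_input,
                             ground_truth, steps=500, lr=1e-3):
    """Training/finetuning/overfitting iterations of claim 35: only the latent
    generator's parameters are updated; the media generator stays frozen."""
    media_gen.eval()
    for p in media_gen.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.Adam(latent_gen.parameters(), lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        latent = latent_gen(position_input)        # training latent tensor
        decoded = media_gen(latent)                # training decoded media data
        loss = F.mse_loss(decoded, ground_truth)   # loss vs. ground truth (claim 36)
        loss.backward()                            # determine the parameter updates
        optimizer.step()                           # apply the updates
    return latent_gen.state_dict()                 # values to be encoded and signalled
```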
36. The apparatus of claim 35, wherein the ground-truth media data comprises an input to the encoder comprising media data.
37. The apparatus of claim 35, wherein the encoder comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder.
38. The apparatus of any of claims 30 to 37, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode a latent tensor residual into a bitstream.
39. The apparatus of any of claims 30 to 38, wherein the media data comprises visual data or audio data, and the encoded media data comprises encoded visual data or encoded audio data.
40. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: receive encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decode the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
41. The apparatus of claim 40, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: execute a latent generator of the decoder using a first input to generate a generated output; and execute a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
42. The apparatus of claim 41, wherein the first input to the latent generator comprises the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
43. The apparatus of any of claims 41 to 42, wherein the decoded media data comprises decoded visual data or decoded audio data, and the latent generator is trained or finetuned or overfitted online.
44. The apparatus of any of claims 40 to 43, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: assign the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder to the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
45. An apparatus comprising: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: determine one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encode information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and provide, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
46. The apparatus of claim 45, wherein the training or finetuning or overfitting process used to determine the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator comprises one or more training or finetuning or overfitting iterations, wherein at least one of the one or more training or finetuning or overfitting iterations comprises: providing an input to the latent generator, wherein the input to the latent generator comprises information related to a spatial and/or temporal location of an image or other media data; executing the latent generator using the input to the latent generator to obtain a training latent tensor; executing the media generator using the training latent tensor as input to obtain a training decoded image or other training decoded media data; computing a loss function, based on the training decoded image or other training decoded media data and ground-truth media data; determining one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, based at least on the loss function; and updating the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, using the updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
47. The apparatus of claim 46, wherein the apparatus comprises at least one or more of: the latent generator, or the media generator, or a component for computing the loss function, or a component for determining the one or more updates to the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator, or the decoder.
48. The apparatus of any of claims 45 to 47, wherein the instructions, when executed by the at least one processor, cause the apparatus at least to: encode, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and provide, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
49. The apparatus of claim 48, wherein the media data comprises visual data or audio data, and the encoded media data comprises encoded visual data or encoded audio data.
50. A method comprising: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
51. A method comprising: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
52. A method comprising: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
53. A method comprising: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
54. An apparatus comprising: means for executing a latent generator of a decoder using a first input to generate a generated output; and means for executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
55. An apparatus comprising: means for encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and means for providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
56. An apparatus comprising: means for receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and means for decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
57. An apparatus comprising: means for determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; means for encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and means for providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
58. A computer readable medium comprising instructions stored thereon for performing at least the following: executing a latent generator of a decoder using a first input to generate a generated output; and executing a media generator of the decoder using a second input to generate decoded media data, wherein the second input comprises the generated output or data derived from the generated output.
59. A computer readable medium comprising instructions stored thereon for performing at least the following: encoding, using at least an encoder neural network of an encoder, an input comprising media data to generate an output comprising encoded media data; and providing, within a bitstream, the output comprising the encoded media data generated using the encoder neural network of the encoder.
60. A computer readable medium comprising instructions stored thereon for performing at least the following: receiving encoded parameter values comprising information about one or more values of respective one or more parameters of one or more neural networks of a latent generator of a decoder; and decoding the encoded parameter values to obtain the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator of the decoder.
61. A computer readable medium comprising instructions stored thereon for performing at least the following: determining one or more values of respective one or more parameters of one or more neural networks of a latent generator; wherein the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator are determined using a training or finetuning or overfitting process that comprises one or more training or finetuning or overfitting iterations involving the latent generator and a media generator; encoding information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator to obtain encoded parameter values; and providing, to a decoder, the encoded parameter values comprising the information about the one or more values of the respective one or more parameters of the one or more neural networks of the latent generator.
PCT/IB2025/054064 2024-04-17 2025-04-17 End-to-end learned coding via overfitting a latent generator Pending WO2025219940A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463635211P 2024-04-17 2024-04-17
US63/635,211 2024-04-17

Publications (1)

Publication Number Publication Date
WO2025219940A1 true WO2025219940A1 (en) 2025-10-23

Family

ID=95651192

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2025/054064 Pending WO2025219940A1 (en) 2024-04-17 2025-04-17 End-to-end learned coding via overfitting a latent generator

Country Status (1)

Country Link
WO (1) WO2025219940A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230154055A1 (en) * 2020-04-29 2023-05-18 Deep Render Ltd Image compression and decoding, video compression and decoding: methods and systems
WO2023245460A1 (en) * 2022-06-21 2023-12-28 Microsoft Technology Licensing, Llc Neural network codec with hybrid entropy model and flexible quantization

Similar Documents

Publication Publication Date Title
US11375204B2 (en) Feature-domain residual for video coding for machines
US11575938B2 (en) Cascaded prediction-transform approach for mixed machine-human targeted video coding
US12170779B2 (en) Training a data coding system comprising a feature extractor neural network
US11638025B2 (en) Multi-scale optical flow for learned video compression
US20240314362A1 (en) Performance improvements of machine vision tasks via learned neural network based filter
US20250211756A1 (en) A method, an apparatus and a computer program product for video coding
US20240146938A1 (en) Method, apparatus and computer program product for end-to-end learned predictive coding of media frames
US12388999B2 (en) Method, an apparatus and a computer program product for video encoding and video decoding
WO2024068081A1 (en) A method, an apparatus and a computer program product for image and video processing
WO2023151903A1 (en) A method, an apparatus and a computer program product for video coding
WO2023031503A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
EP4424014A1 (en) A method, an apparatus and a computer program product for video coding
WO2022084762A1 (en) Apparatus, method and computer program product for learned video coding for machine
WO2023208638A1 (en) Post processing filters suitable for neural-network-based codecs
WO2023089231A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
WO2024223209A1 (en) An apparatus, a method and a computer program for video coding and decoding
WO2024074231A1 (en) A method, an apparatus and a computer program product for image and video processing using neural network branches with different receptive fields
WO2025219940A1 (en) End-to-end learned coding via overfitting a latent generator
US20250373831A1 (en) End-to-end learned codec for multiple bitrates
US20250310522A1 (en) Quantizing overfitted filters
WO2025202872A1 (en) Minimizing coding delay and memory requirements for overfitted filters
WO2022269441A1 (en) Learned adaptive motion estimation for neural video coding
WO2024068190A1 (en) A method, an apparatus and a computer program product for image and video processing
WO2025133815A1 (en) Overfitting shared multipliers
WO2024209131A1 (en) An apparatus, a method and a computer program for video coding and decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25723487

Country of ref document: EP

Kind code of ref document: A1