WO2025196024A1 - Method and data processing system for lossy image or video encoding, transmission and decoding - Google Patents
Method and data processing system for lossy image or video encoding, transmission and decoding
- Publication number
- WO2025196024A1 (PCT/EP2025/057322)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- optical flow
- flow information
- channels
- representation
- Prior art date
- Legal status
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/537—Motion estimation other than block-based
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/59—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
Definitions
- This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
- the compression of image and video content can be lossless or lossy.
- in lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression.
- however, in lossless compression there is a limit to the reduction in data quantity that can be achieved.
- in lossy compression, some information is lost from the image or video during the compression process.
- Known compression techniques attempt to minimise the apparent loss of information by removing information that results in changes to the decompressed image or video that are not particularly noticeable to the human visual system.
- JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
- known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression.
- One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames.
- I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video.
- P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression.
- a P-frame represents the difference between itself and the closest preceding I- or P-frame.
- the process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
- B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process.
- By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, they do not propagate decoding errors since they are not used as references for other frames.
- Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content.
- AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
- An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference.
- An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
- the subset comprises a single channel of the plurality of channels of each of the first image and the second image.
- the plurality of channels of each of the first image and the second image comprise a luma channel and at least one chroma channel.
- the subset comprises the luma channel.
- the luma channel and the at least one chroma channel are defined in a YUV colour space.
- the at least one chroma channel has a different resolution to the luma channel.
- the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
- using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information.
- a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
- using the first image and the second image to produce a latent representation of optical flow information comprises: using the representation of optical flow information at one of said plurality of resolutions to warp a representation of the first image at a different resolution of said plurality of resolutions.
- a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions.
- the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels.
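- by way of illustration (see the sketch below), the flow latent can be produced from the luma channels alone. The following hypothetical PyTorch code is not taken from the application; the network, its layer sizes and all names are assumptions made for illustration only.

```python
# Hypothetical sketch: the latent representation of optical flow information is
# produced from the luma (Y) channels only, i.e. a subset of the image channels.
import torch
import torch.nn as nn

class LumaFlowEncoder(nn.Module):
    """Toy encoder producing a flow latent from the Y planes of two frames."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, y_prev: torch.Tensor, y_cur: torch.Tensor) -> torch.Tensor:
        # Only the luma subset of the channels is used to produce the flow latent.
        return self.net(torch.cat([y_prev, y_cur], dim=1))

prev = torch.rand(1, 3, 64, 64)   # previously reconstructed frame (channels: Y, U, V)
cur = torch.rand(1, 3, 64, 64)    # current frame to be encoded
flow_latent = LumaFlowEncoder()(prev[:, :1], cur[:, :1])  # Y channel only
print(flow_latent.shape)          # torch.Size([1, 8, 16, 16])
```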
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.
- said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image.
- the method comprises warping the data in the respective chroma channels using the downsampled representation of the optical flow information.
- the method comprises warping the data in the respective luma channels using the downsampled representation of the optical flow information.
- using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information.
- a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
- the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data.
- the respective luma channels and chroma channels are defined in a YUV colour space.
- the respective chroma channels have a different resolution to the luma channel.
- the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
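- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; sizes are assumptions) downsamples a flow field estimated at luma resolution to the chroma resolution of a YUV 4:2:0 frame so that the chroma planes can subsequently be warped with it.

```python
# Hypothetical sketch: flow estimated from the Y channels is downsampled to the
# chroma resolution of a YUV 4:2:0 frame. The displacement values are halved
# because one chroma pixel spans two luma pixels.
import torch
import torch.nn.functional as F

flow_luma = torch.randn(1, 2, 64, 64)                       # flow at luma (H x W) resolution
flow_chroma = 0.5 * F.avg_pool2d(flow_luma, kernel_size=2)  # flow at chroma (H/2 x W/2) resolution
print(flow_chroma.shape)                                    # torch.Size([1, 2, 32, 32])
```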
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- the information comprises pixel values on said plurality of image channels.
- the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels.
- the transformed data comprises a plurality of image channels each having a same spatial resolution.
- said same spatial resolution is lower than the spatial resolution of said subset of channels.
- said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.
- the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data.
- the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data.
- the transformation comprises performing a pixel unshuffle operation on the data.
- the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data.
- the transformation comprises performing a convolution operation on the data.
- the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data.
- the transformation comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution.
- one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation.
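- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows one way such a transformation could be realised for 4:2:0 YUV data: a pixel unshuffle with block size 4 on the luma plane and block size 2 on each chroma plane yields 24 channels that all share the same spatial resolution.

```python
# Hypothetical sketch of the space-to-depth ("pixel unshuffle") transformation:
# a YUV 4:2:0 frame (Y at H x W, U and V at H/2 x W/2) is rearranged into 24
# channels that all have the same H/4 x W/4 spatial resolution.
import torch
import torch.nn.functional as F

h, w = 64, 64
y = torch.rand(1, 1, h, w)               # luma plane
u = torch.rand(1, 1, h // 2, w // 2)     # chroma planes at half resolution
v = torch.rand(1, 1, h // 2, w // 2)

# A larger block size is used for the luma plane than for the chroma planes so
# that every resulting channel ends up at the same spatial resolution.
y_blocks = F.pixel_unshuffle(y, downscale_factor=4)   # (1, 16, H/4, W/4)
u_blocks = F.pixel_unshuffle(u, downscale_factor=2)   # (1, 4,  H/4, W/4)
v_blocks = F.pixel_unshuffle(v, downscale_factor=2)   # (1, 4,  H/4, W/4)

packed = torch.cat([y_blocks, u_blocks, v_blocks], dim=1)
print(packed.shape)                      # torch.Size([1, 24, 16, 16])
```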
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving the latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
- Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
- Figure 4 illustrates an example of a video compression, transmission and decompression system.
- Figure 5 illustrates an example of a video compression, transmission and decompression system.
- Figure 6 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Figure 7a illustrates an example of a video compression, transmission and decompression system.
- Figure 7b illustrates an example of a video compression, transmission and decompression system.
- Figure 8 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Figure 9 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information.
- Image and video information is an example of information that may be compressed.
- the file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate.
- compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file.
- Image and video files containing image and video data are common targets for compression.
- the input image may be represented as x.
- the data representing the image may be stored in a tensor of dimensions H x W x C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image.
- Each H x W data point of the image represents a pixel value of the image at the corresponding location.
- Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device.
- an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively.
- the image information is stored in the RGB colour space, which may also be referred to as a model or a format.
- Other examples of colour spaces or formats include the CMYK and the YCbCr colour models.
- the channels of an image file are not limited to storing colour information and other information may be represented in the channels.
- a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video.
- Each image making up a video may be referred to as a frame of the video.
- the output image may differ from the input image and may be represented by x̂.
- the difference between the input image and the output image may be referred to as distortion or a difference in image quality.
- the distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way.
- An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art.
- the distortion function may comprise a trained neural network.
- the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner.
- a relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
- a neural network is an operation that can be performed on an input to produce an output.
- a neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
- Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer.
- Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer.
- a node may receive an input from one or more nodes of the previous layer.
- the one or more operations may include a convolution, a weight, a bias and an activation function.
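- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows a single layer applying a weight matrix, a bias and an activation function to its input.

```python
# Hypothetical sketch of one layer: a weight matrix, a bias vector and an
# activation function (here ReLU) applied to the layer input.
import torch

def layer(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # weight and bias are learnable parameters; ReLU introduces a non-linearity.
    return torch.relu(x @ weight.T + bias)

x = torch.rand(4, 8)         # 4 inputs, each with 8 features
weight = torch.randn(16, 8)  # weight matrix: 16 nodes, each receiving 8 inputs
bias = torch.zeros(16)
out = layer(x, weight, bias) # output of shape (4, 16), passed to the next layer
```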
- Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
- Each of the one or more operations is defined by one or more parameters that are associated with each operation.
- the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer.
- each of the values in the weight matrix is a parameter of the neural network.
- the convolution may be defined by a convolution matrix, also known as a kernel.
- one or more of the values in the convolution matrix may be a parameter of the neural network.
- the activation function may also be defined by values which may be parameters of the neural network.
- the parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network.
- the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
- a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known.
- the initial parameters of the neural network are randomized and the first training input is provided to the network.
- the output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced.
- This process is then repeated for a plurality of training inputs to train the network.
- the difference between the output of the network and the expected output may be defined by a loss function.
- the result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function.
- Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the parameters θ of the network.
- a plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, in which simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression.
- an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function.
- This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
- training or “learning” as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
- the training process may comprise multiple epochs.
- An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm.
- the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
- the training data may be divided into smaller subsets known as batches.
- the size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources.
- a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
- the learnable parameters are updated by a specified amount each time, determined by the learning rate.
- the learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process.
- a smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution.
- a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
- the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise.
- the point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend, for example, on the number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described.
- the loss function may be defined by the rate distortion equation.
- λ may be referred to as a Lagrange multiplier.
- the Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
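- by way of illustration (a general form consistent with the description above, not a specific formula taken from the application), such a rate-distortion loss may be written as L = R + λ·D, where R is the rate term (an estimate of the number of bits required to transmit the quantised latent), D is the distortion term (for example an MSE or neural-network-based measure of the difference between the input and output images), and λ is the Lagrange multiplier weighting distortion against rate.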
- a training set of input images may be used.
- An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/ xypan/research/snr/Kodak.html).
- An example training set of input images is the IMAX image set.
- An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download).
- An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
- An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided.
- the input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder.
- the encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5.
- the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent.
- the quantisation process transforms the continuous latent representation into a discrete quantized latent.
- An example of a quantization process is a rounding function.
- the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130.
- the entropy encoding process may be for example, range or arithmetic encoding.
- the bitstream 130 may be transmitted across a communication network.
- the bitstream is entropy decoded in an entropy decoding process 160.
- the quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent.
- the trained neural network 120 produces an output based on the quantized latent.
- the output may be the output image of the AI based compression process 100.
- the encoder-decoder system may be referred to as an autoencoder.
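- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; layer sizes and names are assumptions) shows the encoder, quantisation and decoder stages of a pipeline such as that of Figure 1, with entropy encoding and decoding omitted for brevity.

```python
# Hypothetical sketch of the Figure 1 pipeline: encoder -> quantisation by
# rounding -> decoder. Entropy encoding/decoding of the quantised latent is
# omitted; it would sit between the rounding and the decoder.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x: torch.Tensor):
        y = self.encoder(x)         # latent representation
        y_hat = torch.round(y)      # quantisation (rounding) at inference time
        x_hat = self.decoder(y_hat) # reconstructed output image
        return y_hat, x_hat

model = Autoencoder()
image = torch.rand(1, 3, 64, 64)
latent, reconstruction = model(image)
print(latent.shape, reconstruction.shape)  # (1, 16, 16, 16) and (1, 3, 64, 64)
```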
- Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution.
- in some cases, this learning may comprise learning optimal location and scale parameters of Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to minimise the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
- although a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back-propagating a gradient of a loss function through a rounding function is therefore challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches.
- a noise quantisation model is differentiable everywhere and accordingly does allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system.
- a straight-through estimator (STE) quantisation model or other quantisation models may be used.
- different quantisation models may be used during evaluation of different terms of the loss function.
- noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
- end-to-end learning of the quantisation process achieves a similar effect.
- learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving optimal rate and distortion trade off.
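- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows the two training-time quantisation surrogates discussed above: additive uniform noise for evaluating the rate term and a straight-through estimator (STE) for evaluating the distortion term.

```python
# Hypothetical sketch of training-time quantisation surrogates.
import torch

def noise_quantise(y: torch.Tensor) -> torch.Tensor:
    # Additive uniform noise in [-0.5, 0.5) is differentiable everywhere.
    return y + (torch.rand_like(y) - 0.5)

def ste_quantise(y: torch.Tensor) -> torch.Tensor:
    # Rounding in the forward pass; gradients pass straight through unchanged.
    return y + (torch.round(y) - y).detach()

y = torch.randn(2, 8, 16, 16, requires_grad=True)
y_noise = noise_quantise(y)  # used when evaluating the rate/entropy loss term
y_ste = ste_quantise(y)      # used when evaluating the distortion loss term
```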
- the system described above may be distributed across multiple locations and/or devices.
- the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server.
- the decoder 120 may be located on a separate device which may be referred to as a recipient device.
- the system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
- the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process.
- the hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder and a trained neural network 125 acting as a hyper-decoder.
- An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above.
- the neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110.
- the hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation.
- the hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent.
- the quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above.
- the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135.
- the bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent.
- the quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder.
- the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100.
- the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
- as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150.
- Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100.
- At least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150, 155 is performed.
- the residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent.
- the residual values may also be normalised.
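- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; the mean and scale values are placeholders standing in for hyper-decoder outputs) shows a latent being converted to a normalised residual before quantisation, and the inverse transformation that would be applied on the decode side.

```python
# Hypothetical sketch: the latent is converted to a normalised residual using a
# mean and scale (as would be predicted by a hyper-decoder), quantised, and the
# inverse transformation recovers the quantised latent on the decode side.
import torch

latent = torch.randn(1, 16, 8, 8)
mean = torch.zeros_like(latent)          # placeholder for a hyper-decoder output
scale = torch.full_like(latent, 0.7)     # placeholder for a hyper-decoder output

residual = (latent - mean) / scale       # normalised residual
quantised = torch.round(residual)        # quantised residual to be entropy encoded

recovered = quantised * scale + mean     # inverse transform on the decode side
```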
- a training set of input images may be used as described above.
- the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
- the training process may further include a generative adversarial network (GAN).
- an additional neural network acting as a discriminator is included in the system.
- the discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake.
- the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input.
- a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
- the output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
- after training, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination.
- Hallucination is the process of adding information in the output image 6 that was not present in the input image 5.
- hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120.
- the hallucination performed may be based on information in the quantized latent received by decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
- the frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process.
- frames which are decoded using no information from other frames may be referred to as I-frames.
- Frames which are decoded using information from past frames may be referred to as P-frames.
- Frames which are decoded using information from past frames and future frames may be referred to as B-frames.
- Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
- the images represented by each frame of a video may be related.
- a number of frames in a video may show the same scene.
- a number of different parts of the scene may be shown in more than one of the frames.
- objects or people in a scene may be shown in more than one of the frames.
- the background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame.
- the transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation.
- the flow may be represented by a vector.
- One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
- An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3.
- the process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
- I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2.
- an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204.
- the bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame.
- the decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline.
- P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂t-1 from an earlier frame t-1.
- the previously reconstructed image x̂t-1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
- previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames.
- both the image being compressed xt and the previously reconstructed image from an earlier frame x̂t-1 are passed into a flow module part 206 of the pipeline.
- the flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂t-1 and xt, which is indicative of inter-frame movement of pixels or pixel groups between x̂t-1 and xt.
- the latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208.
- the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map.
- the reconstructed flow map is applied to the previously reconstructed image x̂t-1 to generate a warped image.
- any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used.
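- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) implements a simple bi-linear warp in which a flow map given in pixel units displaces the sampling grid applied to the previously reconstructed frame.

```python
# Hypothetical sketch of bi-linear warping using torch.nn.functional.grid_sample.
import torch
import torch.nn.functional as F

def bilinear_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (N, C, H, W) frame by a (N, 2, H, W) flow given in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Displace the base sampling grid by the flow and normalise to [-1, 1].
    gx = 2.0 * (xs.float().unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys.float().unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                  # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

previous = torch.rand(1, 3, 64, 64)         # previously reconstructed frame
flow_map = torch.randn(1, 2, 64, 64)        # reconstructed flow map in pixels
warped = bilinear_warp(previous, flow_map)  # prediction of the current frame
```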
- the warped image is a prediction of how the previously reconstructed image x̂t-1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs xt and x̂t-1.
- the reconstructed flow map and corresponding warped image may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
- both the image being compressed xt and the warped image are passed into a residual module part 210 of the pipeline.
- the residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image xt and the warped image.
- the latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted.
- the bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map from the decoded latent representation.
- a residual map may first be pre-calculated between xt and the warped image and the pre-calculated residual map may be passed into an autoencoder for compression only.
- the residual map is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂t which is a reconstruction of image xt and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress xt+1, and so on until an entire video stream or chunk of a video stream has been processed.
- the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream by removing the connection between the warped image and the output of the residual block 210, thereby eliminating any direct combination step with the warped previously decoded image to speed up inference.
- the flow information is intuitively understood to be indirectly captured within the residual information, which the residual decoder is able to learn to use to directly reconstruct the output image x̂t.
- the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream in combination with some representation of flow injected into one or more layers of the residual decoder.
- the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image.
- any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2.
- the bitstream may also contain hyper and hyper-hyper parameters, their latent quantised, entropy encoded latent representations and so on, of those networks as applicable.
- the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689).
- the above-described flow and residual based approach is highly effective at reducing the amount of data that is to be transmitted because, as long as at least one reconstructed frame (e.g. an I-frame) is available at the decode side, subsequent frames can be reconstructed from the comparatively compact flow and residual information rather than from fully encoded frames.
- FIG. 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400.
- the system 400 comprises a first device 401 and a second device 402.
- the first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices.
- the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively.
- the CPU of first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU.
- the first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene.
- the video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames.
- the blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401, for example via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
- the second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
- the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means the cumulative latency of the encode and decode sides must be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
- execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods.
- Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time.
- efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
- RGB is an additive color model that represents colors as a combination of red, green, and blue light.
- Each color channel is typically represented by an 8-bit value, ranging from 0 to 255. The combination of these three channels allows for the representation of a wide range of colors.
- a color is represented as a triplet (R, G, B), where R, G, and B are the intensity values of the red, green, and blue components, respectively.
- In contrast, YUV separates the color information into two components: luminance (Y) and chrominance (U and V). The YUV color space takes advantage of the human visual system's differing sensitivity to brightness and color information. The luminance (or luma) component (Y) represents the brightness or intensity of a pixel.
- Y = w1·R + w2·G + w3·B, where w1, w2, w3 are weights assigned to each color channel based on the human eye's sensitivity to different wavelengths of light.
- the green channel is given the highest weight because the human eye is most sensitive to green light.
- the chrominance components (U and V) represent the color information in the YUV color space.
- the BT.601 standard, also known as CCIR 601, provides recommendations for standard-definition digital video. It specifies the digitization of the YUV color space components using an 8-bit representation, essential for converting analog video signals into a digital format for television broadcasts, DVDs, and early streaming video formats.
- the Y, U, and V components are represented by 8-bit values, allowing for 256 levels of intensity ranging from 0 to 255.
- This quantization process converts the continuous range of YUV values into discrete levels suitable for digital storage and processing.
- the BT.601 standard defines the conversion formulas from RGB to YUV in a digital context, taking into account digital system characteristics and the need for efficient color information encoding.
- Y = 16 + (65.481·R + 128.553·G + 24.966·B), U = 128 + (-37.797·R - 74.203·G + 112.0·B), V = 128 + (112.0·R - 93.786·G - 18.214·B)
- R, G and B are the digital values of the red, green, and blue components, normalized to the range of 0 to 1 (e.g., a value of 255 in an 8-bit system is represented as 1.0).
- the constants include offsets (16 for Y and 128 for U and V) to center the chrominance components, and scaling factors to adjust the amplitude of the signals within the 8-bit range. More generally, in YUV color spaces, the U component represents the difference between the blue channel and the luminance, while the V component represents the difference between the red channel and the luminance. By separating the luminance and chrominance information, YUV allows for more effective compression techniques because the human visual system is more sensitive to changes in brightness than changes in color, so the luminance component can be compressed with less loss of perceived quality compared to the chrominance components. YUV color space also allows for subsampling of the chrominance components to further reduce the amount of data required for video transmission.
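- By way of illustration only, the sketch below applies the 8-bit BT.601 digital conversion described above using NumPy; the function name and the rounding/clipping behaviour are illustrative assumptions rather than part of the present disclosure.

```python
import numpy as np

def rgb_to_yuv_bt601(rgb):
    """Convert an (H, W, 3) array of 8-bit RGB values to 8-bit BT.601 YUV.

    The RGB values are first normalised to [0, 1]; the digital BT.601 offsets
    (16 for Y, 128 for U and V) and scaling factors are then applied.
    """
    rgb = rgb.astype(np.float64) / 255.0          # normalise to [0, 1]
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 16.0 + (65.481 * r + 128.553 * g + 24.966 * b)
    u = 128.0 + (-37.797 * r - 74.203 * g + 112.0 * b)
    v = 128.0 + (112.0 * r - 93.786 * g - 18.214 * b)
    yuv = np.stack([y, u, v], axis=-1)
    return np.clip(np.round(yuv), 0, 255).astype(np.uint8)

# Example: a single pure-red pixel.
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(rgb_to_yuv_bt601(pixel))  # approximately [[[ 81  90 240]]]
```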
- Subsampling involves reducing the spatial resolution of the U and V components while preserving the full resolution of the luminance component.
- Known subsampling schemes include: 4:4:4 - no subsampling; the U and V components have the same resolution as the luminance component.
- 4:2:2 - the U and V components are subsampled horizontally by a factor of 2.
- 4:2:0 - the U and V components are subsampled both horizontally and vertically by a factor of 2.
- Subsampling reduces the amount of color information without significantly impacting the perceived visual quality, as the human eye is less sensitive to color details than brightness details.
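- As a rough illustration of the data reduction offered by chroma subsampling, the short calculation below counts 8-bit samples per frame for a 1920x1080 image under the three schemes listed above; the helper function is illustrative only.

```python
def samples_per_frame(width, height, scheme):
    """Return the number of 8-bit samples per frame for a YUV subsampling scheme."""
    luma = width * height
    if scheme == "4:4:4":
        chroma = 2 * width * height                 # U and V at full resolution
    elif scheme == "4:2:2":
        chroma = 2 * (width // 2) * height          # U and V halved horizontally
    elif scheme == "4:2:0":
        chroma = 2 * (width // 2) * (height // 2)   # U and V halved in both dimensions
    else:
        raise ValueError(scheme)
    return luma + chroma

for s in ("4:4:4", "4:2:2", "4:2:0"):
    print(s, samples_per_frame(1920, 1080, s))
# 4:4:4 -> 6,220,800 samples; 4:2:2 -> 4,147,200; 4:2:0 -> 3,110,400
```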
- FIG. 5 shows flow encoder/decoder networks and residual encoder/decoder networks of an AI based compression pipeline.
- the flow encoder neural network takes a current image x_t and a previous image x̂_t-1, encodes these into a flow latent representation which is optionally quantised and entropy encoded and transmitted as a bit stream.
- the bitstream is received and entropy decoded into the flow latent representation that the flow decoder neural network uses as input to produce a representation of flow f_t, e.g. a flow map.
- the flow map f_t is applied to a previously decoded image x̂_t-1 to generate a warped version of that previously decoded image, x̂_t-1,w.
- the warped version of the previously decoded image x̂_t-1,w is then fed into the residual encoder network, together with the current image x_t, to produce a residual latent representation which is optionally quantised and entropy encoded and transmitted as a bit stream.
- the bitstream is received and entropy decoded back into the residual latent representation and used by the residual decoder neural network, in combination with information from the warped version of the previously decoded image x̂_t-1,w, to produce the reconstructed image x̂_t.
- the information associated with the warped previously decoded image x̂_t-1,w is optionally first processed by a module referred to hereinafter as a composition adapter, for example to downsample and/or pad it, before it is fed into the residual decoder together with the entropy decoded residual latent representation to produce the final reconstructed image x̂_t.
- This process may then be repeated for t+1 and so on to encode, transmit, and decode a sequence of frames.
- An example implementation of the flow encoder neural network 207 is shown in Figure 6.
- Figure 6 illustrates an example of a flow module, in this case a network 600, configured to estimate information indicative of a difference between an image x_t-1 and an image x_t, e.g. flow information.
- the network 600 comprises a set of layers 601a, 601b respectively for an image x_t-1 and an image x_t from respective times or positions t-1 and t of a sequence of image frames.
- the set of layers 601a, 601b may define one or more convolution operations and/or nonlinear activations in the layers to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution.
- This may comprise performing h/2 × w/2 downsampling in a first layer, h/4 × w/4 downsampling in a second layer, h/8 × w/8 downsampling in a third layer, h/16 × w/16 downsampling in a fourth layer, h/32 × w/32 downsampling in a fifth layer, h/64 × w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other levels are also envisaged.
- a first cost volume 602 is calculated at the coarsest level between the pixels of the first image x_t-1 and the corresponding pixels in the second image x_t.
- Cost volumes define the matching cost of matching the pixels in the initial image with the pixels in the later image. That is, the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of closeness such as a vector dot product, a cosine similarity, a mean absolute difference, or some other measure of closeness.
- a first flow 603 can be estimated from the first cost volume 602. This may be achieved using, for example, a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person.
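- The following is a minimal, illustrative sketch of a local dot-product cost volume between two feature maps at one pyramid level, assuming a PyTorch implementation; the function name, the search-window size and the use of a dot product as the measure of closeness are assumptions for illustration, not a definitive implementation of the flow module described above.

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat_prev, feat_cur, max_disp=4):
    """Dot-product cost volume between two feature maps of shape (B, C, H, W).

    For every pixel in feat_prev, similarity is computed against pixels of
    feat_cur within a (2*max_disp+1)^2 search window, giving an output of
    shape (B, (2*max_disp+1)**2, H, W).
    """
    b, c, h, w = feat_prev.shape
    pad = F.pad(feat_cur, [max_disp] * 4)            # pad W and H for the search window
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat_prev * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(costs, dim=1)

# Example usage with random feature maps at one pyramid level.
cv = local_cost_volume(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(cv.shape)  # torch.Size([1, 81, 16, 16])
```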
- weights and/or biases of any activation layers in network 600 are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline.
- the trainable nature of these parameters provides the network 600 with flexibility to produce feature maps at each level of spatial resolution (i.e. the pyramid feature maps) and/or flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on).
- it allows the network 600 to produce feature maps that, when cost volumes and/or flow are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline, as is shown in the illustrative example of Figure 7a, described later.
- the flow of the previous level or levels of coarseness or resolution may be used to warp 608, 609, the feature maps before the cost volume is calculated.
- This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t - 1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details.
- the inventors have realized that removing warping entirely or in some levels of coarseness or resolution can substantially reduce runtime of flow calculation while maintaining good levels of flow accuracy.
- the flow estimation output may be upsampled 610, 611 first to match the coarseness resolution of the feature map to which the flow is being applied in the warping process.
- the outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
- Although referred to herein as a cost volume, the information indicative of differences between the respective inputs need not be a strict cost volume in the mathematical sense, but may be any representation of this information.
- For example, a compressively calculated cost volume may be used, applying the principles of compressive sensing to estimate an approximate cost volume by sampling only a small number of pixel differences compared to performing a complete pixel-wise cost volume calculation.
- This compressively calculated cost volume approach may be applied to all embodiments described herein.
- running a flow-based compression pipeline once training has been completed relies on estimation of flow, whether by handcrafted algorithm or through some trained network.
- the output estimated flow itself may be compressed and transmitted in a bitstream, which is typically done by a dedicated flow encoder network that encodes the flow information into a latent representation distributed according to a distribution that can be efficiently entropy encoded.
- dedicated flow encoders increase run time and partly contribute to preventing learned compression codecs from running in real time or near real time.
- a previously reconstructed image and a new, to-be-encoded image may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames.
- both the current image being compressed x_t and the previously reconstructed image from an earlier frame x̂_t-1 are passed into a flow module 207 that typically has two parts: a flow estimation part, and a dedicated encoder part that encodes the estimated flows into a latent representation that can be efficiently entropy encoded.
- the dedicated flow encoder is effectively a specialized network dedicated to producing latent representations of flow that are distributed as close to an optimally entropy encodable distribution as possible. In Figure 3, these two parts are shown as a single component.
- the flow estimation part may comprise, for example, the flow module 600 of Figure 6.
- the flow module 600 produces its cost volumes and flows, then passes one or more of these into the second part: the dedicated flow encoder network that is trained to produce the latent representation of flow that can be efficiently entropy encoded before being sent in a bitstream to the decoder.
- This approach is slow because we first have to calculate the cost volumes and flows before we can encode them into a latent representation, which itself is a slow process. We then do the entropy encoding of the latent representation of the cost volume(s) and/or flow(s) into a bitstream which is finally transmitted.
- This multi-step approach increases run time. Instead, in order to reduce compute and runtime overhead, the present disclosure also envisages omitting the dedicated flow encoder.
- the output(s) of the flow module 600 are directly entropy encoded and fed into the bitstream without first encoding them into a latent representation.
- Since the process of encoding flow module output(s) into a latent representation with a flow compression encoder is computationally expensive, removing this component entirely from the flow compression module 207 results in a significant decrease in runtime, thereby contributing to the goal of being able to run the pipeline in inference in real time or near real time.
- FIG. 7a illustratively shows two flow compression modules 700a and 700b which may be used as flow module 207 in Figure 3. The same reference numerals are used for like-components.
- two input images x_t and x̂_t-1 are passed 701a, 701b into a flow module such as flow module 600 of Figure 6, which produces pyramid feature maps 702 of different levels of coarseness, and corresponding cost volumes and/or flows 703.
- the final flow estimation is then passed to a dedicated flow compression encoder 704 which encodes it to produce a latent representation of the final flow estimation.
- the output may thus be a latent representation of cost volumes and/or flows at one or more of H×W, H/2×W/2, H/4×W/4, H/8×W/8, H/16×W/16, H/32×W/32, H/64×W/64, or some other resolution. This is finally entropy encoded into a bitstream 705, and transmitted.
- the bitstream 705 is entropy decoded and the decoded bitstream is passed to a decoder 606 which reconstructs the cost volumes and/or flows which may be used in a flow-based approach as described above in the general concepts section.
- the approach of the flow compression module 700a with a dedicated flow compression encoder 704 is slow and computationally expensive.
- the present inventors have realized that omitting the dedicated flow compression encoder 704 entirely and instead directly entropy encoding one or more outputs of the flow module 600 into the bitstream 705 without first passing it through the dedicated flow encoder 704 results in a substantial speed up at run time.
- the flow module 600 learns during training to compensate by outputting cost volumes and/or flows that are similarly distributed as those that would be output by a dedicated flow compression encoder 704.
- This can be understood as the flow module 600 (that is, the trainable networks within it such as the flow extractor network and/or activation layers in the convolution layers) effectively being forced to mimic the dedicated flow compression encoder 704 during training when the loss is being minimized because the system can no longer rely on the (now removed) dedicated flow compression encoder 704 to perform that task.
- the training of the flow module 600 may either be performed in an end-to-end manner together with the rest of the compression pipeline or alternatively in a student-teacher approach where the network of the flow module 600 is the student component, and a known optical flow model and pre-trained flow compression network is the teacher component. Additionally or alternatively, the training of the flow module 600 may be performed separately using data on which the groundtruth flow is known. For example, by using 3D animation video data whose groundtruth flow is known a priori through the animation program used to generate it, or using auto flow generation methods.
- FIG. 7b illustrates the introduction of the flow compression modules 700b into a flow-residual compression pipeline, such as that of Figure 3, by illustratively showing the encoding and decoding of p-frames of an image stream, whereby a groundtruth image and its previously encoded and decoded reconstruction at t-1 are available.
- the output of the flow module 702 is natively in the distribution of a latent representation and can be immediately transmitted to the decoder, without needing a standalone dedicated flow compression encoder.
- the warped reconstructed, previously decoded image from t-1 may be passed directly to the residual decoder, in addition to whatever is output by the residual encoder 211. This provides the residual decoder with the additional context of the warped reconstructed, previously decoded image.
- the present inventors have realised that, when moving out of RGB space and into YUV space, the flow information is largely contained in the luma channel. Accordingly, flow can be estimated using the flow module operating only on the luma channel. Cutting the number of channels that get fed through the pyramid layers from three channels to a single channel results in a substantial speed up with very little loss in overall performance in terms of image reconstruction accuracy (e.g. measured by distortion such as an MSE score) and compression performance (e.g. measured in bits per pixel). Accordingly, the present concept 1 is directed to replacing a multi-channel input tensor to the flow modules shown in Figures 6 to 7b with a single-channel tensor.
- FIG. 8 illustratively shows a modified flow module 800 corresponding to that of Figure 6; like-numbered references are used for like components. Additionally, a pre-processing module 801 is introduced before the flow module 800.
- the pre-processing module is configured to select the luma channel from a YUV input to feed into the flow module 800, after which the flow module 800 operates as described above in relation to Figure 6.
- the pseudocode provided below illustratively shows an exemplary operation of the pre-processing module, configured to select a luma channel from an input in YUV format:
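- The referenced pseudocode is not reproduced in this text; a minimal sketch of the channel-selection step, assuming a planar (B, 3, H, W) PyTorch tensor whose first channel is the luma (Y) plane, might look as follows (function and variable names are illustrative). For YUV 4:2:0 data stored as separate planes, the same selection simply takes the Y plane.

```python
import torch

def select_luma(yuv):
    """Select the luma (Y) channel from a planar YUV tensor of shape (B, 3, H, W).

    Only this single channel is fed into the flow module, reducing the number
    of channels processed by the pyramid layers from three to one.
    """
    return yuv[:, 0:1, :, :]   # keep the channel dimension so the result is (B, 1, H, W)

frame = torch.rand(1, 3, 64, 64)     # illustrative YUV 4:4:4 frame
luma = select_luma(frame)
print(luma.shape)                    # torch.Size([1, 1, 64, 64])
```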
- Concept 2 Downsampled YUV warping
- warping is used to exploit temporal redundancy between consecutive frames.
- the goal of warping is to estimate and compensate for the motion of objects or regions within a video sequence, allowing for more efficient compression by reducing the amount of information that is to be encoded.
- a frame is typically divided into smaller blocks, such as macroblocks or coding units, which are then processed independently. Warping is performed on these blocks to find the best matching block in a reference frame, usually a preceding or succeeding frame, and to generate a motion vector that describes the displacement between the current block and its best match.
- the warping process can be described mathematically using a motion model.
- One commonly used motion model is the affine motion model, which assumes that the motion of a block can be represented by a linear transformation.
- the affine motion model is defined by a 2x3 matrix: [ a11 a12 tx ; a21 a22 ty ], where a11, a12, a21, a22 represent the rotation, scaling and shear parameters, and tx, ty represent the translation parameters.
- A block matching criterion such as the sum of absolute differences (SAD) may be used to find the best matching block: SAD(dx, dy) = Σ_{x,y} | C(x, y) − R(x + dx, y + dy) |, where C(x, y) represents the pixel values of the current block, R(x + dx, y + dy) represents the pixel values of the candidate block in the reference frame shifted by (dx, dy), and the sum runs over the N×N pixels of the block, N being the block size.
- the current block can be reconstructed by copying the pixels from the reference frame at the displaced location indicated by the motion vector. This process is known as motion compensation.
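- As an illustration of the traditional block-matching motion estimation described above, the sketch below performs an exhaustive SAD search over a small window; the block size, search range and variable names are illustrative assumptions.

```python
import numpy as np

def best_motion_vector(cur, ref, x0, y0, n=16, search=8):
    """Exhaustive SAD block matching for the n x n block at (x0, y0) in `cur`.

    Returns the (dx, dy) displacement into `ref` with the lowest sum of
    absolute differences within a +/- `search` pixel window.
    """
    block = cur[y0:y0 + n, x0:x0 + n].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ys, xs = y0 + dy, x0 + dx
            if ys < 0 or xs < 0 or ys + n > ref.shape[0] or xs + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[ys:ys + n, xs:xs + n].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv

# Example: a frame whose content is displaced so the best match lies at (dx, dy) = (3, 2).
ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(-2, -3), axis=(0, 1))
print(best_motion_vector(cur, ref, x0=24, y0=24))  # expected (3, 2)
```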
- the affine motion model has limitations in representing complex motion patterns, such as non-rigid or deformable objects. More advanced motion models, such as the projective motion model or the elastic motion model, can be employed in traditional compression as an alternative.
- the projective motion model, for example, is defined by a 3x3 homography matrix: [ x′ ; y′ ; w′ ] = [ h11 h12 h13 ; h21 h22 h23 ; h31 h32 h33 ] · [ x ; y ; 1 ], where w′ is a scaling factor and the final coordinates are obtained by dividing x′ and y′ by w′.
- Elastic motion models such as the free-form deformation (FFD) model, allow for even more flexibility in representing complex motion.
- FFD models define a deformation grid over the image and use spline interpolation to compute the displacement of each pixel based on the grid points.
- the choice of motion model depends on the characteristics of the video content and the desired trade-off between compression efficiency and computational complexity.
- video compression standards such as H.264/AVC or H.265/HEVC, often use hierarchical motion estimation and compensation techniques, where the video frames are decomposed into a pyramid of resolutions, and motion estimation is performed at each level to capture both large-scale and small-scale motion.
- In AI-based compression pipelines, by contrast, warping has a more indirect role: facilitating the estimation of a more accurate representation of flow at different resolutions using trained neural network pyramid layers, as shown in e.g. Figure 6, which in turn can facilitate more accurate reconstruction by a fully neural network based residual encoder/decoder module, as shown in Figure 6b.
- the present inventors have realised that there are some improvements that can be made to AI-based flow estimation when working in YUV 4:2:2 and/or 4:2:0 space.
- When flow is estimated from the full-resolution luma channel, the output flow representation that is fed into the warping step is at full resolution.
- the final output that is eventually derived from the warped frame is to be assembled from the luma and chroma information. Accordingly, the chroma information still has to be reintroduced and recombined with the luma information in some way.
- If the chroma information is missing entirely from the outputs of the flow estimation steps, then it would have to be sent separately in the bitstream so that the decode side gets the chroma information one way or another. Instead of sending chroma information separately, it is combined again with the luma information during warping. Accordingly, the luma (Y) channel information from which flow is being estimated is recombined with the chroma (UV) channel information before warping is performed on the combined YUV information.
- the present inventors have realised that the overall effect on image reconstruction accuracy and compression performance that the warping step provides in the architecture of Figure 6 is approximately the same irrespective of whether the warping step is performed in the full-resolution of the luma (Y) channel or in the half-resolution of the chroma (UV) channels.
- performing warping in the half-resolution of the chroma (UV) channels is substantially faster computationally.
- the full-resolution luma (Y) channel information and any flow information derived therefrom can be downsampled to match the lower resolution of the (UV) channels before the warping step is performed in this lower resolution.
- the resulting warped, combined YUV information is then used to estimate the flow for the next resolution layer of the feature pyramid.
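- A minimal sketch of warping at the chroma resolution is given below, assuming PyTorch, YUV 4:2:0 planes and bilinear sampling via grid_sample; the halving of the flow values when downsampling and the averaging used for downsampling are illustrative assumptions rather than the exact implementation of Figure 9.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Bilinearly warp `img` (B, C, H, W) by `flow` (B, 2, H, W) given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalise sampling coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def warp_in_chroma_resolution(luma, chroma, flow_full):
    """Warp both luma and chroma at the (half) chroma resolution of YUV 4:2:0.

    `luma`: (B, 1, H, W), `chroma`: (B, 2, H/2, W/2), `flow_full`: (B, 2, H, W).
    The full-resolution flow is downsampled and its displacements halved.
    """
    flow_half = 0.5 * F.avg_pool2d(flow_full, kernel_size=2)
    luma_half = F.avg_pool2d(luma, kernel_size=2)            # match chroma resolution
    return warp(torch.cat([luma_half, chroma], dim=1), flow_half)  # (B, 3, H/2, W/2)

out = warp_in_chroma_resolution(torch.rand(1, 1, 64, 64),
                                torch.rand(1, 2, 32, 32),
                                torch.zeros(1, 2, 64, 64))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```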
- FIG. 9 illustratively shows an example modified flow module 900 of the present disclosure demonstrating a non-limiting implementation example of concept 2.
- the inputs x̂_t-1 and x_t are pre-processed to select respectively the luma and chroma channels.
- the luma channel is fed into the feature pyramid layers as in Figure 6, but the chroma channels are instead fed directly into the warping steps 608, 609. For the downsampling pyramid layer where the native chroma channel resolution matches the luma channel resolution, no further downsampling of the chroma channel is performed.
- For other pyramid layers, a downsampling step may be performed on the chroma channels so that the warping is performed using matching luma and chroma channel resolutions.
- the operation of the pre-processing modules 901a and 901b to select the respective luma and chroma channels may correspond to that described in relation to concept 1 above. It will further be appreciated that concepts 1 and 2 may also be combined alone or together with concept 3 described below.
- In YUV 4:2:0 or YUV 4:2:2 data, the chroma (UV) channels have a lower resolution than the luma (Y) channel; the shape of the input tensor is accordingly more complicated and convolutions of the neural networks of the AI-based compression pipeline cannot be applied in a straightforward way to a non-uniformly shaped tensor.
- the present concept 3 is directed to solving this problem.
- One approach is to upsample the two chroma (UV) channels to match the resolution of the luma (Y) channel to produce a uniformly shaped input tensor (effectively pre-processing YUV 4:2:0 or YUV 4:2:2 data into YUV 4:4:4 data) before performing any convolutions on the, now uniformly shaped, tensor.
- the method may comprise receiving an input tensor in YUV 4:2:0 format, which may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV).
- the shape of the tensor is extracted and upsampling is performed using a scale factor of 2 on the U and V channels.
- the Y channel is left alone.
- the Y channel is then stacked with the now upsampled U and V channels and the resulting tensor is returned.
- the returned tensor now has a uniform shape where all the channels have the same resolution and accordingly convolutions can be performed on the tensor in the usual way.
- the upsampling may comprise one or more of: nearest neighbour interpolation, bilinear interpolation, bicubic interpolation, a transposed convolution (deconvolution) or any other upsampling technique known to the skilled person.
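- A minimal sketch of this first, naive approach is shown below, assuming PyTorch tensors with the Y plane at (B, 1, H, W) and the U and V planes at (B, 1, H/2, W/2); the choice of nearest neighbour interpolation and the function name are illustrative only.

```python
import torch
import torch.nn.functional as F

def yuv420_to_444(y, u, v, mode="nearest"):
    """Upsample the chroma planes of a YUV 4:2:0 frame and stack into a 4:4:4 tensor.

    y: (B, 1, H, W); u, v: (B, 1, H/2, W/2).  Returns a (B, 3, H, W) tensor on
    which standard convolutions can be applied.
    """
    u_up = F.interpolate(u, scale_factor=2, mode=mode)
    v_up = F.interpolate(v, scale_factor=2, mode=mode)
    return torch.cat([y, u_up, v_up], dim=1)

frame = yuv420_to_444(torch.rand(1, 1, 64, 64),
                      torch.rand(1, 1, 32, 32),
                      torch.rand(1, 1, 32, 32))
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```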
- One problem with this approach is that it can be slow. This is because (i) upsampling two whole channels requires additional computation time and (ii) the rest of the pipeline then operates in a higher resolution which means there are more computations overall (e.g. in the flow estimation, the warping, the residual estimation, and so on).
- This first approach is accordingly referred to as a naive approach as it focuses on simplicity and does not take into account any synergies that YUV 4:2:0 may have with an AI-based compression pipeline.
- this approach is computationally slow and substantially increases run time as the entire compression pipeline would be running in the higher YUV 4:4:4 space.
- Another naive approach is to achieve matching tensor dimensions by introducing two or more additional convolution operations to an input layer of the pipeline that operate with one stride for the Y channel and a different stride for the UV channels.
- the shape of the tensor is identified and the respective convolution kernels for the Y channel and for the UV channels are identified.
- the Y, U and V channels are identified, the stride 4 convolution is performed on the Y channel, and the stride 2 convolution is performed on the U and V channels.
- the resulting convolved channels now have matching dimensions so can be summed to produce an overall uniformly shaped tensor which can be fed into the neural networks of the AI-compression pipeline in the usual way.
- This approach is an improvement over the first naive approach because the resulting tensor after the convolution does not have upsampled U and V channels so the runtime of the overall pipeline is faster.
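- A minimal sketch of this second approach is given below, assuming PyTorch; the choice of stride 4 for the luma plane and stride 2 for the chroma planes follows the example above, while the kernel sizes, output channel count and remaining details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StridedYUVInput(nn.Module):
    """Map a YUV 4:2:0 frame to a single uniformly shaped feature tensor.

    The luma plane is convolved with stride 4 and the chroma planes with
    stride 2 so that both outputs have spatial size (H/4, W/4) and can be summed.
    """
    def __init__(self, out_channels=32):
        super().__init__()
        self.conv_y = nn.Conv2d(1, out_channels, kernel_size=5, stride=4, padding=2)
        self.conv_uv = nn.Conv2d(2, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, y, uv):
        # y: (B, 1, H, W), uv: (B, 2, H/2, W/2)
        return self.conv_y(y) + self.conv_uv(uv)

layer = StridedYUVInput()
out = layer(torch.rand(1, 1, 64, 64), torch.rand(1, 2, 32, 32))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```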
- In the approach of the present concept 3, a space to depth pixel unshuffle operation with a block size of 4 is applied to the luma channel, and a space to depth pixel unshuffle operation with a block size of 2 is applied to each of the chroma channels. These are then stacked to produce the final tensor comprising 24 channels. That is, the space to depth pixel unshuffle operation with a block size of 4 divides the luma channel into 4x4 blocks of pixels and distributes the pixels of each block into the channel dimension to produce 16 channels.
- the space to depth pixel unshuffle operation with a block size of 2 divides the U channel into 2x2 blocks of pixels and distributes these into the channel dimension to produce 4 channels, and the same applies to the V channel, which produces 4 more channels, giving 24 channels overall.
- This approach results in two advantages over the above-described naive approaches. Firstly, the present inventors have realised that, on many hardware devices, it is faster to perform operations on data that has a smaller spatial resolution and larger channel dimension than it is to perform the same operations on larger spatial resolutions but smaller channel dimensions.
- Secondly, the space to depth pixel unshuffle operations have the synergistic effect of increasing the receptive field of any subsequent convolution operations of the neural networks as they perform their respective convolutions on the input tensor (for example, in the flow encoder/decoder and/or residual encoder/decoder). That is, in a given convolution window, the output can only be influenced by whatever information is in the input.
- For a single channel, the receptive field is the single n×n grid of pixels of that channel covered by the convolution window. If we add additional channels to that convolution window, the output can now be influenced by the additional channel information. If we distribute pixels that originated from outside the n×n grid into one or more additional/new channels that are included in the convolution window, then we are allowing the output to be influenced by these new pixels from outside the n×n grid, thus indirectly increasing the receptive field of the convolution without needing to increase the spatial dimensions of the n×n grid.
- the increased receptive field allows the neural networks of the pipeline to harness spatial correlations between pixels better to thereby more efficiently learn what information is redundant and can be compressed away, thereby achieving improved compression rate and improved image reconstruction accuracy for a given bit rate.
- distributing the spatial dimension information into the channel dimensions using the applicable space to depth pixel unshuffle operations on the respective Y and UV channels solves the problem of applying convolutions throughout the pipeline to the non-uniformly shaped YUV 4:2:0 input tensor while, at the same time, the resulting 24-channel tensor achieves improved compression rates and image reconstruction accuracy in a way that is computationally efficient and achieves runtime speed ups of the order of milliseconds.
- the shape of the tensor is identified and the Y, U and V channels extracted.
- a pixel unshuffle operation with block size 4 is applied to the Y channel
- a pixel unshuffle operation with block size 2 is applied to the U channel
- a pixel unshuffle operation with block size 2 is applied to the V channel.
- the resulting 16 channel Y tensor, 4 channel U tensor and 4 channel V tensor are stacked to produce a 24 channel tensor of uniform shape which can be fed into the flow and/or residual modules of the AI-based compression pipeline as before. It will be appreciated that the above described block sizes and dimensions for the reshaping operations (e.g. block sizes of 4 and 2) are exemplary only, and other block sizes and dimensions are also envisaged.
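- A minimal sketch of this space to depth approach is given below using PyTorch's pixel_unshuffle, with a block size of 4 on the Y plane and 2 on each of the U and V planes as described above; tensor layouts and function names are otherwise illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def yuv420_space_to_depth(y, u, v):
    """Pack a YUV 4:2:0 frame into a uniformly shaped 24-channel tensor.

    y: (B, 1, H, W); u, v: (B, 1, H/2, W/2).  A block size of 4 on the luma
    plane and 2 on each chroma plane yields tensors of spatial size (H/4, W/4)
    with 16 + 4 + 4 = 24 channels after stacking.
    """
    y16 = F.pixel_unshuffle(y, downscale_factor=4)   # (B, 16, H/4, W/4)
    u4 = F.pixel_unshuffle(u, downscale_factor=2)    # (B, 4, H/4, W/4)
    v4 = F.pixel_unshuffle(v, downscale_factor=2)    # (B, 4, H/4, W/4)
    return torch.cat([y16, u4, v4], dim=1)

packed = yuv420_space_to_depth(torch.rand(1, 1, 64, 64),
                               torch.rand(1, 1, 32, 32),
                               torch.rand(1, 1, 32, 32))
print(packed.shape)  # torch.Size([1, 24, 16, 16])
```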
- the shape of the input tensor is determined and the number of spatial dimensions to be reshaped is calculated.
- An empty list, referred to here as the intermediate shape list, is initialised to store the shape of the intermediate reshaped tensor.
- the remaining dimensions (non-spatial dimensions) from the input tensor are appended to the intermediate shape list, and the input tensor is reshaped according to this list to obtain the intermediate reshaped tensor.
- the above generalised approach accordingly provides a framework for distributing pixels from the spatial dimension into channel dimensions to increase the receptive window of the subsequent convolution operations of the AI-based compression pipeline in a way that is compute efficient.
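- A sketch of such a generalised space to depth reshape is shown below, assuming PyTorch; it follows the intermediate-shape idea described above, although the exact channel ordering it produces may differ from that of a library pixel unshuffle, and the variable names are illustrative.

```python
import torch

def space_to_depth(x, block_size):
    """Generalised space-to-depth for a tensor of shape (B, C, *spatial).

    Each spatial dimension is split into (dim // block_size, block_size); the
    block_size factors are then moved into the channel dimension.
    """
    batch, channels, *spatial = x.shape
    # Intermediate shape: every spatial dim is factored into (dim // b, b).
    inter_shape = [batch, channels]
    for dim in spatial:
        inter_shape.extend([dim // block_size, block_size])
    x = x.reshape(inter_shape)
    # Move the block factors next to the channel dimension.
    n = len(spatial)
    block_axes = [2 + 2 * i + 1 for i in range(n)]
    grid_axes = [2 + 2 * i for i in range(n)]
    x = x.permute([0, 1] + block_axes + grid_axes)
    return x.reshape([batch, channels * block_size ** n] + [d // block_size for d in spatial])

out = space_to_depth(torch.rand(1, 1, 64, 64), block_size=4)
print(out.shape)  # torch.Size([1, 16, 16, 16])
```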
- the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the computer storage medium is not, however, a propagated signal.
- data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program include, by way of example, computers based on general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices, that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A method for lossy image or video encoding, transmission, and decoding, the method comprising the steps of: with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
Description
Method and data processing system for lossy image or video encoding, transmission and decoding BACKGROUND This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding. There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted. To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that
is not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files. In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations. A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest are inferred from information derived from temporal correlations. In the realm of lossy video compression in particular, this redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality. I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video. P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality. B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames
encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted. An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference. An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural
video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. SUMMARY According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
Optionally, the subset comprises a single channel of the plurality of channels of each of the first image and the second image. Optionally, the plurality of channels of each of the first image and the second image comprise a luma channel and at least one chroma channel. Optionally, the subset comprises the luma channel. Optionally, the luma channel and the at least one chroma channel are defined in a YUV colour space. Optionally, the at least one chroma channel has a different resolution to the luma channel. Optionally, the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space. Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. Optionally, a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: using the representation of optical flow information at one of said plurality of resolutions to warp a representation of the first image at a different resolution of said plurality of resolutions. Optionally, wherein a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions. Optionally, the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels. According to an aspect, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
According to an aspect, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. Optionally, said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image.
Optionally, the method comprises warping the data in the respective chroma channels using the downsampled representation of the optical flow information. Optionally, the method comprises warping the data in the respective luma channels using the downsampled representation of the optical flow information. Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. Optionally, a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions. Optionally, the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data. Optionally, the respective luma channels and chroma channels are defined in a YUV colour space. Optionally, the respective chroma channels have a different resolution to the luma channel. Optionally, the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
According to an aspect, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. According to an aspect, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective
luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information
using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. Optionally, the information comprises pixel values on said plurality of image channels. Optionally, the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels. Optionally, the transformed data comprises a plurality of image channels each having a same spatial resolution. Optionally, said same spatial resolution is lower than the spatial resolution of said subset of channels. Optionally, said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.
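By way of a non-limiting illustration of distributing information from the spatial dimension into the channel dimension, the following sketch (Python with PyTorch) assumes a YUV 4:2:0 input with a luma plane of shape (N, 1, H, W) and chroma planes of shape (N, 1, H/2, W/2), and uses illustrative pixel-unshuffle block sizes of 4 for luma and 2 for chroma so that all resulting channels share one spatial resolution:

import torch
import torch.nn.functional as F

def space_to_depth_yuv420(y, u, v):
    y_c = F.pixel_unshuffle(y, downscale_factor=4)   # (N, 16, H/4, W/4)
    u_c = F.pixel_unshuffle(u, downscale_factor=2)   # (N, 4, H/4, W/4)
    v_c = F.pixel_unshuffle(v, downscale_factor=2)   # (N, 4, H/4, W/4)
    # All channels now have the same spatial resolution and can be concatenated
    # along the channel dimension, giving 24 channels from the original 3.
    return torch.cat((y_c, u_c, v_c), dim=1)         # (N, 24, H/4, W/4)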
Optionally, the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data. Optionally, the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data. Optionally, the transformation comprises performing a pixel unshuffle operation on the data. Optionally, the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data. Optionally, the transformation comprises performing a convolution operation on the data. Optionally, the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data. Optionally, the transformation comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution. Optionally, one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and
the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system. According to an aspect of the present disclosure, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving the latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and
with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect of the present disclosure, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the invention will now be described by way of examples, with reference to the following figures in which:
Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
Figure 4 illustrates an example of a video compression, transmission and decompression system.
Figure 5 illustrates an example of a video compression, transmission and decompression system.
Figure 6 illustrates an example of a flow module of a video compression, transmission and decompression system.
Figure 7a illustrates an example of a video compression, transmission and decompression system.
Figure 7b illustrates an example of a video compression, transmission and decompression system.
Figure 8 illustrates an example of a flow module of a video compression, transmission and decompression system.
Figure 9 illustrates an example of a flow module of a video compression, transmission and decompression system.

DETAILED DESCRIPTION OF THE DRAWINGS

Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process.
In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image. Each H × W data point of the image represents a pixel value of the image at the corresponding location. Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue components of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video. The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference
between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network. Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation. AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network. Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a
weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer. Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network. To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural
network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the network parameters θ. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, where simultaneous training is performed with back-propagation through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
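The simultaneous update of a plurality of networks described above may be sketched as follows (Python with PyTorch; the encoder, decoder, loss_fn and optimiser names are placeholders rather than a specific implementation of the present disclosure):

import torch

def train_step(encoder, decoder, optimiser, x, expected, loss_fn):
    latent = encoder(x)
    output = decoder(latent)
    loss = loss_fn(output, expected)
    optimiser.zero_grad()
    loss.backward()   # the gradient of the loss is back-propagated through decoder and encoder
    optimiser.step()  # the parameters of all networks are updated simultaneously
    return loss.item()

# A single optimiser over the parameters of every network realises the simultaneous,
# end-to-end update, for example:
# optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)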
It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function. The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data. Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution. The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to
reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge. The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting. It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and depend, for example, on a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately.
In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ ∗ R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network. In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/ xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/). An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the
continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function. In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be, for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network. In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder. Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close as possible to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the
lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions. In other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference. Something which is closely linked to the entropy encoding of the latent distribution and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or another quantisation model may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the
rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term. In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned, which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving an optimal rate and distortion trade-off. The system described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline. The AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder fhθ and a trained neural
network 125 acting as a hyper-decoder ghθ. An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above. In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to the trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present
on the device used for encoding to provide the parameters to be used in the entropy encoding process 150. Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150, 155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised. To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input, providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between a ground truth input and a fake input.
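A minimal sketch of such a discriminator loss (Python with PyTorch; disc is a hypothetical discriminator module returning one logit per input, x is the input image 5 and x_hat is the output image 6) is:

import torch
import torch.nn.functional as F

def discriminator_loss(disc, x, x_hat):
    real_score = disc(x)                 # a ground truth input should receive a high score
    fake_score = disc(x_hat.detach())    # a reconstructed (fake) input should receive a low score
    loss_real = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
    loss_fake = F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    # Minimising this loss maximises the difference in indication between ground truth and fake inputs.
    return loss_real + loss_fake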
When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed,
transmitted and decompressed individually. The received frames may then be grouped to obtain the original video. The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time. The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
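A non-limiting sketch of applying a flow map by bilinear warping (Python with PyTorch; flow is assumed to be a dense map of per-pixel displacements in pixels with shape (N, 2, H, W)) is:

import torch
import torch.nn.functional as F

def bilinear_warp(frame, flow):
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(frame.dtype)   # (H, W, 2) in x, y order
    coords = base + flow.permute(0, 2, 3, 1)               # absolute sampling positions
    gx = 2.0 * coords[..., 0] / (w - 1) - 1.0              # normalise to [-1, 1] for grid_sample
    gy = 2.0 * coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

Each pixel of the warped output is sampled from the earlier frame at the position indicated by the flow vector for that pixel, which is one way of realising the warping or motion compensation referred to above.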
An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline. As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in x0 indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline. In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at
position t), but also one or more previously reconstructed images x̂t−1 from an earlier frame t-1. As described above, the previously reconstructed image x̂t−1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides. At the encode side, previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed xt and the previously reconstructed image from an earlier frame x̂t−1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂t−1 and xt, which is indicative of inter-frame movement of pixels or pixel groups between x̂t−1 and xt. The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f̂. The reconstructed flow map f̂ is applied to the previously reconstructed image x̂t−1 to generate a warped image x̂t−1,w. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further
envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image x̂t−1,w is a prediction of how the previously reconstructed image x̂t−1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of xt and x̂t−1. As with the I-frame, the reconstructed flow map f̂ and corresponding warped image x̂t−1,w may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides. In the example of Figure 3, both the image being compressed xt and the warped image x̂t−1,w are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image xt and the warped image x̂t−1,w. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r̂ from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between xt and the warped image x̂t−1,w and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function.
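The hand-crafted residual alternative mentioned above may be sketched as follows (Python with PyTorch tensors; residual_codec is a hypothetical autoencoder that compresses and reconstructs the pre-calculated residual map):

def handcrafted_residual_roundtrip(x_t, warped_prev, residual_codec):
    residual = x_t - warped_prev              # pre-calculated residual map
    residual_hat = residual_codec(residual)   # compressed, transmitted and reconstructed
    x_t_hat = warped_prev + residual_hat      # combination by addition on the decode side
    return x_t_hat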
Finally, on the decode side, the residual map r̂ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂t which is a reconstruction of image xt and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress xt+1, and so on until an entire video stream or chunk of a video stream has been processed. Alternatively, the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream by removing the connection between x̂t−1,w and the output of the residual block 210, thereby eliminating any direct combination step with the warped previously decoded image to speed up inference. In this case, the flow information is intuitively understood to be indirectly captured within the residual information, which the residual decoder is able to learn to use to directly reconstruct the output image x̂t. Alternatively, the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream in combination with some representation of flow injected into one or more layers of the residual decoder. In this case, the flow information is intuitively understood to be indirectly captured within the injected information, which the residual decoder is able to learn to use while decoding the latent representation of flow information to directly reconstruct the output image x̂t. Thus, for a block of video frames comprising an I-frame and N subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual
map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable. Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689). The above-described flow and residual based approach is highly effective at reducing the amount of data that is to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂t−1) is available, the encode side only compresses and transmits a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame. Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively. That is, compute
for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU. The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications. The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature. For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations. A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above.
Concept 1: Single-channel flow estimation

As described above, two commonly used color spaces in digital image processing are RGB (Red, Green, Blue) and YUV. While RGB is widely used in computer graphics and digital displays, YUV is frequently employed in video compression and transmission. In more detail, RGB is an additive color model that represents colors as a combination of red, green, and blue light. Each color channel is typically represented by an 8-bit value, ranging from 0 to 255. The combination of these three channels allows for the representation of a wide range of colors. In the RGB color space, a color is represented as a triplet (R, G, B), where R, G, and B are the intensity values of the red, green, and blue components, respectively. For example, (255, 0, 0) represents pure red, (0, 255, 0) represents pure green, and (0, 0, 255) represents pure blue. Black is represented as (0, 0, 0), while white is represented as (255, 255, 255).

YUV Color Space

In contrast, YUV separates the color information into two components: luminance (Y) and chrominance (U and V). The YUV color space takes advantage of the human visual system’s sensitivity to brightness and color information.

Luminance (Y)

The luminance (or Luma) component (Y) represents the brightness or intensity of a pixel. It is calculated as a weighted sum of the RGB components: Y = w1R + w2G + w3B, where w1, w2, w3 are weights assigned to each color channel based on the human eye’s sensitivity to different wavelengths of light. The green channel is given the highest weight because the human eye is most sensitive to green light. A non-limiting example set of weights may be: Y = 0.299R + 0.587G + 0.114B. The chrominance components (U and V) represent the color information in the YUV color space. They are calculated by subtracting the luminance value from the blue and red color channels, respectively, for example: U = B − Y and V = R − Y. YUV can also be defined in a digitised representation referred to as BT.601. The BT.601 standard, also known as CCIR 601, provides recommendations for standard-definition digital video. It specifies the digitization of the YUV color space components using an 8-bit representation, essential for converting analog video signals into a digital format for television broadcasts, DVDs, and early streaming video formats. In digital systems, the Y, U, and V components are represented by 8-bit values, allowing for 256 levels of intensity ranging from 0 to 255. This quantization process converts the continuous range of YUV values into discrete levels suitable for digital storage and processing. The BT.601 standard defines the conversion formulas from RGB to YUV in a digital context, taking into account digital system characteristics and the need for efficient color information encoding. In one illustrative example, the digital representation of the luminance (Y) component from RGB values is given by the formula: Y = 16 + (65.481 · R + 128.553 · G + 24.966 · B). The chrominance components may then be calculated with the following formulas: U = 128 + (−37.797 · R − 74.203 · G + 112.0 · B) and V = 128 + (112.0 · R − 93.786 · G − 18.214 · B). In these formulas, R, G and B are the digital values of the red, green, and blue components, normalized to the range of 0 to 1 (e.g., a value of 255 in an 8-bit system is represented as 1.0). The constants include offsets (16 for Y and 128 for U and V) to center the chrominance components, and scaling factors to adjust the amplitude of the signals within the 8-bit range.
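A minimal sketch of the BT.601-style conversion above (Python; r, g and b are assumed to be normalised to the range 0 to 1, and the outputs are 8-bit digital Y, U and V levels) is:

def rgb_to_yuv_bt601(r, g, b):
    y = 16.0 + (65.481 * r + 128.553 * g + 24.966 * b)
    u = 128.0 + (-37.797 * r - 74.203 * g + 112.0 * b)
    v = 128.0 + (112.0 * r - 93.786 * g - 18.214 * b)
    return y, u, v

# For example, pure white (1.0, 1.0, 1.0) maps to approximately (235, 128, 128).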
More generally, in YUV color spaces, the U component represents the difference between the blue channel and the luminance, while the V component represents the difference between the red channel and the luminance. By separating the luminance and chrominance information, YUV allows for more effective compression techniques because the human visual system is more sensitive to changes in brightness than changes in color, so the luminance component can be compressed with less loss of perceived quality compared to the chrominance components. The YUV color space also allows for subsampling of the chrominance components to further reduce the amount of data required for video transmission. Subsampling involves reducing the spatial resolution of the U and V components while preserving the full resolution of the luminance component. Known subsampling schemes include:
4:4:4 - No subsampling. The U and V components have the same resolution as the luminance component.
4:2:2 - The U and V components are subsampled horizontally by a factor of 2.
4:2:0 - The U and V components are subsampled both horizontally and vertically by a factor of 2.
Subsampling reduces the amount of color information without significantly impacting the perceived visual quality, as the human eye is less sensitive to color details than brightness details. Consider next the illustrative flow-residual compression pipeline shown in Figure 5, which shows flow encoder/decoder networks and residual encoder/decoder networks of an AI based
compression pipeline. This may be, for example, similar to the AI based compression pipeline of the type shown in Figure 3. The flow encoder neural network takes a current image xt and a previous image xt−1, encodes these into a flow latent representation which is optionally quantised and entropy encoded and transmitted as a bitstream. On the decode side, the bitstream is received and entropy decoded into the flow latent representation that the flow decoder neural network uses as input to produce a representation of flow f̂, e.g. a flow map. The flow map f̂ is applied to a previously decoded image x̂t−1 to generate a warped version of that previously decoded image x̂t−1,w. The warped version of the previously decoded image x̂t−1,w is then fed into the residual encoder network, together with the current image xt, to produce a residual latent representation which is optionally quantised and entropy encoded and transmitted as a bitstream. On the decode side, the bitstream is received and entropy decoded back into the residual latent representation and used by the residual decoder neural network, in combination with information from the warped version of the previously decoded image x̂t−1,w, to produce the reconstructed image x̂t. In the case of Figure 5, the information associated with the warped previously decoded image x̂t−1,w is optionally first processed by a module referred to hereinafter as a composition adapter, for example to downsample and/or pad it before it is fed into the residual decoder together with the entropy decoded residual latent representation to produce the final reconstructed image x̂t. This process may then be repeated for xt+1 and so on to encode, transmit, and decode a sequence of frames.
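A non-limiting sketch of a composition adapter of the kind described above (Python with PyTorch; the scale factor and padding amounts are illustrative placeholders rather than values required by the pipeline) is:

import torch.nn.functional as F

def composition_adapter(warped_prev, scale=0.5, pad=(0, 0, 0, 0)):
    # Downsample the warped previously decoded image ...
    adapted = F.interpolate(warped_prev, scale_factor=scale,
                            mode="bilinear", align_corners=False)
    # ... and optionally pad it spatially before it is fed to the residual decoder
    # together with the entropy decoded residual latent representation.
    return F.pad(adapted, pad, mode="replicate")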
An example implementation of the flow encoder neural network 207 is shown in Figure 6. Figure 6 illustrates an example of a flow module, in this case a network 600, configured to estimate information indicative of a difference between an image xt−1 and an image xt, e.g. flow information. Figure 6 is provided as an example of how such flow information may be calculated between two input or output images. The network 600 comprises a set of layers 601a, 601b respectively for an image xt−1 and an image xt from respective times or positions t − 1 and t in a sequence of image frames. The set of layers 601a, 601b may define one or more convolution operations and/or nonlinear activations in the layers to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 × w/2 downsampling in a first layer, h/4 × w/4 downsampling in a second layer, h/8 × w/8 downsampling in a third layer, h/16 × w/16 downsampling in a fourth layer, h/32 × w/32 downsampling in a fifth layer, h/64 × w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other levels are also envisaged. With the downsampling operations performed and the corresponding pyramid of feature maps generated, a first cost volume 602 is calculated at the most coarse level between the pixels of the first image xt−1 and the corresponding pixels in the second image xt. Cost volumes define the matching cost of matching the pixels in the initial image with the pixels in the later image. That is, the closeness of each pixel, or a subset of all pixels, in the initial image
to one or more pixels in the later image is determined using a measure of closeness such as a vector dot product, a cosine similarity, a mean absolute difference, or some other measure of closeness. This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the measure of closeness is being calculated. Finally, a first flow 603 can be estimated from the first cost volume 602. This may be achieved using, for example, a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person. The same process is then repeated for the other levels of coarseness to calculate a second cost volume 604 and second flow 605, and so on until the cost volumes and flows associated with each of the levels of coarseness have been calculated, up to the final cost volume 606 and flow 607. The weights and/or biases of any activation layers in network 600 (e.g. optionally in the downsampling convolution layers and/or in a flow extractor network that produces flow maps from the cost volumes) are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline. The trainable nature of these parameters provides the network 600 with flexibility to produce feature maps at
each level of spatial resolution (i.e. pyramid feature maps) and/or at the flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on). For example, it allows the network 600 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline, as is shown in the illustrative example of Figure 7a, described later. Optionally, for each level of coarseness or resolution, the flow of the previous level or levels of coarseness or resolution may be used to warp 608, 609 the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t-1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details. The inventors have realized that removing warping entirely or in some levels of coarseness or resolution can substantially reduce the runtime of the flow calculation while maintaining good levels of flow accuracy. As the warping process uses inputs from different levels of coarseness or spatial resolution, the flow estimation output may be upsampled 610, 611 first to match the coarseness or resolution of the feature map to which the flow is being applied in the warping process. The outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
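A minimal sketch of how a local cost volume and flow-guided warping of this kind might be implemented is given below. It is illustrative only and assumes a PyTorch-style tensor layout (batch, channels, height, width); the function names local_cost_volume and warp_with_flow, the default search radius, and the bilinear sampling choice are assumptions rather than the exact implementation of network 600.

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat_prev, feat_curr, radius=4):
    # Correlate each pixel of feat_prev with pixels of feat_curr inside a
    # (2*radius+1)^2 search window, producing one cost channel per offset.
    b, c, h, w = feat_prev.shape
    padded = F.pad(feat_curr, [radius] * 4)
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # Dot product over the channel dimension as the closeness measure.
            costs.append((feat_prev * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)  # shape: (b, (2r+1)^2, h, w)

def warp_with_flow(feat, flow):
    # Bilinearly sample feat at positions displaced by flow (in pixels).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, h, w)
    coords = grid.unsqueeze(0) + flow                            # (b, 2, h, w)
    # Normalise to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)        # (b, h, w, 2)
    return F.grid_sample(feat, grid_norm, mode="bilinear", align_corners=True)
```

In a coarse-to-fine arrangement, the flow estimated at a coarser level would first be upsampled (e.g. with F.interpolate) and scaled before being passed to warp_with_flow at the next finer level.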
Note that whilst the term cost volume has been used above, the information indicative of differences between the respective inputs need not be a strict cost volume in the mathematical sense, but may be any representation of this information. For example, a compressively calculated cost volume may be used, applying the principles of compressive sensing to estimate an approximate cost volume by sampling only a small number of pixel differences compared to performing a complete pixel-wise cost volume calculation. This compressively calculated cost volume approach may be applied to all embodiments described herein. As described above, it will be appreciated that running a flow-based compression pipeline once training has been completed relies on estimation of flow, whether by a handcrafted algorithm or through some trained network. The output estimated flow itself may be compressed and transmitted in a bitstream, which is typically done by a dedicated flow encoder network that encodes the flow information into a latent representation distributed according to a distribution that can be efficiently entropy encoded. Irrespective of the flow estimation approach taken, dedicated flow encoders increase run time and partly contribute to preventing learned compression codecs from running in real time or near real time. Consider for example the pipeline of Figure 3. At the encode side, a previously reconstructed image and a new, to-be-encoded image may be used to generate a flow map containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the current image being compressed x_t and the previously reconstructed image from an earlier frame x̂_{t-1} are passed into a flow module 207 that typically has two parts: a flow estimation part, and a dedicated encoder part that encodes the estimated flows into a
latent representation that can be efficiently entropy encoded. The dedicated flow encoder is effectively a specialized network dedicated to producing latent representations of flow that are distributed as close to an optimally entropy encodable distribution as possible. In Figure 3, these two parts are shown as a single component. The flow estimation part may comprise, for example, the flow module 600 of Figure 6. That is, the flow module 600 produces its cost volumes and flows, then passes one or more of these into the second part: the dedicated flow encoder network that is trained to produce the latent representation of flow that can be efficiently entropy encoded before being sent in a bitstream to the decoder. This approach is slow because the cost volumes and flows must first be calculated before they can be encoded into a latent representation, which is itself a slow process. The latent representation of the cost volume(s) and/or flow(s) is then entropy encoded into a bitstream which is finally transmitted. This multi-step approach increases run time. Instead, in order to reduce compute and runtime overhead, the present disclosure also envisages omitting the dedicated flow encoder. For example, the output(s) of the flow module 600 are directly entropy encoded and fed into the bitstream without first being encoded into a latent representation. Given that the process of encoding flow module output(s) into a latent representation with a flow compression encoder is computationally expensive, removing this component entirely from the flow compression module 207 results in a significant decrease in runtime, thereby contributing to the goal of being able to run the pipeline in inference in real time or near real time.
This is, in part, made possible by virtue of the trainable nature of the flow module 600, whereby the weights and biases of one or more of the convolution and/or activation layers of the flow module 600 result in cost volumes or flows that are already distributed according to a distribution that corresponds roughly to that which a dedicated flow encoder may produce. This is illustrated in more detail in Figure 7a. Figure 7a illustratively shows two flow compression modules 700a and 700b which may be used as flow module 207 in Figure 3. The same reference numerals are used for like components. In the first module 700a, two input images x_t and x̂_{t-1} are passed 701a, 701b into a flow module such as flow module 600 of Figure 6, which produces pyramid feature maps 702 of different levels of coarseness, and corresponding cost volumes and/or flows 703. The final flow estimation is then passed to a dedicated flow compression encoder 704 which encodes it to produce a latent representation of the final flow estimation. The output may thus be a latent representation of cost volumes and/or flows at one or more of H × W, H/2 × W/2, H/4 × W/4, H/8 × W/8, H/16 × W/16, H/32 × W/32, H/64 × W/64, or some other resolution. This is finally entropy encoded into a bitstream 705, and transmitted. The bitstream 705 is entropy decoded and the decoded bitstream is passed to a decoder 606 which reconstructs the cost volumes and/or flows which may be used in a flow-based approach as described above in the general concepts section. However, as described above, the approach of the flow compression module 700a with a dedicated flow compression encoder 704 is slow and computationally expensive.
In contrast, the present inventors have realized that omitting the dedicated flow compression encoder 704 entirely and instead directly entropy encoding one or more outputs of the flow module 600 into the bitstream 705, without first passing them through the dedicated flow encoder 704, results in a substantial speed up at run time. Counter-intuitively, this removal of the dedicated flow compression encoder 704 does not appear to appreciably affect the performance of the rest of the compression pipeline, both in terms of distortion and bit rate. This modified approach is illustrated with flow compression module 700b in Figure 7a, where it is apparent that the flow encoder 704 of flow compression module 700b has effectively been chopped out. Thus, instead of encoding the cost volumes and/or flows into a latent representation before the entropy encoding step, the cost volumes and/or flows at one or more of H × W, H/2 × W/2, H/4 × W/4, H/8 × W/8, H/16 × W/16, H/32 × W/32, H/64 × W/64, or some other resolution, are simply entropy encoded directly and sent out in the bitstream. This approach is based on the insight that flow modules such as flow module 600 have trainable parameters and accordingly have a great deal of flexibility in terms of the distribution of outputs they can be trained to produce. The inventors have found that when the dedicated flow compression encoder 704 is included, the flow module 600 has no need to produce outputs in a distribution that can be efficiently entropy encoded because the neural networks that make up the compression pipeline as a whole are simply able to rely on the dedicated flow encoder 704 to minimize any contribution to bitrate that the flow information has in the bitstream. However, when the dedicated flow compression encoder 704 is removed, the networks of the compression pipeline are no longer able to rely on the dedicated flow encoder 704. In its place, the inventors have found that the flow module 600 learns during training to compensate by outputting cost
volumes and/or flows that are similarly distributed to those that would be output by a dedicated flow compression encoder 704. This can be understood as the flow module 600 (that is, the trainable networks within it, such as the flow extractor network and/or activation layers in the convolution layers) effectively being forced to mimic the dedicated flow compression encoder 704 during training when the loss is being minimized, because the system can no longer rely on the (now removed) dedicated flow compression encoder 704 to perform that task. It is envisaged that the training of the flow module 600 may either be performed in an end-to-end manner together with the rest of the compression pipeline, or alternatively in a student-teacher approach where the network of the flow module 600 is the student component, and a known optical flow model and pre-trained flow compression network is the teacher component. Additionally or alternatively, the training of the flow module 600 may be performed separately using data for which the groundtruth flow is known, for example by using 3D animation video data whose groundtruth flow is known a priori through the animation program used to generate it, or using automatic flow generation methods. This counter-intuitive removal of the dedicated flow compression encoder 704 from the flow compression module, to force the other components to effectively take on the tasks previously performed by the dedicated flow compression encoder 704, contributes significantly to a speed up at run time on the encoding side of the compression pipeline. This approach accordingly makes a substantial stride forward towards the goal of achieving real time or near real time performance during inference.
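By way of illustration only, the difference between the two encode-side paths discussed above might be sketched as follows. The class names, the rounding-based quantisation step and the entropy_encode callable are assumptions introduced for the sketch and are not taken from the disclosure itself.

```python
import torch
import torch.nn as nn

class FlowCompressionWithEncoder(nn.Module):
    """Path of module 700a: flow module followed by a dedicated flow encoder."""
    def __init__(self, flow_module, flow_encoder, entropy_encode):
        super().__init__()
        self.flow_module = flow_module      # e.g. a pyramid network such as 600
        self.flow_encoder = flow_encoder    # dedicated latent-producing encoder (704)
        self.entropy_encode = entropy_encode

    def forward(self, x_t, x_prev_hat):
        flows = self.flow_module(x_t, x_prev_hat)   # cost volumes and/or flows
        latent = self.flow_encoder(flows)           # extra network pass (slow)
        return self.entropy_encode(torch.round(latent))

class FlowCompressionDirect(nn.Module):
    """Path of module 700b: the dedicated flow encoder is omitted entirely."""
    def __init__(self, flow_module, entropy_encode):
        super().__init__()
        self.flow_module = flow_module
        self.entropy_encode = entropy_encode

    def forward(self, x_t, x_prev_hat):
        flows = self.flow_module(x_t, x_prev_hat)
        # The flow module is trained so that its outputs are already distributed
        # in a way that can be entropy encoded efficiently.
        return self.entropy_encode(torch.round(flows))
```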
Figure 7b illustrates the introduction of the flow compression module 700b into a flow-residual compression pipeline, such as that of Figure 3, by illustratively showing the encoding and decoding of p-frames of an image stream, whereby a groundtruth image and its previously encoded and decoded reconstruction at t-1 are available. This corresponds generally to the flow-residual compression pipeline shown in Figure 3 and accordingly uses the same reference numbers for corresponding features. However, in this case, the output of the flow module 702 is natively in the distribution of a latent representation and can be immediately transmitted to the decoder, without needing a standalone dedicated flow compression encoder. In addition, it is envisaged that the warped reconstructed, previously decoded image from t-1 may be passed directly to the residual decoder, in addition to whatever is output by the residual encoder 211. This provides the residual decoder with the additional context of the warped reconstructed, previously decoded image. Returning now to the YUV colour space, the presently described pyramidal estimation of flow in AI-based compression pipelines provides some unexpected synergies when working in YUV colour space. Traditionally, AI-based compression research is performed on RGB data in RGB space. In RGB space, as the input tensors (with R, G and B channels) propagate through the networks, the channels in the tensor are all treated equally. For example, during flow estimation, such as described above in connection with Figures 6 to 7b, the pyramid layers and associated operations are performed treating the R, G and B channels of the input tensors equally. This is because, in RGB space, all channels contain an approximately equal amount of movement information in a frame sequence.
The present inventors have realised that, when moving out of RGB space and into YUV space, the flow information is largely contained in the luma channel. Accordingly, flow can be estimated using the flow module operating only on the luma channel. Cutting the number of channels that get fed through the pyramid layers from three channels to a single channel results in a substantial speed up with very little loss in overall performance in terms of image reconstruction accuracy (e.g. measured by distortion such as an MSE score) and compression performance (e.g. measured in bits per pixel). Accordingly, the present concept 1 is directed to replacing a multi-channel input tensor to the flow modules shown in Figures 6 to 7b with a single-channel tensor. Whilst this is envisaged to be the luma channel when working with YUV data, this concept may also be generalised to other types of data where one of the channels contains enough information to sufficiently estimate flow, as well as to the generation of custom data types comprising a plurality of channels where one of the channels is optimised for enabling an AI-based flow module such as that shown in Figures 6 to 7b to determine flow. For completeness, it is noted that concept 1 may also be combined with the other concepts described herein. Figure 8 illustratively shows a modified flow module 800 corresponding to that of Figure 6; like numbered references are used for like components. Additionally, a pre-processing module 801 is introduced before the flow module 800. This may form part of the flow module 800 itself or may form part of some other component of the pipeline, for example as a component of a data intake module (not shown) or other pre-processing modules.
The pre-processing module is configured to select the luma channel from a YUV input to feed into the flow module 800, after which the flow module 800 operates as described above in relation to Figure 6. The pseudocode provided below illustratively shows an exemplary operation of the pre-processing module, configured to select a luma channel from an input in YUV format:
Algorithm 1 Select Luma Channel from YUV Tensor
procedure SelectLumaChannel(yuv_tensor)
  height ← HeightOf(yuv_tensor)
  width ← WidthOf(yuv_tensor)
  luma_channel ← CreateMatrix(height, width)
  for i ← 1 to height do
    for j ← 1 to width do
      luma_channel[i][j] ← yuv_tensor[1][i][j]
    end for
  end for
  Return: luma_channel
end procedure
It will be appreciated that the above example is illustrative only and other methods of implementing a luma channel selection method will be known by the skilled person.
Concept 2: Downsampled YUV warping
One issue that arises in the implementation of concept 1 described above is facilitating the warping operations (for example as shown in Figure 6) performed during flow estimation when working in YUV space. In traditional compression, warping is used to exploit temporal redundancy between consecutive frames. The goal of warping is to estimate and compensate for the motion of objects or regions within a video sequence, allowing for more efficient compression by reducing the amount of information that is to be encoded. A frame is typically divided into smaller blocks, such as
macroblocks or coding units, which are then processed independently. Warping is performed on these blocks to find the best matching block in a reference frame, usually a preceding or succeeding frame, and to generate a motion vector that describes the displacement between the current block and its best match. The warping process can be described mathematically using a motion model. One commonly used motion model is the affine motion model, which assumes that the motion of a block can be represented by a linear transformation. The affine motion model is defined by a 2x3 matrix:

$$A = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix}$$

where $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$ represent the rotation, scaling and shear parameters, and $t_x$, $t_y$ represent the translation parameters. Given a pixel $(x, y)$ in the current block, its corresponding location $(x', y')$ in the reference frame can be computed using the affine transformation:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
The motion estimation process involves searching for the best matching block in the reference frame that minimizes a certain distortion metric, such as the sum of absolute differences (SAD) or the sum of squared differences (SSD). By way of example, the SAD metric can be expressed as:

$$SAD(dx, dy) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - R(i + dx, j + dy) \right|$$

where $C(i, j)$ represents the pixel values of the current block, $R(i + dx, j + dy)$ represents the pixel values of the candidate block in the reference frame shifted by $(dx, dy)$, and $N$ is the block size. The motion estimation search can be performed using various algorithms, such as full search, three-step search, or diamond search, to find the motion vector $(mv_x, mv_y)$ that minimizes the distortion metric:

$$(mv_x, mv_y) = \arg\min_{(dx, dy)} SAD(dx, dy)$$

Once the motion vector is determined, the current block can be reconstructed by copying the pixels from the reference frame at the displaced location indicated by the motion vector. This process is known as motion compensation. However, the affine motion model has limitations in representing complex motion patterns, such as non-rigid or deformable objects. More advanced motion models, such as the projective motion model or the elastic motion model, can be employed in traditional compression as an alternative. These models introduce additional parameters to capture more complex motion patterns at the cost of increased computational complexity. The projective motion model, for example, is defined by a 3x3 homography matrix:
$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
where $w'$ is a scaling factor, and the final coordinates are obtained by dividing $x'$ and $y'$ by $w'$. Elastic motion models, such as the free-form deformation (FFD) model, allow for even more flexibility in representing complex motion. FFD models define a deformation grid over the image and use spline interpolation to compute the displacement of each pixel based on the grid points. The choice of motion model depends on the characteristics of the video content and the desired trade-off between compression efficiency and computational complexity. In practice, video compression standards, such as H.264/AVC or H.265/HEVC, often use hierarchical motion estimation and compensation techniques, where the video frames are decomposed into a pyramid of resolutions, and motion estimation is performed at each level to capture both large-scale and small-scale motion. In contrast, in AI-based compression, warping has a more indirect role: facilitating the estimation of a more accurate representation of flow at different resolutions using trained neural network pyramid layers, as shown in e.g. Figure 6, which in turn can facilitate more accurate reconstruction by a fully neural network based residual encoder/decoder module, as shown in Figure 7b.
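As a purely illustrative aid to the traditional block-matching description above, a minimal full-search SAD sketch is given below. The function name, the numpy-style array layout and the default search range are assumptions introduced for the sketch; it is not part of any codec described herein.

```python
import numpy as np

def full_search_sad(current_block, reference_frame, top, left, search_range=8):
    # Exhaustively search a (2*search_range+1)^2 window in the reference frame
    # for the displacement (dx, dy) minimising the sum of absolute differences.
    n = current_block.shape[0]                 # block size N (square block assumed)
    h, w = reference_frame.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + n > h or x0 + n > w:
                continue                       # candidate block falls outside the frame
            candidate = reference_frame[y0:y0 + n, x0:x0 + n]
            sad = np.abs(current_block.astype(np.int32)
                         - candidate.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```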
The present inventors have realised that there are some improvements that can be made to AI-based flow estimation when working in YUV 4:2:2 and/or 4:2:0 space. For example, when the inputs into the flow encoder of Figure 6 are luma (Y) channels in full resolution, and the associated UV channels are in a non-full resolution (using, for example, the 4:2:0 or 4:2:2 subsampling scheme), the output flow representation that is fed into the warping step is in full resolution. Downstream, the final output that is eventually derived from the warped frame is to be assembled from the luma and chroma information. Accordingly, the chroma information still has to be reintroduced and recombined with the luma information in some way. That is, if the chroma information is missing entirely from the outputs of the flow estimation steps, then it would have to be sent separately in the bitstream so that the decode side gets the chroma information one way or another. Instead of sending chroma information separately, it is combined again with the luma information during warping. Accordingly, the luma (Y) channel information from which flow is being estimated is recombined with the chroma (UV) channel information before warping is performed on the combined YUV information. The present inventors have realised that the overall effect on image reconstruction accuracy and compression performance that the warping step provides in the architecture of Figure 6 is approximately the same irrespective of whether the warping step is performed in the full resolution of the luma (Y) channel or in the half resolution of the chroma (UV) channels. However, performing warping in the half resolution of the chroma (UV) channels is substantially faster computationally. Accordingly, the full-resolution luma (Y) channel information and any flow information derived therefrom can be downsampled to match the lower resolution of the (UV) channels before the warping step is performed in this
lower resolution. The resulting warped, combined YUV information is then used to estimate the flow for the next resolution layer of the feature pyramid. The same approach may be applied to each resolution layer of the feature pyramids of the architecture of Figure 6 to provide an overall speed up of the flow encoder module of 1-10 milliseconds across a wide variety of different hardware devices, thereby contributing to the goal of real time encoding and decoding speeds. The above described approach can also be generalised beyond YUV to any input data with multiple channels where one or more of the channels is in a different resolution to the rest of the channels in the input data. Figure 9 illustratively shows an example modified flow module 900 of the present disclosure demonstrating a non-limiting implementation example of concept 2. As with Figure 8, the inputs x_{t-1} and x_t are pre-processed to select respectively the luma and chroma channels. This may be performed with separate pre-processing modules 901a, 901b, with a single module, or with some other component of a data intake module of the pipeline, as will be appreciated by the skilled person. The luma channel is fed into the feature pyramid layers as in Figure 6, but the chroma channels are instead fed directly into the warping steps 608, 609. For the downsampling pyramid layer where the native chroma channel resolution matches the luma channel resolution, no further downsampling of the chroma channel is performed. For the downsampling pyramid layer where the native chroma channel resolution does not match the luma channel resolution, a downsampling step (not shown) may be performed on the chroma channels so that the warping is performed using matching luma and chroma channel resolutions.
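A minimal sketch of the downsampled-warping idea of concept 2 is given below, reusing the warp_with_flow helper sketched earlier. It assumes a PyTorch-style channel-first layout, 4:2:0 chroma at half the luma resolution, and hypothetical tensor names; it is illustrative only and not the implementation of Figure 9.

```python
import torch.nn.functional as F

def warp_yuv420_at_chroma_resolution(y, u, v, flow_full_res):
    # y:    (b, 1, H, W)      full-resolution luma
    # u, v: (b, 1, H/2, W/2)  half-resolution chroma (4:2:0)
    # flow_full_res: (b, 2, H, W) flow estimated from the luma channel only
    # Downsample the luma and the flow to the chroma resolution; the flow
    # displacements are halved because the pixel grid is now half as dense.
    y_half = F.interpolate(y, scale_factor=0.5, mode="bilinear", align_corners=False)
    flow_half = 0.5 * F.interpolate(flow_full_res, scale_factor=0.5,
                                    mode="bilinear", align_corners=False)
    # Warp all three channels in the cheaper half resolution.
    y_warped = warp_with_flow(y_half, flow_half)
    u_warped = warp_with_flow(u, flow_half)
    v_warped = warp_with_flow(v, flow_half)
    return y_warped, u_warped, v_warped
```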
The operation of the pre-processing modules 901a and 901b to select the respective luma and chroma channels may correspond to that described in relation to concept 1 above. It will further be appreciated that concepts 1 and 2 may also be combined alone or together with concept 3 described below.
Concept 3: Channel dimension stuffing
In traditional AI-based compression, operating in RGB space with equal resolution R, G and B channels, the implementation of the convolution operations of any of the neural networks is straightforward and requires no special consideration as the shape of the tensor is defined by three channels of equal dimensions. The same consideration applies to 4:4:4 YUV input data where each of the Y, U and V channels is in the same resolution, so the tensor comprises three channels of equal dimensions. However, the situation is different when considering 4:2:2 or 4:2:0 YUV data because the chroma channels (UV) are of different dimensions to the luma channel (Y). The shape of the tensor is accordingly more complicated and convolutions of the neural networks of the AI-based compression pipeline cannot be applied in a straightforward way to a non-uniformly shaped tensor. The present concept 3 is directed to solving this problem. One approach is to upsample the two chroma (UV) channels to match the resolution of the luma (Y) channel to produce a uniformly shaped input tensor (effectively pre-processing YUV
4:2:0 or YUV 4:2:2 into YUV 4:4:4 data) before performing any convolutions on the, now uniformly shaped, tensor. This approach is illustrated in more detail in the following pseudocode:
Algorithm 2 Upsample YUV 4:2:0 Tensor to YUV 4:4:4 Tensor
function upsample_yuv420_to_yuv444_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Perform upsampling for U and V channels
  upsampled_u_channel ← upsample(u_channel, scale_factor=2)
  upsampled_v_channel ← upsample(v_channel, scale_factor=2)
  ⊲ Stack the Y channel with the upsampled U and V channels
  yuv444_tensor ← stack([y_channel, upsampled_u_channel, upsampled_v_channel], axis=-1)
  Return: yuv444_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV).
The shape of the tensor is extracted and upsampling is performed using a scale factor of 2 on
the U and V channels. The Y channel is left alone. The Y channel is then stacked with the now
upsampled U and V channels and the resulting tensor is returned. The returned tensor now has
a uniform shape where all the channels have the same resolution and accordingly convolutions
can be performed on the tensor in the usual way.
The upsampling may comprise one or more of: nearest neighbour interpolation, bilinear
interpolation, bicubic interpolation, a transposed convolution (deconvolution) or any other
upsampling technique known to the skilled person.
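By way of a hedged illustration only, the U and V upsampling step of Algorithm 2 might look as follows in PyTorch; the interpolation mode, the channel-first layout and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def upsample_chroma_to_luma(u, v, mode="bilinear"):
    # u, v: (b, 1, H/2, W/2) chroma planes; returns (b, 1, H, W) planes that can
    # be concatenated with the full-resolution luma plane to form a 4:4:4 tensor.
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    u_up = F.interpolate(u, scale_factor=2, mode=mode, **kwargs)
    v_up = F.interpolate(v, scale_factor=2, mode=mode, **kwargs)
    return u_up, v_up
```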
One problem with this approach however is that it can be slow. This is because (i) upsampling
two whole channels requires additional computation time and (ii) the rest of the pipeline then
operates in a higher resolution which means there are more computations overall (e.g. in
the flow estimation, the warping, the residual estimation, and so on). This first approach is
accordingly referred to as a naive approach as it focuses on simplicity and does not take into
account any synergies that YUV 4:2:0 may have with an AI-based compression pipeline.
However, this approach is computationally slow and substantially increases run time as the
entire compression pipeline would be running in the higher YUV 4:4:4 space.
Another naive approach is to achieve matching tensor dimensions by introducing two or more
additional convolution operations to an input layer of the pipeline that operate with one stride
for the Y channel and a different stride for the UV channels. For example, performing a
convolution on the Y channel with a stride of 4 and a convolution on the UV channels with a
stride of 2 to produce Y, U and V channels of equal dimensions that can then be summed to
produce the overall uniformly shaped YUV tensor to feed into the rest of the neural networks
in the usual way.
This approach is illustrated in the following pseudocode:
Algorithm 3 Perform Convolutions on YUV 4:2:0 Tensor
function convolve_yuv420_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  ⊲ and the U and V channels have half the resolution of the Y channel
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Define the convolution kernels for Y and UV channels
  y_kernel ← get_convolution_kernel(stride=4)
  uv_kernel ← get_convolution_kernel(stride=2)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Perform convolution on the Y channel with stride 4
  convolved_y_channel ← convolve(y_channel, y_kernel, stride=4)
  ⊲ Perform convolution on the U and V channels with stride 2
  convolved_u_channel ← convolve(u_channel, uv_kernel, stride=2)
  convolved_v_channel ← convolve(v_channel, uv_kernel, stride=2)
  ⊲ Sum the convolved Y, U, and V channels
  convolved_tensor ← sum([convolved_y_channel, convolved_u_channel, convolved_v_channel], axis=-1)
  Return: convolved_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV). The shape of the tensor is identified and the respective convolution kernels for the Y channel and for the UV channels are identified. The Y, U and V channels are identified, the stride 4 convolution is performed on the Y channel, and the stride 2 convolution is performed on the U and V channels. The resulting convolved channels now have matching dimensions so can be summed to produce an overall uniformly shaped tensor which can be fed into the neural networks of the AI-compression pipeline in the usual way. This approach is an improvement over the first naive approach because the resulting tensor after the convolution does not have upsampled U and V channels, so the runtime of the overall pipeline is faster. However, the present inventors have realised that a third approach exists that is substantially faster than either of the above naive approaches, and synergistically also results in improved accuracy in terms of image reconstruction and performance in terms of compression rates, thus providing a way to handle the non-uniform shape of the YUV 4:2:0 data in a way that actually improves overall performance of the compression pipeline. This third approach is to emulate a single convolution operation on the non-uniform input by performing two different space to depth pixel unshuffle operations on the non-uniform input
tensor. Specifically, a space to depth pixel unshuffle operation with a block size of 4 is applied to the luma channel, and a space to depth pixel unshuffle operation with a block size of 2 is applied to each of the chroma channels. These are then stacked to produce the final tensor comprising 24 channels. That is, the space to depth pixel unshuffle operation with a block size of 4 takes the luma channel and applies a 4x4 block to the pixels of the luma channel which are distributed in the channel dimension to produce 16 channels. Similarly, the space to depth pixel unshuffle operation with a block size of 2 takes the U channel and applies a 2x2 block to the pixels of the U channel and distributes these in the channel dimension to produce 4 channels, and the same applies to the V channel which produces 4 more channels to give an overall 24 channels. This approach results in two advantages over the above-described naive approaches. Firstly, the present inventors have realised that, on many hardware devices, it is faster to perform operations on data that has a smaller spatial resolution and larger channel dimension than it is to perform the same operations on larger spatial resolutions but smaller channel dimensions. Accordingly, even though the input and outputs of the overall pipeline are the same, converting the input YUV data into a smaller spatial resolution format with more channel dimensions results in a significant speed up of the overall pipeline of the order of milliseconds. Secondly, the space to depth pixel unshuffle operations have the synergistic effect of increasing the receptive field of any subsequent convolution operations of the neural networks as they perform their respective convolutions on the input tensor (for example, in the flow encoder/decoder and/or residual encoder/decoder). That is, in a given convolution window,
the output can only be influenced by whatever information is in the input. If a convolution window is only a single n×n grid of pixels of a single channel, the receptive field is the single n×n grid of pixels of that channel. If we add additional channels to that convolution window, the output can now be influenced by the additional channel information. If we distribute pixels that originated from outside the n×n grid into one or more additional/new channels that are included in the convolution window, then we are allowing the output to be influenced by these new pixels from outside the n×n grid, thus indirectly increasing the receptive field of the convolution without needing to increase the spatial dimensions of the n×n grid. In turn, the increased receptive field allows the neural networks of the pipeline to better harness spatial correlations between pixels to thereby more efficiently learn what information is redundant and can be compressed away, thereby achieving an improved compression rate and improved image reconstruction accuracy for a given bit rate. Thus, in the presently described example, distributing the spatial dimension information into the channel dimensions using the applicable space to depth pixel unshuffle operations on the respective Y and UV channels solves the problem of applying convolutions throughout the pipeline to the non-uniformly shaped YUV 4:2:0 input tensor while, at the same time, the resulting 24-channel tensor achieves improved compression rates and image reconstruction accuracy in a way that is computationally efficient and achieves runtime speed ups of the order of milliseconds. The above approach is illustrated in more detail in the pseudocode below:
Algorithm 4 Pixel Unshuffle YUV 4:2:0 Tensor
function pixel_unshuffle_yuv420_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  ⊲ and the U and V channels have half the resolution of the Y channel
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Apply pixel unshuffle to the Y channel with block size 4
  unshuffled_y_tensor ← pixel_unshuffle(y_channel, block_size=4)
  ⊲ Apply pixel unshuffle to the U channel with block size 2
  unshuffled_u_tensor ← pixel_unshuffle(u_channel, block_size=2)
  ⊲ Apply pixel unshuffle to the V channel with block size 2
  unshuffled_v_tensor ← pixel_unshuffle(v_channel, block_size=2)
  ⊲ Stack the unshuffled Y, U, and V tensors along the channel dimension
  output_tensor ← stack([unshuffled_y_tensor, unshuffled_u_tensor, unshuffled_v_tensor], axis=-1)
  Return: output_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV). The shape of the tensor is identified and the Y, U and V channels extracted. A pixel unshuffle operation with block size 4 is applied to the Y channel, a pixel unshuffle operation with block size 2 is
applied to the U channel, and a pixel unshuffle operation with block size 2 is applied to the V channel. The resulting 16-channel Y tensor, 4-channel U tensor and 4-channel V tensor are stacked to produce a 24-channel tensor of uniform shape which can be fed into the flow and/or residual modules of the AI-based compression pipeline as before. It will be appreciated that the above described block sizes and dimensions for the reshaping operations (e.g. a 24-channel tensor of uniform shape and so on) are illustrative only, and other dimensions and uniform tensor shapes are also envisaged. Further, pixel unshuffle is only one reshaping operation that may be used. The above approach may be generalised to any tensor reshaping operations that achieve the same effect. Thus the above algorithm may be written in general pseudocode form as:
Algorithm 5 Generalized Reshape and Permute
function reshape_and_permute_tensor(input_tensor, spatial_dims, block_sizes)
  input_shape ← shape(input_tensor)
  num_spatial_dims ← length(spatial_dims)
  ⊲ Build the intermediate shape: each spatial dimension d_i is split into (d_i // b_i, b_i)
  output_shape ← []
  for i ← 1 to num_spatial_dims do
    dim_index ← spatial_dims[i]
    block_size ← block_sizes[i]
    output_shape.append(input_shape[dim_index] // block_size)
    output_shape.append(block_size)
  end for
  for i ← 1 to length(input_shape) do
    if i not in spatial_dims then
      output_shape.append(input_shape[i])
    end if
  end for
  reshaped_tensor ← reshape(input_tensor, output_shape)
  ⊲ Permute so the reduced spatial dimensions come first, followed by the block dimensions and then the remaining (e.g. channel) dimensions
  quotient_dims ← [2i − 1 for i ← 1 to num_spatial_dims]
  block_dims ← [2i for i ← 1 to num_spatial_dims]
  remaining_dims ← [i for i ← 1 to length(output_shape) if i not in quotient_dims and i not in block_dims]
  permute_dims ← quotient_dims + block_dims + remaining_dims
  permuted_tensor ← permute(reshaped_tensor, permute_dims)
  ⊲ Merge the block dimensions into the channel dimension
  final_shape ← [output_shape[2i − 1] for i ← 1 to num_spatial_dims] + [prod(block_sizes) × input_shape[length(input_shape)]]
  output_tensor ← reshape(permuted_tensor, final_shape)
  Return: output_tensor
end function
That is, in general form, the following input information is provided: an input tensor of any dimensions, with shape d1, d2, ..., dn, channels, where d1, d2, ..., dn are the spatial dimensions and channels is the channel dimension; a variable spatial_dims which is a list specifying the indices of the spatial dimensions to be reshaped; and a variable block_sizes which is a list specifying the block size for each spatial dimension. The shape of the input tensor is determined and the number of spatial dimensions to be reshaped is calculated. An empty list, output_shape, is initialised to store the shape of the intermediate reshaped tensor. We iterate over the spatial dimensions specified in spatial_dims and update the output_shape list by dividing each spatial dimension by its corresponding block size and appending the block size as a new dimension. The remaining dimensions (non-spatial dimensions) from the input tensor are appended to the output_shape list and we reshape the input tensor according to the output_shape to obtain the reshaped_tensor. We create a list permute_dims to specify the permutation order for the dimensions of the reshaped_tensor: we iterate over the spatial dimensions and append the corresponding reduced dimension indices to permute_dims, followed by the indices of the newly added block size dimensions. The remaining dimension indices (non-spatial dimensions) are appended to permute_dims and we permute the dimensions of the reshaped_tensor according to permute_dims to obtain the permuted_tensor. The final_shape is calculated by appending the reduced spatial dimensions from output_shape and the product of the block sizes multiplied by the number of channels, and the permuted_tensor is reshaped according to the final_shape to obtain the output_tensor. The output_tensor is then returned.
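As a concrete, hedged illustration of the specific YUV 4:2:0 case described above (block size 4 on the luma plane, block size 2 on each chroma plane, giving 24 channels), a PyTorch-style sketch might look as follows; the channel-first layout and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def yuv420_to_24_channels(y, u, v):
    # y: (b, 1, H, W) luma plane; u, v: (b, 1, H/2, W/2) chroma planes (4:2:0).
    # Space-to-depth with block size 4 on luma -> 16 channels at (H/4, W/4).
    y_blocks = F.pixel_unshuffle(y, downscale_factor=4)
    # Space-to-depth with block size 2 on each chroma plane -> 4 channels each,
    # also at (H/4, W/4) because the chroma planes start at half resolution.
    u_blocks = F.pixel_unshuffle(u, downscale_factor=2)
    v_blocks = F.pixel_unshuffle(v, downscale_factor=2)
    # Stack along the channel dimension to give a uniformly shaped 24-channel tensor.
    return torch.cat([y_blocks, u_blocks, v_blocks], dim=1)  # (b, 24, H/4, W/4)
```

The generalised reshape of Algorithm 5 covers this sketch as a special case, and the resulting 24-channel tensor can be fed into the flow and/or residual networks in the usual way.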
The above generalised approach accordingly provides a framework for distributing pixels from the spatial dimension into channel dimensions to increase the receptive window of the subsequent convolution operations of the AI-based compression pipeline in a way that is compute efficient. The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal. The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose
logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be
performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and
DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Claims
4. The method of claim 3, wherein the subset comprises the luma channel.
5. The method of claim 4, wherein the luma channel and the at least one chroma channel are
defined in a YUV colour space.
6. The method of claim 5, wherein the at least one chroma channel has a different resolution to
the luma channel.
7. The method of claim 6, wherein the YUV colour space comprises a YUV 4:2:0 or YUV
4:2:2 colour space.
8. The method of any of claims 1 to 7, wherein using the first image and the second image to
produce a latent representation of optical flow information comprises:
producing a representation of optical flow information at a plurality of resolutions and
using the representation of optical flow information at the plurality of resolutions to produce
said latent representation of optical flow information.
9. The method of claim 8, wherein a representation of optical flow information at a first
resolution of said plurality of resolutions is based on a representation of optical flow information
at a second resolution of said plurality of resolutions.
10. The method of any of claims 8 to 9, wherein using the first image and the second image to
produce a latent representation of optical flow information comprises:
using the representation of optical flow information at one of said plurality of resolutions
to warp a representation of the first image at a different resolution of said plurality of resolutions.
11. The method of claim 10, wherein a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions. 12. The method of any of claims 10 to 11, wherein the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels. 13. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels. 14. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation
of optical flow information being produced with a first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. 15. A data processing apparatus configured to perform the method of any of claims 1 to 14. 16. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 1 to 14. 17. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 1 to 14. 18. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow
information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. 19. The method of claim 18, wherein said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image. 20. The method of claim 18 or 19, comprising warping the data in the respective chroma channels using the downsampled representation of the optical flow information. 21. The method of any of claims 18 to 20, comprising warping the data in the respective luma channels using the downsampled representation of the optical flow information. 22. The method of any of claims 18 to 21, wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels
at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. 23. The method of claim 22, wherein a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions. 24. The method of claim 23, when dependent on claim 20 or 21, wherein the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data. 25. The method of any of claims 18 to 24, wherein the respective luma channels and chroma channels are defined in a YUV colour space. 26. The method of claim 25, wherein the respective chroma channels have a different resolution to the luma channel. 27. The method of claim 26, wherein the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space. 28. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a
difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.

29. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
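For illustration of the luma-driven motion handling recited in claims 18 to 21, the following is a minimal PyTorch sketch rather than the claimed implementation: `flow_net` is a hypothetical stand-in for the first neural network, and the halving of the flow field (both its resolution and its vector values) assumes a 4:2:0 layout in which the chroma planes have half the luma resolution in each dimension.

```python
import torch
import torch.nn.functional as F

def warp(plane, flow):
    """Backward-warp a single-channel plane (N,1,H,W) with a dense flow field (N,2,H,W) in pixels."""
    n, _, h, w = plane.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(plane.device)       # (2,H,W), x then y
    coords = base.unsqueeze(0) + flow                                   # (N,2,H,W)
    # Normalise to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N,H,W,2)
    return F.grid_sample(plane, grid, align_corners=True)

def luma_flow_and_chroma_warp(y_cur, y_ref, u_ref, v_ref, flow_net):
    # 1. Estimate a dense flow field from the luma planes only (full resolution).
    flow_y = flow_net(torch.cat((y_cur, y_ref), dim=1))                 # (N,2,H,W)
    # 2. Downsample the flow to the chroma resolution (half of luma for 4:2:0)
    #    and rescale the vectors, which are measured in luma pixels.
    flow_c = 0.5 * F.interpolate(flow_y, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)  # (N,2,H/2,W/2)
    # 3. Warp the reference chroma planes with the downsampled flow.
    u_warped = warp(u_ref, flow_c)
    v_warped = warp(v_ref, flow_c)
    return flow_y, flow_c, u_warped, v_warped
```

The warped chroma planes correspond to the step of claim 20; the same downsampled field could equally be applied to downsampled luma data, along the lines of claim 21.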
30. A data processing apparatus configured to perform the method of any of claims 18 to 29.

31. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 18 to 29.

32. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 18 to 29.

33. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
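As a schematic of the three-network arrangement recited in claim 33, the sketch below uses single toy convolutions in place of each network and a uniform pixel unshuffle as one possible way of distributing spatial information into the channel dimension; the layer shapes, the factor of 2 and the flow-scaling convention are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeNetworkSketch(nn.Module):
    """Toy stand-ins for the first (encoder), second (decoder) and third (synthesis) networks."""

    def __init__(self, unshuffle=2, img_ch=3, latent_ch=8):
        super().__init__()
        in_ch = img_ch * unshuffle ** 2                                      # channels after space-to-depth
        self.unshuffle = unshuffle
        self.flow_encoder = nn.Conv2d(2 * in_ch, latent_ch, 3, padding=1)    # "first" network
        self.flow_decoder = nn.Conv2d(latent_ch, 2, 3, padding=1)            # "second" network
        self.synthesis = nn.Conv2d(img_ch + 2, img_ch, 3, padding=1)         # "third" network

    def forward(self, img_cur, img_ref):
        # Distribute spatial information of each image into the channel dimension.
        x_cur = F.pixel_unshuffle(img_cur, self.unshuffle)
        x_ref = F.pixel_unshuffle(img_ref, self.unshuffle)
        # Encode a latent representation of the optical flow information (transmitted in practice).
        latent = self.flow_encoder(torch.cat((x_cur, x_ref), dim=1))
        # Decode an approximation of the flow and return it to image resolution.
        flow_hat = F.interpolate(self.flow_decoder(latent),
                                 scale_factor=self.unshuffle, mode="bilinear",
                                 align_corners=False) * self.unshuffle
        # Produce an approximation of the first image from the flow and the reference image.
        out = self.synthesis(torch.cat((img_ref, flow_hat), dim=1))
        return out, latent

# Example usage on random 64x64 images.
model = ThreeNetworkSketch()
out, latent = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```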
34. The method of claim 33, wherein the information comprises pixel values of said plurality of image channels.

35. The method of claim 33 or 34, wherein the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels.

36. The method of any of claims 33 to 35, wherein the transformed data comprises a plurality of image channels each having a same spatial resolution.

37. The method of claim 36, wherein said same spatial resolution is lower than the spatial resolution of said subset of channels.

38. The method of claim 36 or 37, wherein said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.

39. The method of any of claims 33 to 38, wherein the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data.

40. The method of claim 39, wherein the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data.

41. The method of any of claims 33 to 40, wherein said transforming comprises performing a pixel unshuffle operation on the data.
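One way to obtain the 24-channel arrangement of claims 39 and 41 from 4:2:0 YUV data is a per-plane pixel unshuffle with a larger block size for the full-resolution luma plane than for the half-resolution chroma planes (in keeping with the per-channel block sizes of claim 42). The sketch below assumes a 4×4 luma block and 2×2 chroma blocks, and does not cover 4:2:2 data.

```python
import torch
import torch.nn.functional as F

def yuv420_to_24ch(y, u, v):
    """Pack 4:2:0 YUV planes into a single 24-channel tensor at a common (quarter) resolution.

    y: (N, 1, H, W) luma plane; u, v: (N, 1, H/2, W/2) chroma planes.
    Block size 4 for luma and 2 for chroma is one possible choice that makes all
    resulting channels share the same H/4 x W/4 spatial resolution.
    """
    y16 = F.pixel_unshuffle(y, 4)             # (N, 16, H/4, W/4)
    u4 = F.pixel_unshuffle(u, 2)              # (N, 4,  H/4, W/4)
    v4 = F.pixel_unshuffle(v, 2)              # (N, 4,  H/4, W/4)
    return torch.cat((y16, u4, v4), dim=1)    # (N, 24, H/4, W/4)

# Example: a 64x64 4:2:0 frame becomes a 24-channel 16x16 tensor.
y = torch.rand(1, 1, 64, 64)
u = torch.rand(1, 1, 32, 32)
v = torch.rand(1, 1, 32, 32)
packed = yuv420_to_24ch(y, u, v)
assert packed.shape == (1, 24, 16, 16)
```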
42. The method of claim 41, wherein the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data.

43. The method of any of claims 33 to 40, wherein said transforming comprises performing a convolution operation on the data.

44. The method of claim 43, wherein the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data.

45. The method of any of claims 33 to 40, wherein said transforming comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution.

46. The method of any of claims 33 to 45, wherein one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation.

47. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system.

48. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving a latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
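The convolution-based alternative of claims 43 and 44 can be pictured as applying a strided convolution to each plane, with a per-plane stride chosen so that all output feature maps share one resolution; the channel counts and kernel sizes below are arbitrary toy values, and the upsampling route of claim 45 is not shown.

```python
import torch
import torch.nn as nn

class StridedPack(nn.Module):
    """Per-plane strided convolutions as an alternative packing of 4:2:0 YUV data."""

    def __init__(self, feat=16):
        super().__init__()
        # Stride 4 for the full-resolution luma plane and stride 2 for the half-resolution
        # chroma planes, so every output feature map has the same H/4 x W/4 resolution.
        self.conv_y = nn.Conv2d(1, feat, kernel_size=4, stride=4)
        self.conv_u = nn.Conv2d(1, feat // 4, kernel_size=2, stride=2)
        self.conv_v = nn.Conv2d(1, feat // 4, kernel_size=2, stride=2)

    def forward(self, y, u, v):
        return torch.cat((self.conv_y(y), self.conv_u(u), self.conv_v(v)), dim=1)

# Example on a 64x64 4:2:0 frame: output has 16 + 4 + 4 = 24 channels at 16x16.
pack = StridedPack()
out = pack(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32))
assert out.shape == (1, 24, 16, 16)
```

Unlike the pixel unshuffle, the strided convolutions have learnable weights, but both produce a channel-rich tensor at a common, lower spatial resolution.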
49. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.

50. A data processing apparatus configured to perform the method of any of claims 33 to 49.

51. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 33 to 49.

52. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 33 to 49.
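The receptive-field effect noted in claim 46 can be checked empirically: after a 4×4 pixel unshuffle, a single output position of a 3×3 convolution depends on a 12×12 pixel neighbourhood of the original plane rather than 3×3. The gradient-footprint sketch below (toy layer sizes, random weights) illustrates this point; it is not part of the claimed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A 3x3 convolution normally sees a 3x3 pixel neighbourhood per output position.
# After a 4x4 pixel unshuffle it sees 3x3 blocks of 4x4 pixels, i.e. 12x12 pixels.
x = torch.rand(1, 1, 64, 64, requires_grad=True)
conv = nn.Conv2d(16, 8, kernel_size=3, padding=1)

y = conv(F.pixel_unshuffle(x, 4))          # (1, 8, 16, 16)
y[0, 0, 8, 8].backward()                   # pick one interior output position

footprint = (x.grad[0, 0] != 0)            # which input pixels influenced that output
rows = footprint.any(dim=1).sum().item()
cols = footprint.any(dim=0).sum().item()
print(rows, cols)                          # 12 12 (barring exactly-zero random weights)
```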
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2403951.3A GB202403951D0 (en) | 2024-03-20 | 2024-03-20 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2403951.3 | 2024-03-20 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025196024A1 (en) | 2025-09-25 |
Family
ID=90825979
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2025/057322 (published as WO2025196024A1, pending) | Method and data processing system for lossy image or video encoding, transmission and decoding | 2024-03-20 | 2025-03-18 |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB202403951D0 (en) |
| WO (1) | WO2025196024A1 (en) |
- 2024-03-20: GB application GBGB2403951.3A filed, published as GB202403951D0 (status: Ceased)
- 2025-03-18: PCT application PCT/EP2025/057322 filed, published as WO2025196024A1 (status: Pending)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| US20220279183A1 (en) * | 2020-04-29 | 2022-09-01 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| US20220385907A1 (en) * | 2021-05-21 | 2022-12-01 | Qualcomm Incorporated | Implicit image and video compression using machine learning systems |
| US11936866B2 (en) * | 2022-04-25 | 2024-03-19 | Deep Render Ltd. | Method and data processing system for lossy image or video encoding, transmission and decoding |
Non-Patent Citations (4)
| Title |
|---|
| AGUSTSSON, E.; MINNEN, D.; JOHNSTON, N.; BALLÉ, J.; HWANG, S. J.; TODERICI, G.: "Scale-space flow for end-to-end optimized video compression", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pages 8503-8512 |
| BALLÉ, J. ET AL.: "Variational image compression with a scale hyperprior", arXiv preprint arXiv:1802.01436, 2018 |
| MENTZER, F.; AGUSTSSON, E.; BALLÉ, J.; MINNEN, D.; JOHNSTON, N.; TODERICI, G.: "Neural video compression using GANs for detail synthesis and propagation", Computer Vision - ECCV 2022: 17th European Conference, 23 October 2022 (2022-10-23), pages 562-578 |
| POURREZA, R.; COHEN, T.: "Extending neural P-frame codecs for B-frame coding", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pages 6680-6689 |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202403951D0 (en) | 2024-05-01 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| TWI834087B (en) | Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product | |
| TWI850806B (en) | Attention-based context modeling for image and video compression | |
| US12309422B2 (en) | Method for chroma subsampled formats handling in machine-learning-based picture coding | |
| EP4445609A1 (en) | Method of video coding by multi-modal processing | |
| JP7591338B2 (en) | Decoding using signaling of segmentation information | |
| US12113985B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| JP2024511587A (en) | Independent placement of auxiliary information in neural network-based picture processing | |
| JP2024513693A (en) | Configurable position of auxiliary information input to picture data processing neural network | |
| WO2024170794A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| CN116508320A (en) | Chroma Subsampling Format Processing Method in Image Decoding Based on Machine Learning | |
| WO2025082896A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding using image comparisons and machine learning | |
| WO2024140849A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2025103602A1 (en) | Method and apparatus for video compression using skip modes | |
| WO2025196024A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025061586A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025002424A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193710A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193708A1 (en) | Method, apparatus, and medium for visual data processing | |
| US12513297B2 (en) | Method for chroma subsampled formats handling in machine-learning-based picture coding | |
| WO2025082523A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193709A9 (en) | Method, apparatus, and medium for visual data processing | |
| WO2025168485A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2024246275A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025088034A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025162929A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25717569; Country of ref document: EP; Kind code of ref document: A1 |