WO2025196024A1 - Method and data processing system for lossy image or video encoding, transmission and decoding - Google Patents
Method and data processing system for lossy image or video encoding, transmission and decoding
- Publication number
- WO2025196024A1 (PCT/EP2025/057322)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- optical flow
- flow information
- channels
- representation
- Prior art date
- Legal status
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/537—Motion estimation other than block-based
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/59—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
Definitions
- This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
- the compression of image and video content can be lossless or lossy.
- in lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression.
- however, in lossless compression there is a limit to the reduction in data quantity that can be achieved.
- in lossy compression, some information is lost from the image or video during the compression process.
- Known compression techniques attempt to minimise the apparent loss of information by removing information that results in changes to the decompressed image or video that are not particularly noticeable to the human visual system.
- JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
- known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression.
- One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames.
- I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video.
- P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression.
- a P-frame represents the difference between itself and the closest preceding I- or P-frame.
- the process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
- B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process.
- By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, they do not propagate decoding errors since they are not used as references for other frames.
- Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content.
- AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
- An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference.
- An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
- the subset comprises a single channel of the plurality of channels of each of the first image and the second image.
- the plurality of channels of each of the first image and the second image comprise a luma channel and at least one chroma channel.
- the subset comprises the luma channel.
- the luma channel and the at least one chroma channel are defined in a YUV colour space.
- the at least one chroma channel has a different resolution to the luma channel.
- the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
- using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information.
- a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
- using the first image and the second image to produce a latent representation of optical flow information comprises: using the representation of optical flow information at one of said plurality of resolutions to warp a representation of the first image at a different resolution of said plurality of resolutions.
- a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions.
- the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels.
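- by way of illustration (see the sketch below), the flow latent can be produced from the luma channels alone. The following hypothetical PyTorch code is not taken from the application; the network, its layer sizes and all names are assumptions made for illustration only.

```python
# Hypothetical sketch: the latent representation of optical flow information is
# produced from the luma (Y) channels only, i.e. a subset of the image channels.
import torch
import torch.nn as nn

class LumaFlowEncoder(nn.Module):
    """Toy encoder producing a flow latent from the Y planes of two frames."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, y_prev: torch.Tensor, y_cur: torch.Tensor) -> torch.Tensor:
        # Only the luma subset of the channels is used to produce the flow latent.
        return self.net(torch.cat([y_prev, y_cur], dim=1))

prev = torch.rand(1, 3, 64, 64)   # previously reconstructed frame (channels: Y, U, V)
cur = torch.rand(1, 3, 64, 64)    # current frame to be encoded
flow_latent = LumaFlowEncoder()(prev[:, :1], cur[:, :1])  # Y channel only
print(flow_latent.shape)          # torch.Size([1, 8, 16, 16])
```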
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.
- said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image.
- the method comprises warping the data in the respective chroma channels using the downsampled representation of the optical flow information.
- the method comprises warping the data in the respective luma channels using the downsampled representation of the optical flow information.
- using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information.
- a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
- the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data.
- the respective luma channels and chroma channels are defined in a YUV colour space.
- the respective chroma channels have a different resolution to the luma channel.
- the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
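- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; sizes are assumptions) downsamples a flow field estimated at luma resolution to the chroma resolution of a YUV 4:2:0 frame so that the chroma planes can subsequently be warped with it.

```python
# Hypothetical sketch: flow estimated from the Y channels is downsampled to the
# chroma resolution of a YUV 4:2:0 frame. The displacement values are halved
# because one chroma pixel spans two luma pixels.
import torch
import torch.nn.functional as F

flow_luma = torch.randn(1, 2, 64, 64)                       # flow at luma (H x W) resolution
flow_chroma = 0.5 * F.avg_pool2d(flow_luma, kernel_size=2)  # flow at chroma (H/2 x W/2) resolution
print(flow_chroma.shape)                                    # torch.Size([1, 2, 32, 32])
```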
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
- a method for lossy image or video encoding and transmission, and decoding comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- the information comprises pixel values on said plurality of image channels.
- the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels.
- the transformed data comprises a plurality of image channels each having a same spatial resolution.
- said same spatial resolution is lower than the spatial resolution of said subset of channels.
- said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.
- the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data.
- the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data.
- the transformation comprises performing a pixel unshuffle operation on the data.
- the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data.
- the transformation comprises performing a convolution operation on the data.
- the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data.
- the transformation comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution.
- one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation.
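- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows one way such a transformation could be realised for 4:2:0 YUV data: a pixel unshuffle with block size 4 on the luma plane and block size 2 on each chroma plane yields 24 channels that all share the same spatial resolution.

```python
# Hypothetical sketch of the space-to-depth ("pixel unshuffle") transformation:
# a YUV 4:2:0 frame (Y at H x W, U and V at H/2 x W/2) is rearranged into 24
# channels that all have the same H/4 x W/4 spatial resolution.
import torch
import torch.nn.functional as F

h, w = 64, 64
y = torch.rand(1, 1, h, w)               # luma plane
u = torch.rand(1, 1, h // 2, w // 2)     # chroma planes at half resolution
v = torch.rand(1, 1, h // 2, w // 2)

# A larger block size is used for the luma plane than for the chroma planes so
# that every resulting channel ends up at the same spatial resolution.
y_blocks = F.pixel_unshuffle(y, downscale_factor=4)   # (1, 16, H/4, W/4)
u_blocks = F.pixel_unshuffle(u, downscale_factor=2)   # (1, 4,  H/4, W/4)
v_blocks = F.pixel_unshuffle(v, downscale_factor=2)   # (1, 4,  H/4, W/4)

packed = torch.cat([y_blocks, u_blocks, v_blocks], dim=1)
print(packed.shape)                      # torch.Size([1, 24, 16, 16])
```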
- a method for lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving the latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
- Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
- Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
- Figure 4 illustrates an example of a video compression, transmission and decompression system.
- Figure 5 illustrates an example of a video compression, transmission and decompression system.
- Figure 6 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Figure 7a illustrates an example of a video compression, transmission and decompression system.
- Figure 7b illustrates an example of a video compression, transmission and decompression system.
- Figure 8 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Figure 9 illustrates an example of a flow module of a video compression, transmission and decompression system.
- Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information.
- Image and video information is an example of information that may be compressed.
- the file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate.
- compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file.
- Image and video files containing image and video data are common targets for compression.
- the input image may be represented as x.
- the data representing the image may be stored in a tensor of dimensions H x W x C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image.
- Each H x W data point of the image represents a pixel value of the image at the corresponding location.
- Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device.
- an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively.
- the image information is stored in the RGB colour space, which may also be referred to as a model or a format.
- Other examples of colour spaces or formats include the CMYK and the YCbCr colour models.
- the channels of an image file are not limited to storing colour information and other information may be represented in the channels.
- a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video.
- Each image making up a video may be referred to as a frame of the video.
- the output image may differ from the input image and may be represented by x̂.
- the difference between the input image and the output image may be referred to as distortion or a difference in image quality.
- the distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way.
- An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art.
- the distortion function may comprise a trained neural network.
- the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner.
- a relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
- a neural network is an operation that can be performed on an input to produce an output.
- a neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
- Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer.
- Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer.
- a node may receive an input from one or more nodes of the previous layer.
- the one or more operations may include a convolution, a weight, a bias and an activation function.
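- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows a single layer applying a weight matrix, a bias and an activation function to its input.

```python
# Hypothetical sketch of one layer: a weight matrix, a bias vector and an
# activation function (here ReLU) applied to the layer input.
import torch

def layer(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # weight and bias are learnable parameters; ReLU introduces a non-linearity.
    return torch.relu(x @ weight.T + bias)

x = torch.rand(4, 8)         # 4 inputs, each with 8 features
weight = torch.randn(16, 8)  # weight matrix: 16 nodes, each receiving 8 inputs
bias = torch.zeros(16)
out = layer(x, weight, bias) # output of shape (4, 16), passed to the next layer
```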
- Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
- Each of the one or more operations is defined by one or more parameters that are associated with each operation.
- the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer.
- each of the values in the weight matrix is a parameter of the neural network.
- the convolution may be defined by a convolution matrix, also known as a kernel.
- one or more of the values in the convolution matrix may be a parameter of the neural network.
- the activation function may also be defined by values which may be parameters of the neural network.
- the parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network.
- the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
- a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known.
- the initial parameters of the neural network are randomized and the first training input is provided to the network.
- the output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced.
- This process is then repeated for a plurality of training inputs to train the network.
- the difference between the output of the network and the expected output may be defined by a loss function.
- the result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function.
- Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the parameters θ of the network.
- a plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, in which simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression.
- an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function.
- This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
- training or “learning” as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
- the training process may comprise multiple epochs.
- An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm.
- the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
- the training data may be divided into smaller subsets known as batches.
- the size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources.
- a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
- the learnable parameters are updated by a specified amount each time, determined by the learning rate.
- the learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process.
- a smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution.
- a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
- the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise.
- the point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend, for example, on the number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described.
- the loss function may be defined by the rate distortion equation.
- λ may be referred to as a Lagrange multiplier.
- the Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
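- by way of illustration (a general form consistent with the description above, not a specific formula taken from the application), such a rate-distortion loss may be written as L = R + λ·D, where R is the rate term (an estimate of the number of bits required to transmit the quantised latent), D is the distortion term (for example an MSE or neural-network-based measure of the difference between the input and output images), and λ is the Lagrange multiplier weighting distortion against rate.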
- a training set of input images may be used.
- An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/ xypan/research/snr/Kodak.html).
- An example training set of input images is the IMAX image set.
- An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download).
- An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
- An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided.
- the input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder.
- the encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5.
- the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent.
- the quantisation process transforms the continuous latent representation into a discrete quantized latent.
- An example of a quantization process is a rounding function.
- the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130.
- the entropy encoding process may be for example, range or arithmetic encoding.
- the bitstream 130 may be transmitted across a communication network.
- the bitstream is entropy decoded in an entropy decoding process 160.
- the quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent.
- the trained neural network 120 produces an output based on the quantized latent.
- the output may be the output image of the AI based compression process 100.
- the encoder-decoder system may be referred to as an autoencoder.
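- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; layer sizes and names are assumptions) shows the encoder, quantisation and decoder stages of a pipeline such as that of Figure 1, with entropy encoding and decoding omitted for brevity.

```python
# Hypothetical sketch of the Figure 1 pipeline: encoder -> quantisation by
# rounding -> decoder. Entropy encoding/decoding of the quantised latent is
# omitted; it would sit between the rounding and the decoder.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x: torch.Tensor):
        y = self.encoder(x)         # latent representation
        y_hat = torch.round(y)      # quantisation (rounding) at inference time
        x_hat = self.decoder(y_hat) # reconstructed output image
        return y_hat, x_hat

model = Autoencoder()
image = torch.rand(1, 3, 64, 64)
latent, reconstruction = model(image)
print(latent.shape, reconstruction.shape)  # (1, 16, 16, 16) and (1, 3, 64, 64)
```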
- Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution.
- in some cases, this learning may comprise learning optimal location and scale parameters of Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to minimise the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
- although a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back-propagating a gradient of a loss function through a rounding function is therefore challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches.
- a noise quantisation model is differentiable everywhere and accordingly does allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system.
- a straight-through estimator (STE) quantisation model or other quantisation models may be used.
- different quantisation models may be used during evaluation of different terms of the loss function.
- noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
- end-to-end learning of the quantisation process achieves a similar effect.
- learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving optimal rate and distortion trade off.
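- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) shows the two training-time quantisation surrogates discussed above: additive uniform noise for evaluating the rate term and a straight-through estimator (STE) for evaluating the distortion term.

```python
# Hypothetical sketch of training-time quantisation surrogates.
import torch

def noise_quantise(y: torch.Tensor) -> torch.Tensor:
    # Additive uniform noise in [-0.5, 0.5) is differentiable everywhere.
    return y + (torch.rand_like(y) - 0.5)

def ste_quantise(y: torch.Tensor) -> torch.Tensor:
    # Rounding in the forward pass; gradients pass straight through unchanged.
    return y + (torch.round(y) - y).detach()

y = torch.randn(2, 8, 16, 16, requires_grad=True)
y_noise = noise_quantise(y)  # used when evaluating the rate/entropy loss term
y_ste = ste_quantise(y)      # used when evaluating the distortion loss term
```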
- the system described above may be distributed across multiple locations and/or devices.
- the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server.
- the decoder 120 may be located on a separate device which may be referred to as a recipient device.
- the system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
- the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process.
- the hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder and a trained neural network 125 acting as a hyper-decoder.
- An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above.
- the neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110.
- the hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation.
- the hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent.
- the quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above.
- the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135.
- the bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent.
- the quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder.
- the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100.
- the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
- as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150.
- Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100.
- At least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150, 155 is performed.
- the residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent.
- the residual values may also be normalised.
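- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application; the mean and scale values are placeholders standing in for hyper-decoder outputs) shows a latent being converted to a normalised residual before quantisation, and the inverse transformation that would be applied on the decode side.

```python
# Hypothetical sketch: the latent is converted to a normalised residual using a
# mean and scale (as would be predicted by a hyper-decoder), quantised, and the
# inverse transformation recovers the quantised latent on the decode side.
import torch

latent = torch.randn(1, 16, 8, 8)
mean = torch.zeros_like(latent)          # placeholder for a hyper-decoder output
scale = torch.full_like(latent, 0.7)     # placeholder for a hyper-decoder output

residual = (latent - mean) / scale       # normalised residual
quantised = torch.round(residual)        # quantised residual to be entropy encoded

recovered = quantised * scale + mean     # inverse transform on the decode side
```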
- a training set of input images may be used as described above.
- the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
- the training process may further include a generative adversarial network (GAN).
- an additional neural network acting as a discriminator is included in the system.
- the discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake.
- the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input.
- a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
- the output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
- after training, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination.
- Hallucination is the process of adding information in the output image 6 that was not present in the input image 5.
- hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120.
- the hallucination performed may be based on information in the quantized latent received by decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
- the frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process.
- frames which are decoded using no information from other frames may be referred to as I-frames.
- Frames which are decoded using information from past frames may be referred to as P-frames.
- Frames which are decoded using information from past frames and future frames may be referred to as B-frames.
- Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
- the images represented by each frame of a video may be related.
- a number of frames in a video may show the same scene.
- a number of different parts of the scene may be shown in more than one of the frames.
- objects or people in a scene may be shown in more than one of the frames.
- the background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame.
- the transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation.
- the flow may be represented by a vector.
- One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
- An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3.
- the process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
- I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2.
- an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204.
- the bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame.
- the decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline.
- P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂t-1 from an earlier frame t-1.
- the previously reconstructed image x̂t-1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
- previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames.
- both the image being compressed xt and the previously reconstructed image from an earlier frame x̂t-1 are passed into a flow module part 206 of the pipeline.
- the flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂t-1 and xt, which is indicative of inter-frame movement of pixels or pixel groups between x̂t-1 and xt.
- the latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208.
- the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map.
- the reconstructed flow map is applied to the previously reconstructed image x̂t-1 to generate a warped image.
- any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used.
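- by way of illustration, the following hypothetical PyTorch sketch (not taken from the application) implements a simple bi-linear warp in which a flow map given in pixel units displaces the sampling grid applied to the previously reconstructed frame.

```python
# Hypothetical sketch of bi-linear warping using torch.nn.functional.grid_sample.
import torch
import torch.nn.functional as F

def bilinear_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (N, C, H, W) frame by a (N, 2, H, W) flow given in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Displace the base sampling grid by the flow and normalise to [-1, 1].
    gx = 2.0 * (xs.float().unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys.float().unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                  # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

previous = torch.rand(1, 3, 64, 64)         # previously reconstructed frame
flow_map = torch.randn(1, 2, 64, 64)        # reconstructed flow map in pixels
warped = bilinear_warp(previous, flow_map)  # prediction of the current frame
```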
- the warped image is a prediction of how the previously reconstructed image x̂t-1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs xt and x̂t-1.
- the reconstructed flow map and corresponding warped image may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
- both the image being compressed xt and the warped image are passed into a residual module part 210 of the pipeline.
- the residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image xt and the warped image.
- the latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted.
- the bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map from the decoded latent representation.
- a residual map may first be pre-calculated between xt and the warped image and the pre-calculated residual map may be passed into an autoencoder for compression only.
- the residual map is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂t which is a reconstruction of image xt and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress xt+1, and so on until an entire video stream or chunk of a video stream has been processed.
- the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream by removing the connection between the warped image and the output of the residual block 210, thereby eliminating any direct combination step with the warped previously decoded image to speed up inference.
- the flow information is intuitively understood to be indirectly captured within the residual information, which the residual decoder is able to learn to use to directly reconstruct the output image x̂t.
- the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream in combination with some representation of flow injected into one or more layers of the residual decoder.
- the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image.
- any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2.
- the bitstream may also contain hyper and hyper-hyper parameters, their latent quantised, entropy encoded latent representations and so on, of those networks as applicable.
- the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689).
- the above-described flow and residual based approach is highly effective at reducing the amount of data that is to be transmitted because, as long as at least one reconstructed frame (e.g. an I-frame) is available at the decode side, subsequent frames can be reconstructed from the comparatively compact flow and residual information rather than from fully encoded frames.
- FIG. 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400.
- the system 400 comprises a first device 401 and a second device 402.
- the first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices.
- the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively.
- the CPU of first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU.
- the first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene.
- the video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames.
- the blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401, for example via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
- the second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
- the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means the cumulative latency of the encode and decode sides must be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
- execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods.
- Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time.
- efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
- RGB is an additive color model that represents colors as a combination of red, green, and blue light.
- Each color channel is typically represented by an 8-bit value, ranging from 0 to 255. The combination of these three channels allows for the representation of a wide range of colors.
- a color is represented as a triplet (R, G, B), where R, G, and B are the intensity values of the red, green, and blue components, respectively.
- In contrast, YUV separates the color information into two components: luminance (Y) and chrominance (U and V). The YUV color space takes advantage of the human visual system's differing sensitivity to brightness and color information. The luminance (or luma) component (Y) represents the brightness or intensity of a pixel.
- Y = w1·R + w2·G + w3·B, where w1, w2, w3 are weights assigned to each color channel based on the human eye's sensitivity to different wavelengths of light.
- the green channel is given the highest weight because the human eye is most sensitive to green light.
- the chrominance components (U and V) represent the color information in the YUV color space.
- the BT.601 standard, also known as CCIR 601, provides recommendations for standard-definition digital video. It specifies the digitization of the YUV color space components using an 8-bit representation, essential for converting analog video signals into a digital format for television broadcasts, DVDs, and early streaming video formats.
- the Y, U, and V components are represented by 8-bit values, allowing for 256 levels of intensity ranging from 0 to 255.
- This quantization process converts the continuous range of YUV values into discrete levels suitable for digital storage and processing.
- the BT.601 standard defines the conversion formulas from RGB to YUV in a digital context, taking into account digital system characteristics and the need for efficient color information encoding.
- Y = 16 + (65.481·R + 128.553·G + 24.966·B), U = 128 + (-37.797·R - 74.203·G + 112.0·B), V = 128 + (112.0·R - 93.786·G - 18.214·B)
- R, G and B are the digital values of the red, green, and blue components, normalized to the range of 0 to 1 (e.g., a value of 255 in an 8-bit system is represented as 1.0).
- the constants include offsets (16 for Y and 128 for U and V) to center the chrominance components, and scaling factors to adjust the amplitude of the signals within the 8-bit range. More generally, in YUV color spaces, the U component represents the difference between the blue channel and the luminance, while the V component represents the difference between the red channel and the luminance. By separating the luminance and chrominance information, YUV allows for more effective compression techniques because the human visual system is more sensitive to changes in brightness than changes in color, so the luminance component can be compressed with less loss of perceived quality compared to the chrominance components. YUV color space also allows for subsampling of the chrominance components to further reduce the amount of data required for video transmission.
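- By way of illustration only, the sketch below applies the 8-bit BT.601 digital conversion described above using NumPy; the function name and the rounding/clipping behaviour are illustrative assumptions rather than part of the present disclosure.

```python
import numpy as np

def rgb_to_yuv_bt601(rgb):
    """Convert an (H, W, 3) array of 8-bit RGB values to 8-bit BT.601 YUV.

    The RGB values are first normalised to [0, 1]; the digital BT.601 offsets
    (16 for Y, 128 for U and V) and scaling factors are then applied.
    """
    rgb = rgb.astype(np.float64) / 255.0          # normalise to [0, 1]
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 16.0 + (65.481 * r + 128.553 * g + 24.966 * b)
    u = 128.0 + (-37.797 * r - 74.203 * g + 112.0 * b)
    v = 128.0 + (112.0 * r - 93.786 * g - 18.214 * b)
    yuv = np.stack([y, u, v], axis=-1)
    return np.clip(np.round(yuv), 0, 255).astype(np.uint8)

# Example: a single pure-red pixel.
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
print(rgb_to_yuv_bt601(pixel))  # approximately [[[ 81  90 240]]]
```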
- Subsampling involves reducing the spatial resolution of the U and V components while preserving the full resolution of the luminance component.
- Known subsampling schemes include: 4:4:4 - no subsampling; the U and V components have the same resolution as the luminance component.
- 4:2:2 - the U and V components are subsampled horizontally by a factor of 2.
- 4:2:0 - the U and V components are subsampled both horizontally and vertically by a factor of 2.
- Subsampling reduces the amount of color information without significantly impacting the perceived visual quality, as the human eye is less sensitive to color details than brightness details.
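- As a rough illustration of the data reduction offered by chroma subsampling, the short calculation below counts 8-bit samples per frame for a 1920x1080 image under the three schemes listed above; the helper function is illustrative only.

```python
def samples_per_frame(width, height, scheme):
    """Return the number of 8-bit samples per frame for a YUV subsampling scheme."""
    luma = width * height
    if scheme == "4:4:4":
        chroma = 2 * width * height                 # U and V at full resolution
    elif scheme == "4:2:2":
        chroma = 2 * (width // 2) * height          # U and V halved horizontally
    elif scheme == "4:2:0":
        chroma = 2 * (width // 2) * (height // 2)   # U and V halved in both dimensions
    else:
        raise ValueError(scheme)
    return luma + chroma

for s in ("4:4:4", "4:2:2", "4:2:0"):
    print(s, samples_per_frame(1920, 1080, s))
# 4:4:4 -> 6,220,800 samples; 4:2:2 -> 4,147,200; 4:2:0 -> 3,110,400
```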
- FIG. 5 shows flow encoder/decoder networks and residual encoder/decoder networks of an AI based compression pipeline.
- the flow encoder neural network takes a current image x_t and a previous image x̂_t-1, encodes these into a flow latent representation which is optionally quantised and entropy encoded and transmitted as a bit stream.
- the bitstream is received and entropy decoded into the flow latent representation that the flow decoder neural network uses as input to produce a representation of flow f_t, e.g. a flow map.
- the flow map f_t is applied to a previously decoded image x̂_t-1 to generate a warped version of that previously decoded image, x̂_t-1,w.
- the warped version of the previously decoded image x̂_t-1,w is then fed into the residual encoder network, together with the current image x_t, to produce a residual latent representation which is optionally quantised and entropy encoded and transmitted as a bit stream.
- the bitstream is received and entropy decoded back into the residual latent representation and used by the residual decoder neural network, in combination with information from the warped version of the previously decoded image x̂_t-1,w, to produce the reconstructed image x̂_t.
- the information associated with the warped previously decoded image x̂_t-1,w is optionally first processed by a module referred to hereinafter as a composition adapter, for example to downsample and/or pad it, before it is fed into the residual decoder together with the entropy decoded residual latent representation to produce the final reconstructed image x̂_t.
- This process may then be repeated for t+1 and so on to encode, transmit, and decode a sequence of frames.
- An example implementation of the flow encoder neural network 207 is shown in Figure 6.
- Figure 6 illustrates an example of a flow module, in this case a network 600, configured to estimate information indicative of a difference between an image x_t-1 and an image x_t, e.g. flow information.
- the network 600 comprises a set of layers 601a, 601b respectively for an image x_t-1 and an image x_t from respective times or positions t-1 and t of a sequence of image frames.
- the set of layers 601a, 601b may define one or more convolution operations and/or nonlinear activations in the layers to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution.
- This may comprise performing h/2 × w/2 downsampling in a first layer, h/4 × w/4 downsampling in a second layer, h/8 × w/8 downsampling in a third layer, h/16 × w/16 downsampling in a fourth layer, h/32 × w/32 downsampling in a fifth layer, h/64 × w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other levels are also envisaged.
- a first cost volume 602 is calculated at the coarsest level between the pixels of the first image x_t-1 and the corresponding pixels in the second image x_t.
- Cost volumes define the matching cost of matching the pixels in the initial image with the pixels in the later image. That is, the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of closeness such as a vector dot product, a cosine similarity, a mean absolute difference, or some other measure of closeness.
- a first flow 603 can be estimated from the first cost volume 602. This may be achieved using, for example, a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person.
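- The following is a minimal, illustrative sketch of a local dot-product cost volume between two feature maps at one pyramid level, assuming a PyTorch implementation; the function name, the search-window size and the use of a dot product as the measure of closeness are assumptions for illustration, not a definitive implementation of the flow module described above.

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat_prev, feat_cur, max_disp=4):
    """Dot-product cost volume between two feature maps of shape (B, C, H, W).

    For every pixel in feat_prev, similarity is computed against pixels of
    feat_cur within a (2*max_disp+1)^2 search window, giving an output of
    shape (B, (2*max_disp+1)**2, H, W).
    """
    b, c, h, w = feat_prev.shape
    pad = F.pad(feat_cur, [max_disp] * 4)            # pad W and H for the search window
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat_prev * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(costs, dim=1)

# Example usage with random feature maps at one pyramid level.
cv = local_cost_volume(torch.randn(1, 32, 16, 16), torch.randn(1, 32, 16, 16))
print(cv.shape)  # torch.Size([1, 81, 16, 16])
```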
- weights and/or biases of any activation layers in network 600 are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline.
- the trainable nature of these parameters provides the network 600 with flexibility to produce feature maps at each level of spatial resolution (i.e. the pyramid feature maps) and/or flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on).
- it allows the network 600 to produce feature maps that, when cost volumes and/or flow are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline, as is shown in the illustrative example of Figure 7a, described later.
- the flow of the previous level or levels of coarseness or resolution may be used to warp 608, 609, the feature maps before the cost volume is calculated.
- This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t - 1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details.
- the inventors have realized that removing warping entirely or in some levels of coarseness or resolution can substantially reduce runtime of flow calculation while maintaining good levels of flow accuracy.
- the flow estimation output may be upsampled 610, 611 first to match the coarseness resolution of the feature map to which the flow is being applied in the warping process.
- the outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
- Although referred to herein as a cost volume, the information indicative of differences between the respective inputs need not be a strict cost volume in the mathematical sense, but may be any representation of this information.
- For example, a compressively calculated cost volume may be used, applying the principles of compressive sensing to estimate an approximate cost volume by sampling only a small number of pixel differences compared to performing a complete pixel-wise cost volume calculation.
- This compressively calculated cost volume approach may be applied to all embodiments described herein.
- running a flow-based compression pipeline once training has been completed relies on estimation of flow, whether by handcrafted algorithm or through some trained network.
- the output estimated flow itself may be compressed and transmitted in a bitstream, which is typically done by a dedicated flow encoder network that encodes the flow information into a latent representation distributed according to a distribution that can be efficiently entropy encoded.
- dedicated flow encoders increase run time and partly contribute to preventing learned compression codecs from running in real time or near real time.
- a previously reconstructed image and a new, to-be-encoded image may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames.
- both the current image being compressed x_t and the previously reconstructed image from an earlier frame x̂_t-1 are passed into a flow module 207 that typically has two parts: a flow estimation part, and a dedicated encoder part that encodes the estimated flows into a latent representation that can be efficiently entropy encoded.
- the dedicated flow encoder is effectively a specialized network dedicated to producing latent representations of flow that are distributed as close to an optimally entropy encodable distribution as possible. In Figure 3, these two parts are shown as a single component.
- the flow estimation part may comprise, for example, the flow module 600 of Figure 6.
- the flow module 600 produces its cost volumes and flows, then passes one or more of these into the second part: the dedicated flow encoder network that is trained to produce the latent representation of flow that can be efficiently entropy encoded before being sent in a bitstream to the decoder.
- This approach is slow because we first have to calculate the cost volumes and flows before we can encode them into a latent representation, which itself is a slow process. We then do the entropy encoding of the latent representation of the cost volume(s) and/or flow(s) into a bitstream which is finally transmitted.
- This multi-step approach increases run time. Instead, in order to reduce compute and runtime overhead, the present disclosure also envisages omitting the dedicated flow encoder.
- the output(s) of the flow module 600 are directly entropy encoded and fed into the bitstream without first encoding them into a latent representation.
- Since the process of encoding flow module output(s) into a latent representation with a flow compression encoder is computationally expensive, removing this component entirely from the flow compression module 207 results in a significant decrease in runtime, thereby contributing to the goal of being able to run the pipeline in inference in real time or near real time.
- FIG. 7a illustratively shows two flow compression modules 700a and 700b which may be used as flow module 207 in Figure 3. The same reference numerals are used for like-components.
- two input images x_t and x̂_t-1 are passed 701a, 701b into a flow module such as flow module 600 of Figure 6, which produces pyramid feature maps 702 of different levels of coarseness, and corresponding cost volumes and/or flows 703.
- the final flow estimation is then passed to a dedicated flow compression encoder 704 which encodes it to produce a latent representation of the final flow estimation.
- the output may thus be a latent representation of cost volumes and/or flows at one or more of H×W, H/2×W/2, H/4×W/4, H/8×W/8, H/16×W/16, H/32×W/32, H/64×W/64, or some other resolution. This is finally entropy encoded into a bitstream 705, and transmitted.
- the bitstream 705 is entropy decoded and the decoded bitstream is passed to a decoder 606 which reconstructs the cost volumes and/or flows which may be used in a flow-based approach as described above in the general concepts section.
- the approach of the flow compression module 700a with a dedicated flow compression encoder 704 is slow and computationally expensive.
- the present inventors have realized that omitting the dedicated flow compression encoder 704 entirely and instead directly entropy encoding one or more outputs of the flow module 600 into the bitstream 705 without first passing it through the dedicated flow encoder 704 results in a substantial speed up at run time.
- the flow module 600 learns during training to compensate by outputting cost volumes and/or flows that are similarly distributed as those that would be output by a dedicated flow compression encoder 704.
- This can be understood as the flow module 600 (that is, the trainable networks within it such as the flow extractor network and/or activation layers in the convolution layers) effectively being forced to mimic the dedicated flow compression encoder 704 during training when the loss is being minimized because the system can no longer rely on the (now removed) dedicated flow compression encoder 704 to perform that task.
- the training of the flow module 600 may either be performed in an end-to-end manner together with the rest of the compression pipeline or alternatively in a student-teacher approach where the network of the flow module 600 is the student component, and a known optical flow model and pre-trained flow compression network is the teacher component. Additionally or alternatively, the training of the flow module 600 may be performed separately using data on which the groundtruth flow is known. For example, by using 3D animation video data whose groundtruth flow is known a priori through the animation program used to generate it, or using auto flow generation methods.
- FIG. 7b illustrates the introduction of the flow compression modules 700b into a flow-residual compression pipeline, such as that of Figure 3, by illustratively showing the encoding and decoding of p-frames of an image stream, whereby a groundtruth image and its previously encoded and decoded reconstruction at t-1 are available.
- the output of the flow module 702 is natively in the distribution of a latent representation and can be immediately transmitted to the decoder, without needing a standalone dedicated flow compression encoder.
- the warped reconstructed, previously decoded image from t-1 may be passed directly to the residual decoder, in addition to whatever is output by the residual encoder 211. This provides the residual decoder with the additional context of the warped reconstructed, previously decoded image.
- the present inventors have realised that, when moving out of RGB space and into YUV space, the flow information is largely contained in the luma channel. Accordingly, flow can be estimated using the flow module operating only on the luma channel. Cutting the number of channels that get fed through the pyramid layers from three channels to a single channel results in a substantial speed up with very little loss in overall performance in terms of image reconstruction accuracy (e.g. measured by distortion such as an MSE score) and compression performance (e.g. measured in bits per pixel). Accordingly, the present concept 1 is directed to replacing a multi-channel input tensor to the flow modules shown in Figures 6 to 7b with a single-channel tensor.
- FIG. 8 illustratively shows a modified flow module 800 corresponding to that of Figure 6; like-numbered references are used for like components. Additionally, a pre-processing module 801 is introduced before the flow module 800.
- the pre-processing module is configured to select the luma channel from a YUV input to feed into the flow module 800, after which the flow module 800 operates as described above in relation to Figure 6.
- the pseudocode provided below illustratively shows an exemplary operation of the pre-processing module, configured to select a luma channel from an input in YUV format:
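- The referenced pseudocode is not reproduced in this text; a minimal sketch of the channel-selection step, assuming a planar (B, 3, H, W) PyTorch tensor whose first channel is the luma (Y) plane, might look as follows (function and variable names are illustrative). For YUV 4:2:0 data stored as separate planes, the same selection simply takes the Y plane.

```python
import torch

def select_luma(yuv):
    """Select the luma (Y) channel from a planar YUV tensor of shape (B, 3, H, W).

    Only this single channel is fed into the flow module, reducing the number
    of channels processed by the pyramid layers from three to one.
    """
    return yuv[:, 0:1, :, :]   # keep the channel dimension so the result is (B, 1, H, W)

frame = torch.rand(1, 3, 64, 64)     # illustrative YUV 4:4:4 frame
luma = select_luma(frame)
print(luma.shape)                    # torch.Size([1, 1, 64, 64])
```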
- Concept 2 Downsampled YUV warping
- warping is used to exploit temporal redundancy between consecutive frames.
- the goal of warping is to estimate and compensate for the motion of objects or regions within a video sequence, allowing for more efficient compression by reducing the amount of information that is to be encoded.
- a frame is typically divided into smaller blocks, such as macroblocks or coding units, which are then processed independently. Warping is performed on these blocks to find the best matching block in a reference frame, usually a preceding or succeeding frame, and to generate a motion vector that describes the displacement between the current block and its best match.
- the warping process can be described mathematically using a motion model.
- One commonly used motion model is the affine motion model, which assumes that the motion of a block can be represented by a linear transformation.
- the affine motion model is defined by a 2x3 matrix: [ a11 a12 tx ; a21 a22 ty ], where a11, a12, a21, a22 represent the rotation, scaling and shear parameters, and tx, ty represent the translation parameters.
- A block matching criterion such as the sum of absolute differences (SAD) may be used to find the best matching block: SAD(dx, dy) = Σ_{x,y} | C(x, y) − R(x + dx, y + dy) |, where C(x, y) represents the pixel values of the current block, R(x + dx, y + dy) represents the pixel values of the candidate block in the reference frame shifted by (dx, dy), and the sum runs over the N×N pixels of the block, N being the block size.
- the current block can be reconstructed by copying the pixels from the reference frame at the displaced location indicated by the motion vector. This process is known as motion compensation.
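- As an illustration of the traditional block-matching motion estimation described above, the sketch below performs an exhaustive SAD search over a small window; the block size, search range and variable names are illustrative assumptions.

```python
import numpy as np

def best_motion_vector(cur, ref, x0, y0, n=16, search=8):
    """Exhaustive SAD block matching for the n x n block at (x0, y0) in `cur`.

    Returns the (dx, dy) displacement into `ref` with the lowest sum of
    absolute differences within a +/- `search` pixel window.
    """
    block = cur[y0:y0 + n, x0:x0 + n].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ys, xs = y0 + dy, x0 + dx
            if ys < 0 or xs < 0 or ys + n > ref.shape[0] or xs + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[ys:ys + n, xs:xs + n].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dx, dy)
    return best_mv

# Example: a frame whose content is displaced so the best match lies at (dx, dy) = (3, 2).
ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, shift=(-2, -3), axis=(0, 1))
print(best_motion_vector(cur, ref, x0=24, y0=24))  # expected (3, 2)
```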
- the affine motion model has limitations in representing complex motion patterns, such as non-rigid or deformable objects. More advanced motion models, such as the projective motion model or the elastic motion model, can be employed in traditional compression as an alternative.
- the projective motion model, for example, is defined by a 3x3 homography matrix: [ x′ ; y′ ; w′ ] = [ h11 h12 h13 ; h21 h22 h23 ; h31 h32 h33 ] · [ x ; y ; 1 ], where w′ is a scaling factor and the final coordinates are obtained by dividing x′ and y′ by w′.
- Elastic motion models such as the free-form deformation (FFD) model, allow for even more flexibility in representing complex motion.
- FFD models define a deformation grid over the image and use spline interpolation to compute the displacement of each pixel based on the grid points.
- the choice of motion model depends on the characteristics of the video content and the desired trade-off between compression efficiency and computational complexity.
- video compression standards such as H.264/AVC or H.265/HEVC, often use hierarchical motion estimation and compensation techniques, where the video frames are decomposed into a pyramid of resolutions, and motion estimation is performed at each level to capture both large-scale and small-scale motion.
- In AI-based compression pipelines, by contrast, warping has a more indirect role: facilitating the estimation of a more accurate representation of flow at different resolutions using trained neural network pyramid layers, as shown in e.g. Figure 6, which in turn can facilitate more accurate reconstruction by a fully neural network based residual encoder/decoder module, as shown in Figure 6b.
- the present inventors have realised that there are some improvements that can be made to AI-based flow estimation when working in YUV 4:2:2 and/or 4:2:0 space.
- When flow is estimated from the full-resolution luma channel, the output flow representation that is fed into the warping step is at full resolution.
- the final output that is eventually derived from the warped frame is to be assembled from the luma and chroma information. Accordingly, the chroma information still has to be reintroduced and recombined with the luma information in some way.
- If the chroma information is missing entirely from the outputs of the flow estimation steps, then it would have to be sent separately in the bitstream so that the decode side gets the chroma information one way or another. Instead of sending chroma information separately, it is combined again with the luma information during warping. Accordingly, the luma (Y) channel information from which flow is being estimated is recombined with the chroma (UV) channel information before warping is performed on the combined YUV information.
- the present inventors have realised that the overall effect on image reconstruction accuracy and compression performance that the warping step provides in the architecture of Figure 6 is approximately the same irrespective of whether the warping step is performed in the full-resolution of the luma (Y) channel or in the half-resolution of the chroma (UV) channels.
- performing warping in the half-resolution of the chroma (UV) channels is substantially faster computationally.
- the full-resolution luma (Y) channel information and any flow information derived therefrom can be downsampled to match the lower resolution of the (UV) channels before the warping step is performed in this lower resolution.
- the resulting warped, combined YUV information is then used to estimate the flow for the next resolution layer of the feature pyramid.
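- A minimal sketch of warping at the chroma resolution is given below, assuming PyTorch, YUV 4:2:0 planes and bilinear sampling via grid_sample; the halving of the flow values when downsampling and the averaging used for downsampling are illustrative assumptions rather than the exact implementation of Figure 9.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Bilinearly warp `img` (B, C, H, W) by `flow` (B, 2, H, W) given in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalise sampling coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def warp_in_chroma_resolution(luma, chroma, flow_full):
    """Warp both luma and chroma at the (half) chroma resolution of YUV 4:2:0.

    `luma`: (B, 1, H, W), `chroma`: (B, 2, H/2, W/2), `flow_full`: (B, 2, H, W).
    The full-resolution flow is downsampled and its displacements halved.
    """
    flow_half = 0.5 * F.avg_pool2d(flow_full, kernel_size=2)
    luma_half = F.avg_pool2d(luma, kernel_size=2)            # match chroma resolution
    return warp(torch.cat([luma_half, chroma], dim=1), flow_half)  # (B, 3, H/2, W/2)

out = warp_in_chroma_resolution(torch.rand(1, 1, 64, 64),
                                torch.rand(1, 2, 32, 32),
                                torch.zeros(1, 2, 64, 64))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```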
- FIG. 9 illustratively shows an example modified flow module 900 of the present disclosure demonstrating a non-limiting implementation example of concept 2.
- the inputs x̂_t-1 and x_t are pre-processed to select respectively the luma and chroma channels.
- the luma channel is fed into the feature pyramid layers as in Figure 6, but the chroma channels are instead fed directly into the warping steps 608, 609. For the downsampling pyramid layer where the native chroma channel resolution matches the luma channel resolution, no further downsampling of the chroma channel is performed.
- For other pyramid layers, a downsampling step may be performed on the chroma channels so that the warping is performed using matching luma and chroma channel resolutions.
- the operation of the pre-processing modules 901a and 901b to select the respective luma and chroma channels may correspond to that described in relation to concept 1 above. It will further be appreciated that concepts 1 and 2 may also be combined alone or together with concept 3 described below.
- In YUV 4:2:0 or YUV 4:2:2 data, the chroma (UV) channels have a lower resolution than the luma (Y) channel; the shape of the input tensor is accordingly more complicated and convolutions of the neural networks of the AI-based compression pipeline cannot be applied in a straightforward way to a non-uniformly shaped tensor.
- the present concept 3 is directed to solving this problem.
- One approach is to upsample the two chroma (UV) channels to match the resolution of the luma (Y) channel to produce a uniformly shaped input tensor (effectively pre-processing YUV 4:2:0 or YUV 4:2:2 data into YUV 4:4:4 data) before performing any convolutions on the, now uniformly shaped, tensor.
- the method may comprise receiving an input tensor in YUV 4:2:0 format, which may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV).
- the shape of the tensor is extracted and upsampling is performed using a scale factor of 2 on the U and V channels.
- the Y channel is left alone.
- the Y channel is then stacked with the now upsampled U and V channels and the resulting tensor is returned.
- the returned tensor now has a uniform shape where all the channels have the same resolution and accordingly convolutions can be performed on the tensor in the usual way.
- the upsampling may comprise one or more of: nearest neighbour interpolation, bilinear interpolation, bicubic interpolation, a transposed convolution (deconvolution) or any other upsampling technique known to the skilled person.
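- A minimal sketch of this first, naive approach is shown below, assuming PyTorch tensors with the Y plane at (B, 1, H, W) and the U and V planes at (B, 1, H/2, W/2); the choice of nearest neighbour interpolation and the function name are illustrative only.

```python
import torch
import torch.nn.functional as F

def yuv420_to_444(y, u, v, mode="nearest"):
    """Upsample the chroma planes of a YUV 4:2:0 frame and stack into a 4:4:4 tensor.

    y: (B, 1, H, W); u, v: (B, 1, H/2, W/2).  Returns a (B, 3, H, W) tensor on
    which standard convolutions can be applied.
    """
    u_up = F.interpolate(u, scale_factor=2, mode=mode)
    v_up = F.interpolate(v, scale_factor=2, mode=mode)
    return torch.cat([y, u_up, v_up], dim=1)

frame = yuv420_to_444(torch.rand(1, 1, 64, 64),
                      torch.rand(1, 1, 32, 32),
                      torch.rand(1, 1, 32, 32))
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```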
- One problem with this approach is that it can be slow. This is because (i) upsampling two whole channels requires additional computation time and (ii) the rest of the pipeline then operates in a higher resolution which means there are more computations overall (e.g. in the flow estimation, the warping, the residual estimation, and so on).
- This first approach is accordingly referred to as a naive approach as it focuses on simplicity and does not take into account any synergies that YUV 4:2:0 may have with an AI-based compression pipeline.
- this approach is computationally slow and substantially increases run time as the entire compression pipeline would be running in the higher YUV 4:4:4 space.
- Another naive approach is to achieve matching tensor dimensions by introducing two or more additional convolution operations to an input layer of the pipeline that operate with one stride for the Y channel and a different stride for the UV channels.
- the shape of the tensor is identified and the respective convolution kernels for the Y channel and for the UV channels are identified.
- the Y, U and V channels are identified, the stride 4 convolution is performed on the Y channel, and the stride 2 convolution is performed on the U and V channels.
- the resulting convolved channels now have matching dimensions so can be summed to produce an overall uniformly shaped tensor which can be fed into the neural networks of the AI-compression pipeline in the usual way.
- This approach is an improvement over the first naive approach because the resulting tensor after the convolution does not have upsampled U and V channels so the runtime of the overall pipeline is faster.
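- A minimal sketch of this second approach is given below, assuming PyTorch; the choice of stride 4 for the luma plane and stride 2 for the chroma planes follows the example above, while the kernel sizes, output channel count and remaining details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StridedYUVInput(nn.Module):
    """Map a YUV 4:2:0 frame to a single uniformly shaped feature tensor.

    The luma plane is convolved with stride 4 and the chroma planes with
    stride 2 so that both outputs have spatial size (H/4, W/4) and can be summed.
    """
    def __init__(self, out_channels=32):
        super().__init__()
        self.conv_y = nn.Conv2d(1, out_channels, kernel_size=5, stride=4, padding=2)
        self.conv_uv = nn.Conv2d(2, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, y, uv):
        # y: (B, 1, H, W), uv: (B, 2, H/2, W/2)
        return self.conv_y(y) + self.conv_uv(uv)

layer = StridedYUVInput()
out = layer(torch.rand(1, 1, 64, 64), torch.rand(1, 2, 32, 32))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```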
- In the approach of the present concept 3, a space to depth pixel unshuffle operation with a block size of 4 is applied to the luma channel, and a space to depth pixel unshuffle operation with a block size of 2 is applied to each of the chroma channels. These are then stacked to produce the final tensor comprising 24 channels. That is, the space to depth pixel unshuffle operation with a block size of 4 divides the luma channel into 4x4 blocks of pixels and distributes the pixels of each block into the channel dimension to produce 16 channels.
- the space to depth pixel unshuffle operation with a block size of 2 divides the U channel into 2x2 blocks of pixels and distributes these into the channel dimension to produce 4 channels, and the same applies to the V channel, which produces 4 more channels, giving 24 channels overall.
- This approach results in two advantages over the above-described naive approaches. Firstly, the present inventors have realised that, on many hardware devices, it is faster to perform operations on data that has a smaller spatial resolution and larger channel dimension than it is to perform the same operations on larger spatial resolutions but smaller channel dimensions.
- Secondly, the space to depth pixel unshuffle operations have the synergistic effect of increasing the receptive field of any subsequent convolution operations of the neural networks as they perform their respective convolutions on the input tensor (for example, in the flow encoder/decoder and/or residual encoder/decoder). That is, in a given convolution window, the output can only be influenced by whatever information is in the input.
- For a single channel, the receptive field is the single n×n grid of pixels of that channel covered by the convolution window. If we add additional channels to that convolution window, the output can now be influenced by the additional channel information. If we distribute pixels that originated from outside the n×n grid into one or more additional/new channels that are included in the convolution window, then we are allowing the output to be influenced by these new pixels from outside the n×n grid, thus indirectly increasing the receptive field of the convolution without needing to increase the spatial dimensions of the n×n grid.
- the increased receptive field allows the neural networks of the pipeline to harness spatial correlations between pixels better to thereby more efficiently learn what information is redundant and can be compressed away, thereby achieving improved compression rate and improved image reconstruction accuracy for a given bit rate.
- distributing the spatial dimension information into the channel dimensions using the applicable space to depth pixel unshuffle operations on the respective Y and UV channels solves the problem of applying convolutions throughout the pipeline to the non-uniformly shaped YUV 4:2:0 input tensor while, at the same time, the resulting 24-channel tensor achieves improved compression rates and image reconstruction accuracy in a way that is computationally efficient and achieves runtime speed ups of the order of milliseconds.
- the shape of the tensor is identified and the Y, U and V channels extracted.
- a pixel unshuffle operation with block size 4 is applied to the Y channel
- a pixel unshuffle operation with block size 2 is applied to the U channel
- a pixel unshuffle operation with block size 2 is applied to the V channel.
- the resulting 16 channel Y tensor, 4 channel U tensor and 4 channel V tensor are stacked to produce a 24 channel tensor of uniform shape which can be fed into the flow and/or residual modules of the AI-based compression pipeline as before. It will be appreciated that the above described block sizes and dimensions for the reshaping operations (e.g. block sizes of 4 and 2) are exemplary only, and other block sizes and dimensions are also envisaged.
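- A minimal sketch of this space to depth approach is given below using PyTorch's pixel_unshuffle, with a block size of 4 on the Y plane and 2 on each of the U and V planes as described above; tensor layouts and function names are otherwise illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def yuv420_space_to_depth(y, u, v):
    """Pack a YUV 4:2:0 frame into a uniformly shaped 24-channel tensor.

    y: (B, 1, H, W); u, v: (B, 1, H/2, W/2).  A block size of 4 on the luma
    plane and 2 on each chroma plane yields tensors of spatial size (H/4, W/4)
    with 16 + 4 + 4 = 24 channels after stacking.
    """
    y16 = F.pixel_unshuffle(y, downscale_factor=4)   # (B, 16, H/4, W/4)
    u4 = F.pixel_unshuffle(u, downscale_factor=2)    # (B, 4, H/4, W/4)
    v4 = F.pixel_unshuffle(v, downscale_factor=2)    # (B, 4, H/4, W/4)
    return torch.cat([y16, u4, v4], dim=1)

packed = yuv420_space_to_depth(torch.rand(1, 1, 64, 64),
                               torch.rand(1, 1, 32, 32),
                               torch.rand(1, 1, 32, 32))
print(packed.shape)  # torch.Size([1, 24, 16, 16])
```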
- the shape of the input tensor is determined and the number of spatial dimensions to be reshaped is calculated.
- An empty list, referred to here as the intermediate shape list, is initialised to store the shape of the intermediate reshaped tensor.
- the remaining dimensions (non-spatial dimensions) from the input tensor are appended to the intermediate shape list, and the input tensor is reshaped according to this list to obtain the intermediate reshaped tensor.
- the above generalised approach accordingly provides a framework for distributing pixels from the spatial dimension into channel dimensions to increase the receptive window of the subsequent convolution operations of the AI-based compression pipeline in a way that is compute efficient.
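- A sketch of such a generalised space to depth reshape is shown below, assuming PyTorch; it follows the intermediate-shape idea described above, although the exact channel ordering it produces may differ from that of a library pixel unshuffle, and the variable names are illustrative.

```python
import torch

def space_to_depth(x, block_size):
    """Generalised space-to-depth for a tensor of shape (B, C, *spatial).

    Each spatial dimension is split into (dim // block_size, block_size); the
    block_size factors are then moved into the channel dimension.
    """
    batch, channels, *spatial = x.shape
    # Intermediate shape: every spatial dim is factored into (dim // b, b).
    inter_shape = [batch, channels]
    for dim in spatial:
        inter_shape.extend([dim // block_size, block_size])
    x = x.reshape(inter_shape)
    # Move the block factors next to the channel dimension.
    n = len(spatial)
    block_axes = [2 + 2 * i + 1 for i in range(n)]
    grid_axes = [2 + 2 * i for i in range(n)]
    x = x.permute([0, 1] + block_axes + grid_axes)
    return x.reshape([batch, channels * block_size ** n] + [d // block_size for d in spatial])

out = space_to_depth(torch.rand(1, 1, 64, 64), block_size=4)
print(out.shape)  # torch.Size([1, 16, 16, 16])
```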
- the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the computer storage medium is not, however, a propagated signal.
- data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program include, by way of example, computers based on general or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices, that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A method for lossy image or video encoding, transmission, and decoding, the method comprising the steps of: with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
Description
Method and data processing system for lossy image or video encoding, transmission and decoding BACKGROUND This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding. There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted. To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that
is not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files. In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations. A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest are inferred from information derived from temporal correlations. In the realm of lossy video compression in particular, this redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality. I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video. P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality. B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames
encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted. An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference. An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural
video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. SUMMARY According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in a respective plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
Optionally, the subset comprises a single channel of the plurality of channels of each of the first image and the second image. Optionally, the plurality of channels of each of the first image and the second image comprise a luma channel and at least one chroma channel. Optionally, the subset comprises the luma channel. Optionally, the luma channel and the at least one chroma channel are defined in a YUV colour space. Optionally, the at least one chroma channel has a different resolution to the luma channel. Optionally, the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space. Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. Optionally, a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions.
Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: using the representation of optical flow information at one of said plurality of resolutions to warp a representation of the first image at a different resolution of said plurality of resolutions. Optionally, wherein a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions. Optionally, the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels. According to an aspect, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels.
According to an aspect, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. Optionally, said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image.
Optionally, the method comprises warping the data in the respective chroma channels using the downsampled representation of the optical flow information. Optionally, the method comprises warping the data in the respective luma channels using the downsampled representation of the optical flow information. Optionally, using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. Optionally, a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions. Optionally, the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data. Optionally, the respective luma channels and chroma channels are defined in a YUV colour space. Optionally, the respective chroma channels have a different resolution to the luma channel. Optionally, the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space.
According to an aspect, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. According to an aspect, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective
luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods. According to an aspect, there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information
using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. Optionally, the information comprises pixel values on said plurality of image channels. Optionally, the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels. Optionally, the transformed data comprises a plurality of image channels each having a same spatial resolution. Optionally, said same spatial resolution is lower than the spatial resolution of said subset of channels. Optionally, said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.
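By way of a non-limiting illustration of distributing information from the spatial dimension into the channel dimension, the following sketch (Python with PyTorch) assumes a YUV 4:2:0 input with a luma plane of shape (N, 1, H, W) and chroma planes of shape (N, 1, H/2, W/2), and uses illustrative pixel-unshuffle block sizes of 4 for luma and 2 for chroma so that all resulting channels share one spatial resolution:

import torch
import torch.nn.functional as F

def space_to_depth_yuv420(y, u, v):
    y_c = F.pixel_unshuffle(y, downscale_factor=4)   # (N, 16, H/4, W/4)
    u_c = F.pixel_unshuffle(u, downscale_factor=2)   # (N, 4, H/4, W/4)
    v_c = F.pixel_unshuffle(v, downscale_factor=2)   # (N, 4, H/4, W/4)
    # All channels now have the same spatial resolution and can be concatenated
    # along the channel dimension, giving 24 channels from the original 3.
    return torch.cat((y_c, u_c, v_c), dim=1)         # (N, 24, H/4, W/4)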
Optionally, the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data. Optionally, the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data. Optionally, the transformation comprises performing a pixel unshuffle operation on the data. Optionally, the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data. Optionally, the transformation comprises performing a convolution operation on the data. Optionally, the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data. Optionally, the transformation comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution. Optionally, one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and
the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system. According to an aspect of the present disclosure, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving the latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and
with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect of the present disclosure, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. According to an aspect, there is provided data processing apparatus configured to perform any of the above methods. According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

Aspects of the invention will now be described by way of examples, with reference to the following figures in which:
Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
Figure 4 illustrates an example of a video compression, transmission and decompression system.
Figure 5 illustrates an example of a video compression, transmission and decompression system.
Figure 6 illustrates an example of a flow module of a video compression, transmission and decompression system.
Figure 7a illustrates an example of a video compression, transmission and decompression system.
Figure 7b illustrates an example of a video compression, transmission and decompression system.
Figure 8 illustrates an example of a flow module of a video compression, transmission and decompression system.
Figure 9 illustrates an example of a flow module of a video compression, transmission and decompression system.

DETAILED DESCRIPTION OF THE DRAWINGS

Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process.
In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image. Each H × W data point of the image represents a pixel value of the image at the corresponding location. Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue components of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video. The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference
between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network. Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation. AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network. Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a
weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer. Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network. To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural
network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the network parameters θ. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, where simultaneous training is performed with back-propagation through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
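The simultaneous update of a plurality of networks described above may be sketched as follows (Python with PyTorch; the encoder, decoder, loss_fn and optimiser names are placeholders rather than a specific implementation of the present disclosure):

import torch

def train_step(encoder, decoder, optimiser, x, expected, loss_fn):
    latent = encoder(x)
    output = decoder(latent)
    loss = loss_fn(output, expected)
    optimiser.zero_grad()
    loss.backward()   # the gradient of the loss is back-propagated through decoder and encoder
    optimiser.step()  # the parameters of all networks are updated simultaneously
    return loss.item()

# A single optimiser over the parameters of every network realises the simultaneous,
# end-to-end update, for example:
# optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)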
It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function. The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data. Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution. The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to
reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge. The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting. It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and depend, for example, on a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately.
In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ ∗ R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network. In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/ xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/). An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the
continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function. In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be, for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network. In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder. Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close as possible to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the
lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions. In other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference. Something which is closely linked to the entropy encoding of the latent distribution and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or another quantisation model may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the
rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term. In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned, which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving an optimal rate and distortion trade-off. The system described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline. The AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder fhθ and a trained neural
network 125 acting as a hyper-decoder ghθ. An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above. In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to the trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present
on the device used for encoding to provide the parameters to be used in the entropy encoding process 150. Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150, 155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised. To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input, providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between a ground truth input and a fake input.
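A minimal sketch of such a discriminator loss (Python with PyTorch; disc is a hypothetical discriminator module returning one logit per input, x is the input image 5 and x_hat is the output image 6) is:

import torch
import torch.nn.functional as F

def discriminator_loss(disc, x, x_hat):
    real_score = disc(x)                 # a ground truth input should receive a high score
    fake_score = disc(x_hat.detach())    # a reconstructed (fake) input should receive a low score
    loss_real = F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
    loss_fake = F.binary_cross_entropy_with_logits(fake_score, torch.zeros_like(fake_score))
    # Minimising this loss maximises the difference in indication between ground truth and fake inputs.
    return loss_real + loss_fake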
When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed,
transmitted and decompressed individually. The received frames may then be grouped to obtain the original video. The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time. The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
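A non-limiting sketch of applying a flow map by bilinear warping (Python with PyTorch; flow is assumed to be a dense map of per-pixel displacements in pixels with shape (N, 2, H, W)) is:

import torch
import torch.nn.functional as F

def bilinear_warp(frame, flow):
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(frame.dtype)   # (H, W, 2) in x, y order
    coords = base + flow.permute(0, 2, 3, 1)               # absolute sampling positions
    gx = 2.0 * coords[..., 0] / (w - 1) - 1.0              # normalise to [-1, 1] for grid_sample
    gy = 2.0 * coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

Each pixel of the warped output is sampled from the earlier frame at the position indicated by the flow vector for that pixel, which is one way of realising the warping or motion compensation referred to above.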
An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline. As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in x0 indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline. In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at
position t), but also one or more previously reconstructed images x̂t−1 from an earlier frame t-1. As described above, the previously reconstructed image x̂t−1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides. At the encode side, previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed xt and the previously reconstructed image from an earlier frame x̂t−1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂t−1 and xt, which is indicative of inter-frame movement of pixels or pixel groups between x̂t−1 and xt. The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f̂. The reconstructed flow map f̂ is applied to the previously reconstructed image x̂t−1 to generate a warped image x̂t−1,w. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further
envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image x̂t−1,w is a prediction of how the previously reconstructed image x̂t−1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of xt and x̂t−1. As with the I-frame, the reconstructed flow map f̂ and corresponding warped image x̂t−1,w may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides. In the example of Figure 3, both the image being compressed xt and the warped image x̂t−1,w are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image xt and the warped image x̂t−1,w. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r̂ from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between xt and the warped image x̂t−1,w and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function.
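The hand-crafted residual alternative mentioned above may be sketched as follows (Python with PyTorch tensors; residual_codec is a hypothetical autoencoder that compresses and reconstructs the pre-calculated residual map):

def handcrafted_residual_roundtrip(x_t, warped_prev, residual_codec):
    residual = x_t - warped_prev              # pre-calculated residual map
    residual_hat = residual_codec(residual)   # compressed, transmitted and reconstructed
    x_t_hat = warped_prev + residual_hat      # combination by addition on the decode side
    return x_t_hat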
Finally, on the decode side, the residual map r̂ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂t which is a reconstruction of image xt and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress xt+1, and so on until an entire video stream or chunk of a video stream has been processed. Alternatively, the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream by removing the connection between x̂t−1,w and the output of the residual block 210, thereby eliminating any direct combination step with the warped previously decoded image to speed up inference. In this case, the flow information is intuitively understood to be indirectly captured within the residual information, which the residual decoder is able to learn to use to directly reconstruct the output image x̂t. Alternatively, the residual autoencoder may be trained to reconstruct the frame x̂t directly from the entropy decoded bitstream in combination with some representation of flow injected into one or more layers of the residual decoder. In this case, the flow information is intuitively understood to be indirectly captured within the injected information, which the residual decoder is able to learn to use while decoding the latent representation of flow information to directly reconstruct the output image x̂t. Thus, for a block of video frames comprising an I-frame and N subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual
map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable. Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689). The above-described flow and residual based approach is highly effective at reducing the amount of data that is to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂t−1) is available, the encode side only compresses and transmits a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame. Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively. That is, compute
for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU. The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications. The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature. For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations. A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above.
Concept 1: Single-channel flow estimation

As described above, two commonly used color spaces in digital image processing are RGB (Red, Green, Blue) and YUV. While RGB is widely used in computer graphics and digital displays, YUV is frequently employed in video compression and transmission. In more detail, RGB is an additive color model that represents colors as a combination of red, green, and blue light. Each color channel is typically represented by an 8-bit value, ranging from 0 to 255. The combination of these three channels allows for the representation of a wide range of colors. In the RGB color space, a color is represented as a triplet (R, G, B), where R, G, and B are the intensity values of the red, green, and blue components, respectively. For example, (255, 0, 0) represents pure red, (0, 255, 0) represents pure green, and (0, 0, 255) represents pure blue. Black is represented as (0, 0, 0), while white is represented as (255, 255, 255).

YUV Color Space

In contrast, YUV separates the color information into two components: luminance (Y) and chrominance (U and V). The YUV color space takes advantage of the human visual system’s sensitivity to brightness and color information.

Luminance (Y)

The luminance (or Luma) component (Y) represents the brightness or intensity of a pixel. It is calculated as a weighted sum of the RGB components: Y = w1R + w2G + w3B, where w1, w2, w3 are weights assigned to each color channel based on the human eye’s sensitivity to different wavelengths of light. The green channel is given the highest weight because the human eye is most sensitive to green light. A non-limiting example set of weights may be: Y = 0.299R + 0.587G + 0.114B. The chrominance components (U and V) represent the color information in the YUV color space. They are calculated by subtracting the luminance value from the blue and red color channels, respectively, for example: U = B − Y and V = R − Y. YUV can also be defined in a digitised representation referred to as BT.601. The BT.601 standard, also known as CCIR 601, provides recommendations for standard-definition digital video. It specifies the digitization of the YUV color space components using an 8-bit representation, essential for converting analog video signals into a digital format for television broadcasts, DVDs, and early streaming video formats. In digital systems, the Y, U, and V components are represented by 8-bit values, allowing for 256 levels of intensity ranging from 0 to 255. This quantization process converts the continuous range of YUV values into discrete levels suitable for digital storage and processing. The BT.601 standard defines the conversion formulas from RGB to YUV in a digital context, taking into account digital system characteristics and the need for efficient color information encoding. In one illustrative example, the digital representation of the luminance (Y) component from RGB values is given by the formula: Y = 16 + (65.481 · R + 128.553 · G + 24.966 · B). The chrominance components may then be calculated with the following formulas: U = 128 + (−37.797 · R − 74.203 · G + 112.0 · B) and V = 128 + (112.0 · R − 93.786 · G − 18.214 · B). In these formulas, R, G and B are the digital values of the red, green, and blue components, normalized to the range of 0 to 1 (e.g., a value of 255 in an 8-bit system is represented as 1.0). The constants include offsets (16 for Y and 128 for U and V) to center the chrominance components, and scaling factors to adjust the amplitude of the signals within the 8-bit range.
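A minimal sketch of the BT.601-style conversion above (Python; r, g and b are assumed to be normalised to the range 0 to 1, and the outputs are 8-bit digital Y, U and V levels) is:

def rgb_to_yuv_bt601(r, g, b):
    y = 16.0 + (65.481 * r + 128.553 * g + 24.966 * b)
    u = 128.0 + (-37.797 * r - 74.203 * g + 112.0 * b)
    v = 128.0 + (112.0 * r - 93.786 * g - 18.214 * b)
    return y, u, v

# For example, pure white (1.0, 1.0, 1.0) maps to approximately (235, 128, 128).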
More generally, in YUV color spaces, the U component represents the difference between the blue channel and the luminance, while the V component represents the difference between the red channel and the luminance. By separating the luminance and chrominance information, YUV allows for more effective compression techniques because the human visual system is more sensitive to changes in brightness than changes in color, so the luminance component can be compressed with less loss of perceived quality compared to the chrominance components. The YUV color space also allows for subsampling of the chrominance components to further reduce the amount of data required for video transmission. Subsampling involves reducing the spatial resolution of the U and V components while preserving the full resolution of the luminance component. Known subsampling schemes include:
4:4:4 - No subsampling. The U and V components have the same resolution as the luminance component.
4:2:2 - The U and V components are subsampled horizontally by a factor of 2.
4:2:0 - The U and V components are subsampled both horizontally and vertically by a factor of 2.
Subsampling reduces the amount of color information without significantly impacting the perceived visual quality, as the human eye is less sensitive to color details than brightness details. Consider next the illustrative flow-residual compression pipeline shown in Figure 5, which shows flow encoder/decoder networks and residual encoder/decoder networks of an AI based
compression pipeline. This may be, for example, similar to the AI based compression pipeline of the type shown in Figure 3. The flow encoder neural network takes a current image xt and a previous image xt−1, encodes these into a flow latent representation which is optionally quantised and entropy encoded and transmitted as a bitstream. On the decode side, the bitstream is received and entropy decoded into the flow latent representation that the flow decoder neural network uses as input to produce a representation of flow f̂, e.g. a flow map. The flow map f̂ is applied to a previously decoded image x̂t−1 to generate a warped version of that previously decoded image x̂t−1,w. The warped version of the previously decoded image x̂t−1,w is then fed into the residual encoder network, together with the current image xt, to produce a residual latent representation which is optionally quantised and entropy encoded and transmitted as a bitstream. On the decode side, the bitstream is received and entropy decoded back into the residual latent representation and used by the residual decoder neural network, in combination with information from the warped version of the previously decoded image x̂t−1,w, to produce the reconstructed image x̂t. In the case of Figure 5, the information associated with the warped previously decoded image x̂t−1,w is optionally first processed by a module referred to hereinafter as a composition adapter, for example to downsample and/or pad it before it is fed into the residual decoder together with the entropy decoded residual latent representation to produce the final reconstructed image x̂t. This process may then be repeated for xt+1 and so on to encode, transmit, and decode a sequence of frames.
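A non-limiting sketch of a composition adapter of the kind described above (Python with PyTorch; the scale factor and padding amounts are illustrative placeholders rather than values required by the pipeline) is:

import torch.nn.functional as F

def composition_adapter(warped_prev, scale=0.5, pad=(0, 0, 0, 0)):
    # Downsample the warped previously decoded image ...
    adapted = F.interpolate(warped_prev, scale_factor=scale,
                            mode="bilinear", align_corners=False)
    # ... and optionally pad it spatially before it is fed to the residual decoder
    # together with the entropy decoded residual latent representation.
    return F.pad(adapted, pad, mode="replicate")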
An example implementation of the flow encoder neural network 207 is shown in Figure 6. Figure 6 illustrates an example of a flow module, in this case a network 600, configured to estimate information indicative of a difference between an image xt−1 and an image xt, e.g. flow information. Figure 6 is provided as an example of how such flow information may be calculated between two input or output images. The network 600 comprises a set of layers 601a, 601b respectively for an image xt−1 and an image xt from respective times or positions t − 1 and t in a sequence of image frames. The set of layers 601a, 601b may define one or more convolution operations and/or nonlinear activations in the layers to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 × w/2 downsampling in a first layer, h/4 × w/4 downsampling in a second layer, h/8 × w/8 downsampling in a third layer, h/16 × w/16 downsampling in a fourth layer, h/32 × w/32 downsampling in a fifth layer, h/64 × w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other levels are also envisaged. With the downsampling operations performed and the corresponding pyramid of feature maps generated, a first cost volume 602 is calculated at the most coarse level between the pixels of the first image xt−1 and the corresponding pixels in the second image xt. Cost volumes define the matching cost of matching the pixels in the initial image with the pixels in the later image. That is, the closeness of each pixel, or a subset of all pixels, in the initial image
to one or more pixels in the later image is determined using a measure of closeness such as a vector dot product, a cosine similarity, a mean absolute difference, or some other measure of closeness. This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the measure of closeness is being calculated. Finally, a first flow 603 can be estimated from the first cost volume 602. This may be achieved using, for example, a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person. The same process is then repeated for the other levels of coarseness to calculate a second cost volume 604 and second flow 605, and so on until the cost volumes and flows associated with each of the levels of coarseness have been calculated, up to the final cost volume 606 and flow 607. The weights and/or biases of any activation layers in network 600 (e.g. optionally in the downsampling convolution layers and/or in a flow extractor network that produces flow maps from the cost volumes) are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline. The trainable nature of these parameters provides the network 600 with flexibility to produce feature maps at
each level of spatial resolution (i.e. pyramid feature maps) and/or at the flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on). For example, it allows the network 600 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline, as is shown in the illustrative example of Figure 7a, described later. Optionally, for each level of coarseness or resolution, the flow of the previous level or levels of coarseness or resolution may be used to warp 608, 609 the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t-1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details. The inventors have realized that removing warping entirely or in some levels of coarseness or resolution can substantially reduce the runtime of the flow calculation while maintaining good levels of flow accuracy. As the warping process uses inputs from different levels of coarseness or spatial resolution, the flow estimation output may be upsampled 610, 611 first to match the coarseness or resolution of the feature map to which the flow is being applied in the warping process. The outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
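A minimal sketch of how a local cost volume and flow-guided warping of this kind might be implemented is given below. It is illustrative only and assumes a PyTorch-style tensor layout (batch, channels, height, width); the function names local_cost_volume and warp_with_flow, the default search radius, and the bilinear sampling choice are assumptions rather than the exact implementation of network 600.

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat_prev, feat_curr, radius=4):
    # Correlate each pixel of feat_prev with pixels of feat_curr inside a
    # (2*radius+1)^2 search window, producing one cost channel per offset.
    b, c, h, w = feat_prev.shape
    padded = F.pad(feat_curr, [radius] * 4)
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # Dot product over the channel dimension as the closeness measure.
            costs.append((feat_prev * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)  # shape: (b, (2r+1)^2, h, w)

def warp_with_flow(feat, flow):
    # Bilinearly sample feat at positions displaced by flow (in pixels).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, h, w)
    coords = grid.unsqueeze(0) + flow                            # (b, 2, h, w)
    # Normalise to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)        # (b, h, w, 2)
    return F.grid_sample(feat, grid_norm, mode="bilinear", align_corners=True)
```

In a coarse-to-fine arrangement, the flow estimated at a coarser level would first be upsampled (e.g. with F.interpolate) and scaled before being passed to warp_with_flow at the next finer level.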
Note that whilst the term cost volume has been used above, the information indicative of differences between the respective inputs need not be a strict cost volume in the mathematical sense, but may be any representation of this information. For example, a compressively calculated cost volume may be used, applying the principles of compressive sensing to estimate an approximate cost volume by sampling only a small number of pixel differences compared to performing a complete pixel-wise cost volume calculation. This compressively calculated cost volume approach may be applied to all embodiments described herein. As described above, it will be appreciated that running a flow-based compression pipeline once training has been completed relies on estimation of flow, whether by a handcrafted algorithm or through some trained network. The output estimated flow itself may be compressed and transmitted in a bitstream, which is typically done by a dedicated flow encoder network that encodes the flow information into a latent representation distributed according to a distribution that can be efficiently entropy encoded. Irrespective of the flow estimation approach taken, dedicated flow encoders increase run time and partly contribute to preventing learned compression codecs from running in real time or near real time. Consider for example the pipeline of Figure 3. At the encode side, a previously reconstructed image and a new, to-be-encoded image may be used to generate a flow map containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the current image being compressed x_t and the previously reconstructed image from an earlier frame x̂_{t-1} are passed into a flow module 207 that typically has two parts: a flow estimation part, and a dedicated encoder part that encodes the estimated flows into a
latent representation that can be efficiently entropy encoded. The dedicated flow encoder is effectively a specialized network dedicated to producing latent representations of flow that are distributed as close to an optimally entropy encodable distribution as possible. In Figure 3, these two parts are shown as a single component. The flow estimation part may comprise, for example, the flow module 600 of Figure 6. That is, the flow module 600 produces its cost volumes and flows, then passes one or more of these into the second part: the dedicated flow encoder network that is trained to produce the latent representation of flow that can be efficiently entropy encoded before being sent in a bitstream to the decoder. This approach is slow because the cost volumes and flows must first be calculated before they can be encoded into a latent representation, which is itself a slow process. The latent representation of the cost volume(s) and/or flow(s) is then entropy encoded into a bitstream which is finally transmitted. This multi-step approach increases run time. Instead, in order to reduce compute and runtime overhead, the present disclosure also envisages omitting the dedicated flow encoder. For example, the output(s) of the flow module 600 are directly entropy encoded and fed into the bitstream without first being encoded into a latent representation. Given that the process of encoding flow module output(s) into a latent representation with a flow compression encoder is computationally expensive, removing this component entirely from the flow compression module 207 results in a significant decrease in runtime, thereby contributing to the goal of being able to run the pipeline in inference in real time or near real time.
This is, in part, made possible by virtue of the trainable nature of the flow module 600, whereby the weights and biases of one or more of the convolution and/or activation layers of the flow module 600 result in cost volumes or flows that are already distributed according to a distribution that corresponds roughly to that which a dedicated flow encoder may produce. This is illustrated in more detail in Figure 7a. Figure 7a illustratively shows two flow compression modules 700a and 700b which may be used as flow module 207 in Figure 3. The same reference numerals are used for like components. In the first module 700a, two input images x_t and x̂_{t-1} are passed 701a, 701b into a flow module such as flow module 600 of Figure 6, which produces pyramid feature maps 702 of different levels of coarseness, and corresponding cost volumes and/or flows 703. The final flow estimation is then passed to a dedicated flow compression encoder 704 which encodes it to produce a latent representation of the final flow estimation. The output may thus be a latent representation of cost volumes and/or flows at one or more of H × W, H/2 × W/2, H/4 × W/4, H/8 × W/8, H/16 × W/16, H/32 × W/32, H/64 × W/64, or some other resolution. This is finally entropy encoded into a bitstream 705, and transmitted. The bitstream 705 is entropy decoded and the decoded bitstream is passed to a decoder 606 which reconstructs the cost volumes and/or flows which may be used in a flow-based approach as described above in the general concepts section. However, as described above, the approach of the flow compression module 700a with a dedicated flow compression encoder 704 is slow and computationally expensive.
In contrast, the present inventors have realized that omitting the dedicated flow compression encoder 704 entirely and instead directly entropy encoding one or more outputs of the flow module 600 into the bitstream 705, without first passing them through the dedicated flow encoder 704, results in a substantial speed up at run time. Counter-intuitively, this removal of the dedicated flow compression encoder 704 does not appear to appreciably affect the performance of the rest of the compression pipeline, both in terms of distortion and bit rate. This modified approach is illustrated with flow compression module 700b in Figure 7a, where it is apparent that the flow encoder 704 of flow compression module 700b has effectively been chopped out. Thus, instead of encoding the cost volumes and/or flows into a latent representation before the entropy encoding step, the cost volumes and/or flows at one or more of H × W, H/2 × W/2, H/4 × W/4, H/8 × W/8, H/16 × W/16, H/32 × W/32, H/64 × W/64, or some other resolution, are simply entropy encoded directly and sent out in the bitstream. This approach is based on the insight that flow modules such as flow module 600 have trainable parameters and accordingly have a great deal of flexibility in terms of the distribution of outputs they can be trained to produce. The inventors have found that when the dedicated flow compression encoder 704 is included, the flow module 600 has no need to produce outputs in a distribution that can be efficiently entropy encoded because the neural networks that make up the compression pipeline as a whole are simply able to rely on the dedicated flow encoder 704 to minimize any contribution to bitrate that the flow information has in the bitstream. However, when the dedicated flow compression encoder 704 is removed, the networks of the compression pipeline are no longer able to rely on the dedicated flow encoder 704. In its place, the inventors have found that the flow module 600 learns during training to compensate by outputting cost
volumes and/or flows that are similarly distributed to those that would be output by a dedicated flow compression encoder 704. This can be understood as the flow module 600 (that is, the trainable networks within it, such as the flow extractor network and/or activation layers in the convolution layers) effectively being forced to mimic the dedicated flow compression encoder 704 during training when the loss is being minimized, because the system can no longer rely on the (now removed) dedicated flow compression encoder 704 to perform that task. It is envisaged that the training of the flow module 600 may either be performed in an end-to-end manner together with the rest of the compression pipeline, or alternatively in a student-teacher approach where the network of the flow module 600 is the student component, and a known optical flow model and pre-trained flow compression network is the teacher component. Additionally or alternatively, the training of the flow module 600 may be performed separately using data for which the groundtruth flow is known, for example by using 3D animation video data whose groundtruth flow is known a priori through the animation program used to generate it, or using automatic flow generation methods. This counter-intuitive removal of the dedicated flow compression encoder 704 from the flow compression module, to force the other components to effectively take on the tasks previously performed by the dedicated flow compression encoder 704, contributes significantly to a speed up at run time on the encoding side of the compression pipeline. This approach accordingly makes a substantial stride forward towards the goal of achieving real time or near real time performance during inference.
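By way of illustration only, the difference between the two encode-side paths discussed above might be sketched as follows. The class names, the rounding-based quantisation step and the entropy_encode callable are assumptions introduced for the sketch and are not taken from the disclosure itself.

```python
import torch
import torch.nn as nn

class FlowCompressionWithEncoder(nn.Module):
    """Path of module 700a: flow module followed by a dedicated flow encoder."""
    def __init__(self, flow_module, flow_encoder, entropy_encode):
        super().__init__()
        self.flow_module = flow_module      # e.g. a pyramid network such as 600
        self.flow_encoder = flow_encoder    # dedicated latent-producing encoder (704)
        self.entropy_encode = entropy_encode

    def forward(self, x_t, x_prev_hat):
        flows = self.flow_module(x_t, x_prev_hat)   # cost volumes and/or flows
        latent = self.flow_encoder(flows)           # extra network pass (slow)
        return self.entropy_encode(torch.round(latent))

class FlowCompressionDirect(nn.Module):
    """Path of module 700b: the dedicated flow encoder is omitted entirely."""
    def __init__(self, flow_module, entropy_encode):
        super().__init__()
        self.flow_module = flow_module
        self.entropy_encode = entropy_encode

    def forward(self, x_t, x_prev_hat):
        flows = self.flow_module(x_t, x_prev_hat)
        # The flow module is trained so that its outputs are already distributed
        # in a way that can be entropy encoded efficiently.
        return self.entropy_encode(torch.round(flows))
```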
Figure 7b illustrates the introduction of the flow compression module 700b into a flow-residual compression pipeline, such as that of Figure 3, by illustratively showing the encoding and decoding of p-frames of an image stream, whereby a groundtruth image and its previously encoded and decoded reconstruction at t-1 are available. This corresponds generally to the flow-residual compression pipeline shown in Figure 3 and accordingly uses the same reference numbers for corresponding features. However, in this case, the output of the flow module 702 is natively in the distribution of a latent representation and can be immediately transmitted to the decoder, without needing a standalone dedicated flow compression encoder. In addition, it is envisaged that the warped reconstructed, previously decoded image from t-1 may be passed directly to the residual decoder, in addition to whatever is output by the residual encoder 211. This provides the residual decoder with the additional context of the warped reconstructed, previously decoded image. Returning now to the YUV colour space, the presently described pyramidal estimation of flow in AI-based compression pipelines provides some unexpected synergies when working in YUV colour space. Traditionally, AI-based compression research is performed on RGB data in RGB space. In RGB space, as the input tensors (with R, G and B channels) propagate through the networks, the channels in the tensor are all treated equally. For example, during flow estimation, such as described above in connection with Figures 6 to 7b, the pyramid layers and associated operations are performed treating the R, G and B channels of the input tensors equally. This is because, in RGB space, all channels contain an approximately equal amount of movement information in a frame sequence.
The present inventors have realised that, when moving out of RGB space and into YUV space, the flow information is largely contained in the luma channel. Accordingly, flow can be estimated using the flow module operating only on the luma channel. Cutting the number of channels that get fed through the pyramid layers from three channels to a single channel results in a substantial speed up with very little loss in overall performance in terms of image reconstruction accuracy (e.g. measured by distortion such as an MSE score) and compression performance (e.g. measured in bits per pixel). Accordingly, the present concept 1 is directed to replacing a multi-channel input tensor to the flow modules shown in Figures 6 to 7b with a single-channel tensor. Whilst this is envisaged to be the luma channel when working with YUV data, this concept may also be generalised to other types of data where one of the channels contains enough information to sufficiently estimate flow, as well as to the generation of custom data types comprising a plurality of channels where one of the channels is optimised for enabling an AI-based flow module such as that shown in Figures 6 to 7b to determine flow. For completeness, it is noted that concept 1 may also be combined with the other concepts described herein. Figure 8 illustratively shows a modified flow module 800 corresponding to that of Figure 6; like numbered references are used for like components. Additionally, a pre-processing module 801 is introduced before the flow module 800. This may form part of the flow module 800 itself or may form part of some other component of the pipeline, for example as a component of a data intake module (not shown) or other pre-processing modules.
The pre-processing module is configured to select the luma channel from a YUV input to feed into the flow module 800, after which the flow module 800 operates as described above in relation to Figure 6. The pseudocode provided below illustratively shows an exemplary operation of the pre-processing module, configured to select a luma channel from an input in YUV format:
Algorithm 1 Select Luma Channel from YUV Tensor
procedure SelectLumaChannel(yuv_tensor)
  height ← HeightOf(yuv_tensor)
  width ← WidthOf(yuv_tensor)
  luma_channel ← CreateMatrix(height, width)
  for i ← 1 to height do
    for j ← 1 to width do
      luma_channel[i][j] ← yuv_tensor[1][i][j]
    end for
  end for
  Return: luma_channel
end procedure
It will be appreciated that the above example is illustrative only and other methods of implementing a luma channel selection method will be known by the skilled person.
Concept 2: Downsampled YUV warping
One issue that arises in the implementation of concept 1 described above is facilitating the warping operations (for example as shown in Figure 6) performed during flow estimation when working in YUV space. In traditional compression, warping is used to exploit temporal redundancy between consecutive frames. The goal of warping is to estimate and compensate for the motion of objects or regions within a video sequence, allowing for more efficient compression by reducing the amount of information that is to be encoded. A frame is typically divided into smaller blocks, such as
macroblocks or coding units, which are then processed independently. Warping is performed on these blocks to find the best matching block in a reference frame, usually a preceding or succeeding frame, and to generate a motion vector that describes the displacement between the current block and its best match. The warping process can be described mathematically using a motion model. One commonly used motion model is the affine motion model, which assumes that the motion of a block can be represented by a linear transformation. The affine motion model is defined by a 2x3 matrix:

$$A = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix}$$

where $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$ represent the rotation, scaling and shear parameters, and $t_x$, $t_y$ represent the translation parameters. Given a pixel $(x, y)$ in the current block, its corresponding location $(x', y')$ in the reference frame can be computed using the affine transformation:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
The motion estimation process involves searching for the best matching block in the reference frame that minimizes a certain distortion metric, such as the sum of absolute differences (SAD) or the sum of squared differences (SSD). By way of example, the SAD metric can be expressed as:

$$SAD(dx, dy) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - R(i + dx, j + dy) \right|$$

where $C(i, j)$ represents the pixel values of the current block, $R(i + dx, j + dy)$ represents the pixel values of the candidate block in the reference frame shifted by $(dx, dy)$, and $N$ is the block size. The motion estimation search can be performed using various algorithms, such as full search, three-step search, or diamond search, to find the motion vector $(mv_x, mv_y)$ that minimizes the distortion metric:

$$(mv_x, mv_y) = \arg\min_{(dx, dy)} SAD(dx, dy)$$

Once the motion vector is determined, the current block can be reconstructed by copying the pixels from the reference frame at the displaced location indicated by the motion vector. This process is known as motion compensation. However, the affine motion model has limitations in representing complex motion patterns, such as non-rigid or deformable objects. More advanced motion models, such as the projective motion model or the elastic motion model, can be employed in traditional compression as an alternative. These models introduce additional parameters to capture more complex motion patterns at the cost of increased computational complexity. The projective motion model, for example, is defined by a 3x3 homography matrix:
$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
where $w'$ is a scaling factor, and the final coordinates are obtained by dividing $x'$ and $y'$ by $w'$. Elastic motion models, such as the free-form deformation (FFD) model, allow for even more flexibility in representing complex motion. FFD models define a deformation grid over the image and use spline interpolation to compute the displacement of each pixel based on the grid points. The choice of motion model depends on the characteristics of the video content and the desired trade-off between compression efficiency and computational complexity. In practice, video compression standards, such as H.264/AVC or H.265/HEVC, often use hierarchical motion estimation and compensation techniques, where the video frames are decomposed into a pyramid of resolutions, and motion estimation is performed at each level to capture both large-scale and small-scale motion. In contrast, in AI-based compression, warping has a more indirect role: facilitating the estimation of a more accurate representation of flow at different resolutions using trained neural network pyramid layers, as shown in e.g. Figure 6, which in turn can facilitate more accurate reconstruction by a fully neural network based residual encoder/decoder module, as shown in Figure 7b.
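As a purely illustrative aid to the traditional block-matching description above, a minimal full-search SAD sketch is given below. The function name, the numpy-style array layout and the default search range are assumptions introduced for the sketch; it is not part of any codec described herein.

```python
import numpy as np

def full_search_sad(current_block, reference_frame, top, left, search_range=8):
    # Exhaustively search a (2*search_range+1)^2 window in the reference frame
    # for the displacement (dx, dy) minimising the sum of absolute differences.
    n = current_block.shape[0]                 # block size N (square block assumed)
    h, w = reference_frame.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + n > h or x0 + n > w:
                continue                       # candidate block falls outside the frame
            candidate = reference_frame[y0:y0 + n, x0:x0 + n]
            sad = np.abs(current_block.astype(np.int32)
                         - candidate.astype(np.int32)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad
```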
The present inventors have realised that there are some improvements that can be made to AI-based flow estimation when working in YUV 4:2:2 and/or 4:2:0 space. For example, when the inputs into the flow encoder of Figure 6 are luma (Y) channels in full resolution, and the associated UV channels are in a non-full resolution (using, for example, the 4:2:0 or 4:2:2 subsampling scheme), the output flow representation that is fed into the warping step is in full resolution. Downstream, the final output that is eventually derived from the warped frame is to be assembled from the luma and chroma information. Accordingly, the chroma information still has to be reintroduced and recombined with the luma information in some way. That is, if the chroma information is missing entirely from the outputs of the flow estimation steps, then it would have to be sent separately in the bitstream so that the decode side gets the chroma information one way or another. Instead of sending chroma information separately, it is combined again with the luma information during warping. Accordingly, the luma (Y) channel information from which flow is being estimated is recombined with the chroma (UV) channel information before warping is performed on the combined YUV information. The present inventors have realised that the overall effect on image reconstruction accuracy and compression performance that the warping step provides in the architecture of Figure 6 is approximately the same irrespective of whether the warping step is performed in the full resolution of the luma (Y) channel or in the half resolution of the chroma (UV) channels. However, performing warping in the half resolution of the chroma (UV) channels is substantially faster computationally. Accordingly, the full-resolution luma (Y) channel information and any flow information derived therefrom can be downsampled to match the lower resolution of the (UV) channels before the warping step is performed in this
lower resolution. The resulting warped, combined YUV information is then used to estimate the flow for the next resolution layer of the feature pyramid. The same approach may be applied to each resolution layer of the feature pyramids of the architecture of Figure 6 to provide an overall speed up of the flow encoder module of 1-10 milliseconds across a wide variety of different hardware devices, thereby contributing to the goal of real time encoding and decoding speeds. The above described approach can also be generalised beyond YUV to any input data with multiple channels where one or more of the channels is in a different resolution to the rest of the channels in the input data. Figure 9 illustratively shows an example modified flow module 900 of the present disclosure demonstrating a non-limiting implementation example of concept 2. As with Figure 8, the inputs x_{t-1} and x_t are pre-processed to select respectively the luma and chroma channels. This may be performed with separate pre-processing modules 901a, 901b, with a single module, or with some other component of a data intake module of the pipeline, as will be appreciated by the skilled person. The luma channel is fed into the feature pyramid layers as in Figure 6, but the chroma channels are instead fed directly into the warping steps 608, 609. For the downsampling pyramid layer where the native chroma channel resolution matches the luma channel resolution, no further downsampling of the chroma channel is performed. For the downsampling pyramid layer where the native chroma channel resolution does not match the luma channel resolution, a downsampling step (not shown) may be performed on the chroma channels so that the warping is performed using matching luma and chroma channel resolutions.
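A minimal sketch of the downsampled-warping idea of concept 2 is given below, reusing the warp_with_flow helper sketched earlier. It assumes a PyTorch-style channel-first layout, 4:2:0 chroma at half the luma resolution, and hypothetical tensor names; it is illustrative only and not the implementation of Figure 9.

```python
import torch.nn.functional as F

def warp_yuv420_at_chroma_resolution(y, u, v, flow_full_res):
    # y:    (b, 1, H, W)      full-resolution luma
    # u, v: (b, 1, H/2, W/2)  half-resolution chroma (4:2:0)
    # flow_full_res: (b, 2, H, W) flow estimated from the luma channel only
    # Downsample the luma and the flow to the chroma resolution; the flow
    # displacements are halved because the pixel grid is now half as dense.
    y_half = F.interpolate(y, scale_factor=0.5, mode="bilinear", align_corners=False)
    flow_half = 0.5 * F.interpolate(flow_full_res, scale_factor=0.5,
                                    mode="bilinear", align_corners=False)
    # Warp all three channels in the cheaper half resolution.
    y_warped = warp_with_flow(y_half, flow_half)
    u_warped = warp_with_flow(u, flow_half)
    v_warped = warp_with_flow(v, flow_half)
    return y_warped, u_warped, v_warped
```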
The operation of the pre-processing modules 901a and 901b to select the respective luma and chroma channels may correspond to that described in relation to concept 1 above. It will further be appreciated that concepts 1 and 2 may also be combined alone or together with concept 3 described below.
Concept 3: Channel dimension stuffing
In traditional AI-based compression, operating in RGB space with equal resolution R, G and B channels, the implementation of the convolution operations of any of the neural networks is straightforward and requires no special consideration as the shape of the tensor is defined by three channels of equal dimensions. The same consideration applies to 4:4:4 YUV input data where each of the Y, U and V channels is in the same resolution, so the tensor comprises three channels of equal dimensions. However, the situation is different when considering 4:2:2 or 4:2:0 YUV data because the chroma channels (UV) are of different dimensions to the luma channel (Y). The shape of the tensor is accordingly more complicated and convolutions of the neural networks of the AI-based compression pipeline cannot be applied in a straightforward way to a non-uniformly shaped tensor. The present concept 3 is directed to solving this problem. One approach is to upsample the two chroma (UV) channels to match the resolution of the luma (Y) channel to produce a uniformly shaped input tensor (effectively pre-processing YUV
4:2:0 or YUV 4:2:2 into YUV 4:4:4 data) before performing any convolutions on the, now uniformly shaped, tensor. This approach is illustrated in more detail in the following pseudocode:
Algorithm 2 Upsample YUV 4:2:0 Tensor to YUV 4:4:4 Tensor
function upsample_yuv420_to_yuv444_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Perform upsampling for U and V channels
  upsampled_u_channel ← upsample(u_channel, scale_factor=2)
  upsampled_v_channel ← upsample(v_channel, scale_factor=2)
  ⊲ Stack the Y channel with the upsampled U and V channels
  yuv444_tensor ← stack([y_channel, upsampled_u_channel, upsampled_v_channel], axis=-1)
  Return: yuv444_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV).
The shape of the tensor is extracted and upsampling is performed using a scale factor of 2 on
the U and V channels. The Y channel is left alone. The Y channel is then stacked with the now
upsampled U and V channels and the resulting tensor is returned. The returned tensor now has
a uniform shape where all the channels have the same resolution and accordingly convolutions
can be performed on the tensor in the usual way.
The upsampling may comprise one or more of: nearest neighbour interpolation, bilinear
interpolation, bicubic interpolation, a transposed convolution (deconvolution) or any other
upsampling technique known to the skilled person.
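By way of a hedged illustration only, the U and V upsampling step of Algorithm 2 might look as follows in PyTorch; the interpolation mode, the channel-first layout and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def upsample_chroma_to_luma(u, v, mode="bilinear"):
    # u, v: (b, 1, H/2, W/2) chroma planes; returns (b, 1, H, W) planes that can
    # be concatenated with the full-resolution luma plane to form a 4:4:4 tensor.
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    u_up = F.interpolate(u, scale_factor=2, mode=mode, **kwargs)
    v_up = F.interpolate(v, scale_factor=2, mode=mode, **kwargs)
    return u_up, v_up
```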
One problem with this approach however is that it can be slow. This is because (i) upsampling
two whole channels requires additional computation time and (ii) the rest of the pipeline then
operates in a higher resolution which means there are more computations overall (e.g. in
the flow estimation, the warping, the residual estimation, and so on). This first approach is
accordingly referred to as a naive approach as it focuses on simplicity and does not take into
account any synergies that YUV 4:2:0 may have with an AI-based compression pipeline.
However, this approach is computationally slow and substantially increases run time as the
entire compression pipeline would be running in the higher YUV 4:4:4 space.
Another naive approach is to achieve matching tensor dimensions by introducing two or more
additional convolution operations to an input layer of the pipeline that operate with one stride
for the Y channel and a different stride for the UV channels. For example, performing a
convolution on the Y channel with a stride of 4 and a convolution on the UV channels with a
stride of 2 to produce Y, U and V channels of equal dimensions that can then be summed to
produce the overall uniformly shaped YUV tensor to feed into the rest of the neural networks
in the usual way.
This approach is illustrated in the following pseudocode:
Algorithm 3 Perform Convolutions on YUV 4:2:0 Tensor
function convolve_yuv420_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  ⊲ and the U and V channels have half the resolution of the Y channel
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Define the convolution kernels for Y and UV channels
  y_kernel ← get_convolution_kernel(stride=4)
  uv_kernel ← get_convolution_kernel(stride=2)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Perform convolution on the Y channel with stride 4
  convolved_y_channel ← convolve(y_channel, y_kernel, stride=4)
  ⊲ Perform convolution on the U and V channels with stride 2
  convolved_u_channel ← convolve(u_channel, uv_kernel, stride=2)
  convolved_v_channel ← convolve(v_channel, uv_kernel, stride=2)
  ⊲ Sum the convolved Y, U, and V channels
  convolved_tensor ← sum([convolved_y_channel, convolved_u_channel, convolved_v_channel], axis=-1)
  Return: convolved_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV). The shape of the tensor is identified and the respective convolution kernels for the Y channel and for the UV channels are identified. The Y, U and V channels are identified, the stride 4 convolution is performed on the Y channel, and the stride 2 convolution is performed on the U and V channels. The resulting convolved channels now have matching dimensions so can be summed to produce an overall uniformly shaped tensor which can be fed into the neural networks of the AI-compression pipeline in the usual way. This approach is an improvement over the first naive approach because the resulting tensor after the convolution does not have upsampled U and V channels, so the runtime of the overall pipeline is faster. However, the present inventors have realised that a third approach exists that is substantially faster than either of the above naive approaches, and synergistically also results in improved accuracy in terms of image reconstruction and performance in terms of compression rates, thus providing a way to handle the non-uniform shape of the YUV 4:2:0 data in a way that actually improves overall performance of the compression pipeline. This third approach is to emulate a single convolution operation on the non-uniform input by performing two different space to depth pixel unshuffle operations on the non-uniform input
tensor. Specifically, a space to depth pixel unshuffle operation with a block size of 4 is applied to the luma channel, and a space to depth pixel unshuffle operation with a block size of 2 is applied to each of the chroma channels. These are then stacked to produce the final tensor comprising 24 channels. That is, the space to depth pixel unshuffle operation with a block size of 4 takes the luma channel and applies a 4x4 block to the pixels of the luma channel which are distributed in the channel dimension to produce 16 channels. Similarly, the space to depth pixel unshuffle operation with a block size of 2 takes the U channel and applies a 2x2 block to the pixels of the U channel and distributes these in the channel dimension to produce 4 channels, and the same applies to the V channel which produces 4 more channels to give an overall 24 channels. This approach results in two advantages over the above-described naive approaches. Firstly, the present inventors have realised that, on many hardware devices, it is faster to perform operations on data that has a smaller spatial resolution and larger channel dimension than it is to perform the same operations on larger spatial resolutions but smaller channel dimensions. Accordingly, even though the input and outputs of the overall pipeline are the same, converting the input YUV data into a smaller spatial resolution format with more channel dimensions results in a significant speed up of the overall pipeline of the order of milliseconds. Secondly, the space to depth pixel unshuffle operations have the synergistic effect of increasing the receptive field of any subsequent convolution operations of the neural networks as they perform their respective convolutions on the input tensor (for example, in the flow encoder/decoder and/or residual encoder/decoder). That is, in a given convolution window,
the output can only be influenced by whatever information is in the input. If a convolution window is only a single n×n grid of pixels of a single channel, the receptive field is the single n×n grid of pixels of that channel. If we add additional channels to that convolution window, the output can now be influenced by the additional channel information. If we distribute pixels that originated from outside the n×n grid into one or more additional/new channels that are included in the convolution window, then we are allowing the output to be influenced by these new pixels from outside the n×n grid, thus indirectly increasing the receptive field of the convolution without needing to increase the spatial dimensions of the n×n grid. In turn, the increased receptive field allows the neural networks of the pipeline to better harness spatial correlations between pixels to thereby more efficiently learn what information is redundant and can be compressed away, thereby achieving an improved compression rate and improved image reconstruction accuracy for a given bit rate. Thus, in the presently described example, distributing the spatial dimension information into the channel dimensions using the applicable space to depth pixel unshuffle operations on the respective Y and UV channels solves the problem of applying convolutions throughout the pipeline to the non-uniformly shaped YUV 4:2:0 input tensor while, at the same time, the resulting 24-channel tensor achieves improved compression rates and image reconstruction accuracy in a way that is computationally efficient and achieves runtime speed ups of the order of milliseconds. The above approach is illustrated in more detail in the pseudocode below:
Algorithm 4 Pixel Unshuffle YUV 4:2:0 Tensor
function pixel_unshuffle_yuv420_tensor(yuv420_tensor)
  ⊲ Assuming yuv420_tensor has shape [height, width, 3]
  ⊲ where the last dimension represents Y, U, and V channels
  ⊲ and the U and V channels have half the resolution of the Y channel
  height, width, _ ← shape(yuv420_tensor)
  ⊲ Extract the Y, U, and V channels from the yuv420_tensor
  y_channel ← yuv420_tensor[:, :, 0]
  u_channel ← yuv420_tensor[::2, ::2, 1]
  v_channel ← yuv420_tensor[::2, ::2, 2]
  ⊲ Apply pixel unshuffle to the Y channel with block size 4
  unshuffled_y_tensor ← pixel_unshuffle(y_channel, block_size=4)
  ⊲ Apply pixel unshuffle to the U channel with block size 2
  unshuffled_u_tensor ← pixel_unshuffle(u_channel, block_size=2)
  ⊲ Apply pixel unshuffle to the V channel with block size 2
  unshuffled_v_tensor ← pixel_unshuffle(v_channel, block_size=2)
  ⊲ Stack the unshuffled Y, U, and V tensors along the channel dimension
  output_tensor ← stack([unshuffled_y_tensor, unshuffled_u_tensor, unshuffled_v_tensor], axis=-1)
  Return: output_tensor
end function
That is, the method may comprise receiving an input tensor in YUV 4:2:0 format; this may be, for example, the format in which the input data has been captured natively by an image capture device, whereby the input tensor has a height dimension and a width dimension representing x-y coordinates in an image comprising a plurality of pixels, and whereby the input tensor has three channels made up of a luma channel (Y) and two chroma channels (UV). The shape of the tensor is identified and the Y, U and V channels extracted. A pixel unshuffle operation with block size 4 is applied to the Y channel, a pixel unshuffle operation with block size 2 is
applied to the U channel, and a pixel unshuffle operation with block size 2 is applied to the V channel. The resulting 16-channel Y tensor, 4-channel U tensor and 4-channel V tensor are stacked to produce a 24-channel tensor of uniform shape which can be fed into the flow and/or residual modules of the AI-based compression pipeline as before. It will be appreciated that the above described block sizes and dimensions for the reshaping operations (e.g. a 24-channel tensor of uniform shape and so on) are illustrative only, and other dimensions and uniform tensor shapes are also envisaged. Further, pixel unshuffle is only one reshaping operation that may be used. The above approach may be generalised to any tensor reshaping operations that achieve the same effect. Thus the above algorithm may be written in general pseudocode form as:
Algorithm 5 Generalized Reshape and Permute
function reshape_and_permute_tensor(input_tensor, spatial_dims, block_sizes)
  input_shape ← shape(input_tensor)
  num_spatial_dims ← length(spatial_dims)
  ⊲ Build the intermediate shape: each spatial dimension d_i is split into (d_i // b_i, b_i)
  output_shape ← []
  for i ← 1 to num_spatial_dims do
    dim_index ← spatial_dims[i]
    block_size ← block_sizes[i]
    output_shape.append(input_shape[dim_index] // block_size)
    output_shape.append(block_size)
  end for
  for i ← 1 to length(input_shape) do
    if i not in spatial_dims then
      output_shape.append(input_shape[i])
    end if
  end for
  reshaped_tensor ← reshape(input_tensor, output_shape)
  ⊲ Permute so the reduced spatial dimensions come first, followed by the block dimensions and then the remaining (e.g. channel) dimensions
  quotient_dims ← [2i − 1 for i ← 1 to num_spatial_dims]
  block_dims ← [2i for i ← 1 to num_spatial_dims]
  remaining_dims ← [i for i ← 1 to length(output_shape) if i not in quotient_dims and i not in block_dims]
  permute_dims ← quotient_dims + block_dims + remaining_dims
  permuted_tensor ← permute(reshaped_tensor, permute_dims)
  ⊲ Merge the block dimensions into the channel dimension
  final_shape ← [output_shape[2i − 1] for i ← 1 to num_spatial_dims] + [prod(block_sizes) × input_shape[length(input_shape)]]
  output_tensor ← reshape(permuted_tensor, final_shape)
  Return: output_tensor
end function
That is, in general form, the following input information is provided: an input tensor of any dimensions, with shape d1, d2, ..., dn, channels, where d1, d2, ..., dn are the spatial dimensions and channels is the channel dimension; a variable spatial_dims which is a list specifying the indices of the spatial dimensions to be reshaped; and a variable block_sizes which is a list specifying the block size for each spatial dimension. The shape of the input tensor is determined and the number of spatial dimensions to be reshaped is calculated. An empty list, output_shape, is initialised to store the shape of the intermediate reshaped tensor. We iterate over the spatial dimensions specified in spatial_dims and update the output_shape list by dividing each spatial dimension by its corresponding block size and appending the block size as a new dimension. The remaining dimensions (non-spatial dimensions) from the input tensor are appended to the output_shape list and we reshape the input tensor according to the output_shape to obtain the reshaped_tensor. We create a list permute_dims to specify the permutation order for the dimensions of the reshaped_tensor: we iterate over the spatial dimensions and append the corresponding reduced dimension indices to permute_dims, followed by the indices of the newly added block size dimensions. The remaining dimension indices (non-spatial dimensions) are appended to permute_dims and we permute the dimensions of the reshaped_tensor according to permute_dims to obtain the permuted_tensor. The final_shape is calculated by appending the reduced spatial dimensions from output_shape and the product of the block sizes multiplied by the number of channels, and the permuted_tensor is reshaped according to the final_shape to obtain the output_tensor. The output_tensor is then returned.
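As a concrete, hedged illustration of the specific YUV 4:2:0 case described above (block size 4 on the luma plane, block size 2 on each chroma plane, giving 24 channels), a PyTorch-style sketch might look as follows; the channel-first layout and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def yuv420_to_24_channels(y, u, v):
    # y: (b, 1, H, W) luma plane; u, v: (b, 1, H/2, W/2) chroma planes (4:2:0).
    # Space-to-depth with block size 4 on luma -> 16 channels at (H/4, W/4).
    y_blocks = F.pixel_unshuffle(y, downscale_factor=4)
    # Space-to-depth with block size 2 on each chroma plane -> 4 channels each,
    # also at (H/4, W/4) because the chroma planes start at half resolution.
    u_blocks = F.pixel_unshuffle(u, downscale_factor=2)
    v_blocks = F.pixel_unshuffle(v, downscale_factor=2)
    # Stack along the channel dimension to give a uniformly shaped 24-channel tensor.
    return torch.cat([y_blocks, u_blocks, v_blocks], dim=1)  # (b, 24, H/4, W/4)
```

The generalised reshape of Algorithm 5 covers this sketch as a special case, and the resulting 24-channel tensor can be fed into the flow and/or residual networks in the usual way.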
The above generalised approach accordingly provides a framework for distributing pixels from the spatial dimension into channel dimensions to increase the receptive window of the subsequent convolution operations of the AI-based compression pipeline in a way that is compute efficient. The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal. The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose
logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be
performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and
DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Claims
4. The method of claim 3, wherein the subset comprises the luma channel.
5. The method of claim 4, wherein the luma channel and the at least one chroma channel are
defined in a YUV colour space.
6. The method of claim 5, wherein the at least one chroma channel has a different resolution to
the luma channel.
7. The method of claim 6, wherein the YUV colour space comprises a YUV 4:2:0 or YUV
4:2:2 colour space.
8. The method of any of claims 1 to 7, wherein using the first image and the second image to
produce a latent representation of optical flow information comprises:
producing a representation of optical flow information at a plurality of resolutions and
using the representation of optical flow information at the plurality of resolutions to produce
said latent representation of optical flow information.
9. The method of claim 8, wherein a representation of optical flow information at a first
resolution of said plurality of resolutions is based on a representation of optical flow information
at a second resolution of said plurality of resolutions.
10. The method of any of claims 8 to 9, wherein using the first image and the second image to
produce a latent representation of optical flow information comprises:
using the representation of optical flow information at one of said plurality of resolutions
to warp a representation of the first image at a different resolution of said plurality of resolutions.
11. The method of claim 10, wherein a representation of optical flow information at a different one of said plurality of resolutions is based on the warped first image and the second image at one of said plurality of resolutions. 12. The method of any of claims 10 to 11, wherein the representation of optical flow information is estimated using the subset of the plurality of channels, and wherein the warping is performed using the plurality of channels. 13. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; wherein the first image and the second image comprise data arranged in a plurality of image channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises using a subset of the plurality of channels. 14. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation
of optical flow information being produced with a first neural network using a subset of the plurality of channels; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image. 15. A data processing apparatus configured to perform the method of any of claims 1 to 14. 16. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 1 to 14. 17. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 1 to 14. 18. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow
information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information. 19. The method of claim 18, wherein said downsampling comprises downsampling the representation of the optical flow information to a resolution of the respective chroma channels of the first image and the second image. 20. The method of claim 18 or 19, comprising warping the data in the respective chroma channels using the downsampled representation of the optical flow information. 21. The method of any of claims 18 to 20, comprising warping the data in the respective luma channels using the downsampled representation of the optical flow information. 22. The method of any of claims 18 to 21, wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of optical flow information from the respective luma channels
at a plurality of resolutions and using the representation of optical flow information at the plurality of resolutions to produce said latent representation of optical flow information. 23. The method of claim 22, wherein a representation of optical flow information at a first resolution of said plurality of resolutions is based on a representation of optical flow information at a second resolution of said plurality of resolutions. 24. The method of claim 23, when dependent on claim 20 or 21, wherein the representation of optical flow information at one or more of said plurality of resolutions is based on said warped data. 25. The method of any of claims 18 to 24, wherein the respective luma channels and chroma channels are defined in a YUV colour space. 26. The method of claim 25, wherein the respective chroma channels have a different resolution to the luma channel. 27. The method of claim 26, wherein the YUV colour space comprises a YUV 4:2:0 or YUV 4:2:2 colour space. 28. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information using the first image and the second image, the optical flow information being indicative of a
difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first image and the second image comprise data arranged in respective luma channels and chroma channels; and wherein using the first image and the second image to produce a latent representation of optical flow information comprises: producing a representation of the optical flow information based on the respective luma channels of the first image and the second image; and downsampling the representation of the optical flow information.

29. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in respective luma channels and chroma channels, the latent representation of optical flow information being produced with a first neural network using a downsampled representation of the optical flow information based on the respective luma channels of the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
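For illustration of the luma-driven motion handling recited in claims 18 to 21, the following is a minimal PyTorch sketch rather than the claimed implementation: `flow_net` is a hypothetical stand-in for the first neural network, and the halving of the flow field (both its resolution and its vector values) assumes a 4:2:0 layout in which the chroma planes have half the luma resolution in each dimension.

```python
import torch
import torch.nn.functional as F

def warp(plane, flow):
    """Backward-warp a single-channel plane (N,1,H,W) with a dense flow field (N,2,H,W) in pixels."""
    n, _, h, w = plane.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(plane.device)       # (2,H,W), x then y
    coords = base.unsqueeze(0) + flow                                   # (N,2,H,W)
    # Normalise to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                    # (N,H,W,2)
    return F.grid_sample(plane, grid, align_corners=True)

def luma_flow_and_chroma_warp(y_cur, y_ref, u_ref, v_ref, flow_net):
    # 1. Estimate a dense flow field from the luma planes only (full resolution).
    flow_y = flow_net(torch.cat((y_cur, y_ref), dim=1))                 # (N,2,H,W)
    # 2. Downsample the flow to the chroma resolution (half of luma for 4:2:0)
    #    and rescale the vectors, which are measured in luma pixels.
    flow_c = 0.5 * F.interpolate(flow_y, scale_factor=0.5,
                                 mode="bilinear", align_corners=False)  # (N,2,H/2,W/2)
    # 3. Warp the reference chroma planes with the downsampled flow.
    u_warped = warp(u_ref, flow_c)
    v_warped = warp(v_ref, flow_c)
    return flow_y, flow_c, u_warped, v_warped
```

The warped chroma planes correspond to the step of claim 20; the same downsampled field could equally be applied to downsampled luma data, along the lines of claim 21.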
30. A data processing apparatus configured to perform the method of any of claims 18 to 29.

31. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 18 to 29.

32. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 18 to 29.

33. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
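As a schematic of the three-network arrangement recited in claim 33, the sketch below uses single toy convolutions in place of each network and a uniform pixel unshuffle as one possible way of distributing spatial information into the channel dimension; the layer shapes, the factor of 2 and the flow-scaling convention are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeNetworkSketch(nn.Module):
    """Toy stand-ins for the first (encoder), second (decoder) and third (synthesis) networks."""

    def __init__(self, unshuffle=2, img_ch=3, latent_ch=8):
        super().__init__()
        in_ch = img_ch * unshuffle ** 2                                      # channels after space-to-depth
        self.unshuffle = unshuffle
        self.flow_encoder = nn.Conv2d(2 * in_ch, latent_ch, 3, padding=1)    # "first" network
        self.flow_decoder = nn.Conv2d(latent_ch, 2, 3, padding=1)            # "second" network
        self.synthesis = nn.Conv2d(img_ch + 2, img_ch, 3, padding=1)         # "third" network

    def forward(self, img_cur, img_ref):
        # Distribute spatial information of each image into the channel dimension.
        x_cur = F.pixel_unshuffle(img_cur, self.unshuffle)
        x_ref = F.pixel_unshuffle(img_ref, self.unshuffle)
        # Encode a latent representation of the optical flow information (transmitted in practice).
        latent = self.flow_encoder(torch.cat((x_cur, x_ref), dim=1))
        # Decode an approximation of the flow and return it to image resolution.
        flow_hat = F.interpolate(self.flow_decoder(latent),
                                 scale_factor=self.unshuffle, mode="bilinear",
                                 align_corners=False) * self.unshuffle
        # Produce an approximation of the first image from the flow and the reference image.
        out = self.synthesis(torch.cat((img_ref, flow_hat), dim=1))
        return out, latent

# Example usage on random 64x64 images.
model = ThreeNetworkSketch()
out, latent = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```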
34. The method of claim 33, wherein the information comprises pixel values of said plurality of image channels.

35. The method of claim 33 or 34, wherein the plurality of image channels comprises a subset of channels in a first spatial resolution different to a spatial resolution of the other channels in the plurality of image channels.

36. The method of any of claims 33 to 35, wherein the transformed data comprises a plurality of image channels each having a same spatial resolution.

37. The method of claim 36, wherein said same spatial resolution is lower than the spatial resolution of said subset of channels.

38. The method of claim 36 or 37, wherein said same spatial resolution is lower than the spatial resolution of said other channels of the plurality of channels.

39. The method of any of claims 33 to 38, wherein the data comprises 3-channel YUV data and wherein said transforming comprises transforming the 3-channel YUV data into 24-channel data.

40. The method of claim 39, wherein the YUV data comprises one of 4:2:0 YUV data or 4:2:2 YUV data.

41. The method of any of claims 33 to 40, wherein said transforming comprises performing a pixel unshuffle operation on the data.
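One way to obtain the 24-channel arrangement of claims 39 and 41 from 4:2:0 YUV data is a per-plane pixel unshuffle with a larger block size for the full-resolution luma plane than for the half-resolution chroma planes (in keeping with the per-channel block sizes of claim 42). The sketch below assumes a 4×4 luma block and 2×2 chroma blocks, and does not cover 4:2:2 data.

```python
import torch
import torch.nn.functional as F

def yuv420_to_24ch(y, u, v):
    """Pack 4:2:0 YUV planes into a single 24-channel tensor at a common (quarter) resolution.

    y: (N, 1, H, W) luma plane; u, v: (N, 1, H/2, W/2) chroma planes.
    Block size 4 for luma and 2 for chroma is one possible choice that makes all
    resulting channels share the same H/4 x W/4 spatial resolution.
    """
    y16 = F.pixel_unshuffle(y, 4)             # (N, 16, H/4, W/4)
    u4 = F.pixel_unshuffle(u, 2)              # (N, 4,  H/4, W/4)
    v4 = F.pixel_unshuffle(v, 2)              # (N, 4,  H/4, W/4)
    return torch.cat((y16, u4, v4), dim=1)    # (N, 24, H/4, W/4)

# Example: a 64x64 4:2:0 frame becomes a 24-channel 16x16 tensor.
y = torch.rand(1, 1, 64, 64)
u = torch.rand(1, 1, 32, 32)
v = torch.rand(1, 1, 32, 32)
packed = yuv420_to_24ch(y, u, v)
assert packed.shape == (1, 24, 16, 16)
```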
42. The method of claim 41, wherein the pixel unshuffle operation is defined by a first block size for a first image channel of the data, and defined by a second block size for a second image channel of the data.

43. The method of any of claims 33 to 40, wherein said transforming comprises performing a convolution operation on the data.

44. The method of claim 43, wherein the convolution operation is defined by a first stride for a first image channel of the data, and defined by a second stride for a second image channel of the data.

45. The method of any of claims 33 to 40, wherein said transforming comprises upsampling a subset of said plurality of image channels to produce a plurality of image channels each having a same spatial resolution.

46. The method of any of claims 33 to 45, wherein one or more of the first, second or third neural networks is defined by a convolution operation, and wherein said transforming increases a receptive field of the convolution operation.

47. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system.

48. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system, the first image and the second image comprising data arranged in a respective plurality of image channels; transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a first neural network, producing a latent representation of optical flow information using the transformed data, the optical flow information being indicative of a difference between the first image and the second image; receiving a latent representation of optical flow information at a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.
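The convolution-based alternative of claims 43 and 44 can be pictured as applying a strided convolution to each plane, with a per-plane stride chosen so that all output feature maps share one resolution; the channel counts and kernel sizes below are arbitrary toy values, and the upsampling route of claim 45 is not shown.

```python
import torch
import torch.nn as nn

class StridedPack(nn.Module):
    """Per-plane strided convolutions as an alternative packing of 4:2:0 YUV data."""

    def __init__(self, feat=16):
        super().__init__()
        # Stride 4 for the full-resolution luma plane and stride 2 for the half-resolution
        # chroma planes, so every output feature map has the same H/4 x W/4 resolution.
        self.conv_y = nn.Conv2d(1, feat, kernel_size=4, stride=4)
        self.conv_u = nn.Conv2d(1, feat // 4, kernel_size=2, stride=2)
        self.conv_v = nn.Conv2d(1, feat // 4, kernel_size=2, stride=2)

    def forward(self, y, u, v):
        return torch.cat((self.conv_y(y), self.conv_u(u), self.conv_v(v)), dim=1)

# Example on a 64x64 4:2:0 frame: output has 16 + 4 + 4 = 24 channels at 16x16.
pack = StridedPack()
out = pack(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 32, 32), torch.rand(1, 1, 32, 32))
assert out.shape == (1, 24, 16, 16)
```

Unlike the pixel unshuffle, the strided convolutions have learnable weights, but both produce a channel-rich tensor at a common, lower spatial resolution.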
49. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image each comprising data arranged in a plurality of image channels, the latent representation of optical flow information being produced with a first neural network and by transforming the data by distributing information of the image channels from a spatial dimension into a channel dimension; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; and with a third neural network, producing an output image using the optical flow information, wherein the output image is an approximation of the first image.

50. A data processing apparatus configured to perform the method of any of claims 33 to 49.

51. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 33 to 49.

52. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 33 to 49.
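The receptive-field effect noted in claim 46 can be checked empirically: after a 4×4 pixel unshuffle, a single output position of a 3×3 convolution depends on a 12×12 pixel neighbourhood of the original plane rather than 3×3. The gradient-footprint sketch below (toy layer sizes, random weights) illustrates this point; it is not part of the claimed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A 3x3 convolution normally sees a 3x3 pixel neighbourhood per output position.
# After a 4x4 pixel unshuffle it sees 3x3 blocks of 4x4 pixels, i.e. 12x12 pixels.
x = torch.rand(1, 1, 64, 64, requires_grad=True)
conv = nn.Conv2d(16, 8, kernel_size=3, padding=1)

y = conv(F.pixel_unshuffle(x, 4))          # (1, 8, 16, 16)
y[0, 0, 8, 8].backward()                   # pick one interior output position

footprint = (x.grad[0, 0] != 0)            # which input pixels influenced that output
rows = footprint.any(dim=1).sum().item()
cols = footprint.any(dim=0).sum().item()
print(rows, cols)                          # 12 12 (barring exactly-zero random weights)
```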
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2403951.3A GB202403951D0 (en) | 2024-03-20 | 2024-03-20 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2403951.3 | 2024-03-20 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025196024A1 (en) | 2025-09-25 |
Family
ID=90825979
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2025/057322 (published as WO2025196024A1, pending) | Method and data processing system for lossy image or video encoding, transmission and decoding | 2024-03-20 | 2025-03-18 |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB202403951D0 (en) |
| WO (1) | WO2025196024A1 (en) |
- 2024-03-20: GB application GBGB2403951.3A filed, published as GB202403951D0 (status: Ceased)
- 2025-03-18: PCT application PCT/EP2025/057322 filed, published as WO2025196024A1 (status: Pending)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| US20220279183A1 (en) * | 2020-04-29 | 2022-09-01 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| US20220385907A1 (en) * | 2021-05-21 | 2022-12-01 | Qualcomm Incorporated | Implicit image and video compression using machine learning systems |
| US11936866B2 (en) * | 2022-04-25 | 2024-03-19 | Deep Render Ltd. | Method and data processing system for lossy image or video encoding, transmission and decoding |
Non-Patent Citations (4)
| Title |
|---|
| AGUSTSSON, E.; MINNEN, D.; JOHNSTON, N.; BALLÉ, J.; HWANG, S. J.; TODERICI, G.: "Scale-space flow for end-to-end optimized video compression", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pages 8503-8512 |
| BALLÉ, J. ET AL.: "Variational image compression with a scale hyperprior", arXiv preprint arXiv:1802.01436, 2018 |
| MENTZER, F.; AGUSTSSON, E.; BALLÉ, J.; MINNEN, D.; JOHNSTON, N.; TODERICI, G.: "Neural video compression using GANs for detail synthesis and propagation", Computer Vision - ECCV 2022: 17th European Conference, 23 October 2022 (2022-10-23), pages 562-578 |
| POURREZA, R.; COHEN, T.: "Extending neural P-frame codecs for B-frame coding", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pages 6680-6689 |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202403951D0 (en) | 2024-05-01 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| TWI834087B (en) | Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product | |
| TWI850806B (en) | Attention-based context modeling for image and video compression | |
| US12309422B2 (en) | Method for chroma subsampled formats handling in machine-learning-based picture coding | |
| EP4445609A1 (en) | Method of video coding by multi-modal processing | |
| JP7591338B2 (en) | Decoding using signaling of segmentation information | |
| US12113985B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| JP2024511587A (en) | Independent placement of auxiliary information in neural network-based picture processing | |
| JP2024513693A (en) | Configurable position of auxiliary information input to picture data processing neural network | |
| WO2024170794A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| CN116508320A (en) | Chroma Subsampling Format Processing Method in Image Decoding Based on Machine Learning | |
| WO2025082896A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding using image comparisons and machine learning | |
| WO2024140849A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2025103602A1 (en) | Method and apparatus for video compression using skip modes | |
| WO2025196024A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025061586A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025002424A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193710A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193708A1 (en) | Method, apparatus, and medium for visual data processing | |
| US12513297B2 (en) | Method for chroma subsampled formats handling in machine-learning-based picture coding | |
| WO2025082523A1 (en) | Method, apparatus, and medium for visual data processing | |
| WO2024193709A9 (en) | Method, apparatus, and medium for visual data processing | |
| WO2025168485A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2024246275A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025088034A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding | |
| WO2025162929A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 25717569; Country of ref document: EP; Kind code of ref document: A1 |