WO2025119707A1 - Method and data processing system for lossy image or video encoding, transmission and decoding
- Publication number
- WO2025119707A1 (PCT/EP2024/083641)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- neural network
- input
- output
- input image
- Prior art date
- Legal status: Pending
Classifications
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06T9/002—Image coding using neural networks
- H04N19/117—Filters, e.g. for pre-processing or post-processing
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
Definitions
- This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
- the compression of image and video content can be lossless or lossy.
- in lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression.
- with lossless compression, however, there is a limit to the reduction in data quantity that can be achieved.
- in lossy compression, some information is lost from the image or video during the compression process.
- known compression techniques attempt to minimise the apparent loss of information by removing information whose absence results in changes to the decompressed image or video that are not particularly noticeable to the human visual system.
- JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
- known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression.
- I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types and thus carry the most data. However, their independence provides several benefits, including serving as the starting point for decompression and enabling random access, which is crucial for functionalities like fast-forwarding or rewinding the video.
- P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression.
- a P-frame represents the difference between itself and the closest preceding I- or P-frame.
- the process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames depend on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
- B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process.
- by predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, decoding errors do not propagate from B-frames since they are not used as references for other frames.
- Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image or video and the compressed and decompressed image or video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content.
- AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
- An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference.
- An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression.
- Figure 3 of which shows an architecture that calculates optical flow with a flow model, UFlow, and encodes the calculated optical flow with a flow encoder, Eflow.
- the method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
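- For illustration only, the following minimal PyTorch-style sketch shows how a single training step of this kind might be organised; the encoder and decoder modules, the bilinear resampling, the MSE-based differences and the optimiser are assumptions made for the example rather than features taken from the method itself.

```python
# Minimal sketch of one training step (PyTorch assumed; `encoder` and `decoder`
# stand in for the first and second neural networks).
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, optimizer, x, factor=2):
    # Downsample the input image (bilinear downsampling is one possible choice).
    x_down = F.interpolate(x, scale_factor=1.0 / factor, mode="bilinear",
                           align_corners=False)
    y = encoder(x_down)              # latent representation
    x_hat = decoder(y)               # output image approximating the downsampled input
    # Upsample the output image back to the input resolution.
    x_up = F.interpolate(x_hat, scale_factor=factor, mode="bilinear",
                         align_corners=False)
    # Evaluate a function based on differences between the image pairs.
    loss = F.mse_loss(x_hat, x_down) + F.mse_loss(x_up, x)
    optimizer.zero_grad()
    loss.backward()                  # gradients for the first and second networks
    optimizer.step()
    return loss.item()
```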
- the method described above may comprise a third neural network for upsampling, and wherein the method may include updating the parameters of said third neural network based on the evaluated function.
- the method described above may comprise a downsampler configured for either bilinear or bicubic downsampling.
- the method described above may comprise a Gaussian blur filter in the downsampler.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of steps to produce the first and second trained neural networks, without performing said upsampling and downsampling and without updating the parameters of the third neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network, for a second number of said steps.
- the method described above may comprise a fourth neural network in the downsampler, and may further include updating the parameters of said fourth neural network based on the evaluated function.
- the method described above may comprise an upsampler configured for either bilinear or bicubic upsampling.
- the method as described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks, without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps.
- the method described above may comprise entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length.
- the method described above may comprise determining the difference between one or more of the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image based on the output of a fifth neural network acting as a discriminator.
- the method of any as described above may comprise calculating the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image.
- the difference is expressed in terms of a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
- the method described above may comprise a term defining a visual perceptual metric.
- the method described above may comprise a visual perceptual metric, wherein the term defining the metric comprises an MS-SSIM metric.
- a method of training one or more neural networks comprising receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image, upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network, evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image, updating the parameters of the third neural network based on the evaluated function, and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
- a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding comprises receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image approximating the input image. Further steps involve evaluating a function based on differences between various images and updating the parameters of the fourth neural network based on the evaluated function.
- the method described above may comprise producing the previously downsampled input image by performing bilinear or bicubic downsampling on the input image. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding.
- the method comprises the steps of receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
- a method for lossy image or video encoding, transmission and decoding comprising the steps of receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation.
- the method described above may comprise performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
- the method as described above may optionally comprise performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
- the method as described above may optionally comprise a downsampling operation that includes applying one or more convolutional layers with a kernel size based on a downsampling factor.
- the input to the downsampling operation may comprise the input image.
- the input to the downsampling operation may comprise a tensor representation of the input image.
- the method described above may comprise a downsampling operation performed by applying one or more convolutional layers configured with a stride equal to the downsampling factor.
- the number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor.
- the method may include performing a first convolution and a second deconvolution. This may further involve performing additional upsampling steps and utilizing additional layers such as MaxPool and ReLU.
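- A minimal sketch of a strided-convolution downsampler of the kind described above is given below, assuming PyTorch; the exact relationship between the downsampling factor, kernel size and number of filters shown here is an illustrative assumption.

```python
import torch.nn as nn

def make_conv_downsampler(in_channels=3, factor=2):
    # Downsampling by strided convolution, with no space-to-depth operation:
    # the stride equals the downsampling factor and the kernel size is based on it.
    out_channels = in_channels * factor * factor   # assumed channel scaling
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
    )
```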
- the method described above may comprise the input including a latent representation.
- the method may comprise a tensor representation of the latent representation or the output image as part of its input.
- the method as described may include upsampling layers having strides determined by an upsampling factor.
- the method described above may further comprise applying an activation function after each convolutional layer in the upsampling operation.
- the method described above may include the upsampling layers being selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, alternated with the convolutional layers.
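- The corresponding upsampling operation might be sketched as follows (again assuming PyTorch); here nearest-neighbour upsampling layers are alternated with convolutional layers, with an activation applied after each convolution, and the layer and channel counts are illustrative.

```python
import torch.nn as nn

def make_upsampler(channels=3, factor=2):
    # Nearest-neighbour upsampling alternated with convolutions, an activation
    # after each convolutional layer.
    return nn.Sequential(
        nn.Upsample(scale_factor=factor, mode="nearest"),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(),
    )
```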
- a method for lossy image or video encoding, transmission and decoding comprising the steps of receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image.
- the estimating of optical flow information further comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value.
- the method comprises estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation.
- the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image.
- a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image.
- a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image comprise sparse distributions in a spatial domain.
- the method described above may comprise assigning weights with values distributed according to a sub-Gaussian distribution.
- the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and/or pixel radius across which said differences are estimated.
- the method described above may comprise performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation.
- the method described above may comprise estimating a difference between an output of the second convolution operation and an output of the first convolution operation.
- the difference may comprise an absolute difference.
- the difference defines a cost volume.
- the method described above may comprise using the optical flow information to warp a representation of the second image.
- the method may involve estimating a difference between the warped second image and the input image in order to create a residual representation of the input image relative to the warped second image.
- the method described above may comprise: (i) using a third neural network to encode the residual representation of the input image; (ii) transmitting the encoded residual representation of the input image to the second computer system; (iii) using a fourth neural network to decode the residual representation of the input image; and (iv) using the decoded residual representation of the input image to produce said output image.
- the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
- a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation.
- the first convolution operation is defined by a 1x1 kernel.
- the second convolution operation is defined by a 3x3 kernel.
- the third convolution operation is defined by a 1x1 kernel.
- the method described above may comprise performing the second convolution operation to entangle information associated with respective pixels of the representation of the input image with information associated with pixels adjacent corresponding pixels in the representation of the second image.
- the method described above may comprise a first, second, and where present third convolution operation, wherein these operations are performed without group convolutions.
- one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks.
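- One way the compressively encoded cost volume described above could be sketched in PyTorch is shown below; the number of projection channels, the weight bound and the decision to fix the random projection are illustrative assumptions rather than details taken from the method.

```python
import torch
import torch.nn as nn

class CompressiveCostVolume(nn.Module):
    # Sketch: a fixed 1x1 random projection (first convolution), a 3x3 convolution
    # that entangles neighbouring pixels (second convolution), and an absolute
    # difference between the two outputs defining the cost volume.
    def __init__(self, in_channels=3, proj_channels=32, bound=0.1):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, proj_channels, kernel_size=1, bias=False)
        nn.init.uniform_(self.proj.weight, -bound, bound)   # weights in [min, max]
        self.proj.weight.requires_grad_(False)               # fixed random projection
        self.neighbour = nn.Conv2d(proj_channels, proj_channels, kernel_size=3,
                                   padding=1, bias=False)

    def forward(self, input_frame, second_frame):
        p_in = self.proj(input_frame)     # 1x1 projection of the input image
        p_sec = self.proj(second_frame)   # 1x1 projection of the second image
        mixed = self.neighbour(p_sec)     # mix in adjacent pixels of the second image
        return torch.abs(p_in - mixed)    # absolute difference -> cost volume
```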
- a distribution of pixel values of the input image and of the second image are sparse and incoherent in a spatial domain and/or a transform of a spatial domain.
- a system configured to perform any of the above methods.
- a method for lossy image or video encoding and transmission includes receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system.
- estimating the optical flow information comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value.
- a method for lossy image or video decoding comprising receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image based on a compressively encoded cost volume; receiving optical flow information at a second computer system, wherein the optical flow information is indicative of a difference between a representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, which approximates the input image.
- an apparatus configured to perform any of the above methods.
- a method for estimating a difference between a first image and a second image comprises performing a first convolution operation on respective pixels of a representation of the first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by estimating a compressively encoded cost volume indicative of said differences.
- the method may include performing a second convolution operation on an output of the first convolution operation, and estimating a difference between an output of the second convolution operation and an output of the first convolution operation.
- the method described above may comprise performing a second convolution operation that entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent corresponding pixels in the representation of the second image.
- the method described above may include a first convolution operation where one or more filters are applied with weights having values randomly distributed between a minimum value and a maximum value.
- the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and pixel radius across which said differences are estimated.
- the method described above may comprise a difference comprising an absolute difference.
- the method described above may comprise defining a cost volume based on the difference.
- the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
- the method described may involve adjusting a kernel size of the second convolution operation to be larger than that of the first.
- the method described above may comprise a first convolution operation defined by a 1x1 kernel.
- the method described above may include the step whereby the second convolution operation is defined by a 3x3 kernel.
- the method described above may comprise the third convolution operation defined by a 1x1 kernel.
- the method described above may comprise storing a plurality of respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks.
- the method described above may comprise using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches.
- a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and third trained neural networks.
- the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks.
- at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable.
- the method comprises producing the second variables from the first variables by mapping the first variables to the second variables.
- the mapping is defined by an auxiliary function.
- the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function.
- at least one input of said inputs to the auxiliary function is also an output of the auxiliary function.
- the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space.
- the input space matches the output space.
- the auxiliary function is based on the third neural network.
- the third neural network comprises a residual decoder neural network.
- the at least one input to the auxiliary function that is also an output of the auxiliary function comprises said latent representation of the first image.
- the method comprises weighting the Jacobian penalty term.
- said weighting is based on a difference between the first image and the second image.
- said weighting is defined by a weighted norm based on a matrix associated with said rate of change.
- the method comprises estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change.
- approximating the norm of the matrix comprises making a single sample approximation.
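- A single-sample approximation of a Jacobian norm of this kind could be sketched as below (PyTorch assumed); the finite-difference probe direction and step size are illustrative choices, not details of the claimed method.

```python
import torch

def jacobian_penalty(aux_fn, x, eps=1e-3):
    # Probe the rate of change of aux_fn with a single random unit direction v:
    # (aux_fn(x + eps*v) - aux_fn(x)) / eps approximates the Jacobian-vector product.
    v = torch.randn_like(x)
    v = v / (v.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    jvp = (aux_fn(x + eps * v) - aux_fn(x)) / eps
    return jvp.flatten(1).pow(2).sum(dim=1).mean()  # single-sample squared-norm estimate
```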
- the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps.
- said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images.
- a method of performing lossy image or video encoding, transmission and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to any of the methods described above.
- a method of performing lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to any of the methods described above.
- a method of performing lossy image decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to any of the methods described above.
- a method of performing lossy image or video decoding comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to any of the methods described above.
- a data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of the above methods.
- Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
- Figure 4 illustrates an example of a video compression, transmission and decompression system.
- Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 6 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 7 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 8 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 9 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline.
- Figure 10 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline.
- Figure 11 illustrates an example of how optical flow information may be calculated between two images.
- Figure 12a illustrates steps of a MAD cost volume calculation.
- Figure 12b illustrates steps of a MAD cost volume calculation.
- Figure 12c illustrates steps of a MAD cost volume calculation.
- Figure 13 illustrates steps of an RKADe cost volume calculation.
- Figure 14 illustrates an example of an image or video compression, transmission and decompression pipeline.
- DETAILED DESCRIPTION OF THE DRAWINGS
- Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information.
- Image and video information is an example of information that may be compressed.
- the file size required to store the information, particularly when referring to the compressed file produced by a compression process, may be referred to as the rate.
- compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file.
- Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x.
- the data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image.
- Each of the H × W data points of the image represents a pixel value of the image at the corresponding location.
- Each channel of the C channels of the image represents a different component of the image for each pixel, and these components are combined when the image file is displayed by a device.
- an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively.
- the image information is stored in the RGB colour space, which may also be referred to as a model or a format.
- colour spaces or formats include the CMYK and the YCbCr colour models.
- the channels of an image file are not limited to storing colour information and other information may be represented in the channels.
- as a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video.
- Each image making up a video may be referred to as a frame of the video.
- the output image may differ from the input image and may be represented by x̂.
- the difference between the input image and the output image may be referred to as distortion or a difference in image quality.
- the distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way.
- the distortion function may comprise a trained neural network.
- the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner.
- a relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
- AI based compression processes may involve the use of neural networks.
- a neural network is an operation that can be performed on an input to produce an output.
- a neural network may be made up of a plurality of layers.
- the first layer of the network receives the input.
- One or more operations may be performed on the input by the layer to produce an output of the first layer.
- the output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way.
- the output of the final layer is the output of the neural network.
- Each layer of the neural network may be divided into nodes.
- Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer.
- Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer.
- the one or more operations may include a convolution, a weight, a bias and an activation function.
- Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
- Each of the one or more operations is defined by one or more parameters that are associated with each operation.
- the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer.
- each of the values in the weight matrix is a parameter of the neural network.
- the convolution may be defined by a convolution matrix, also known as a kernel.
- one or more of the values in the convolution matrix may be a parameter of the neural network.
- the activation function may also be defined by values which may be parameters of the neural network.
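- As a purely illustrative example of the operations and parameters described above, the short PyTorch snippet below builds a single layer with a weight matrix, a bias and an activation function and counts its learnable parameters; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# A single layer as described above: a weight matrix, a bias and an activation.
layer = nn.Sequential(nn.Linear(in_features=8, out_features=4), nn.ReLU())

x = torch.randn(1, 8)   # input received from the nodes of the previous layer
out = layer(x)          # each output node combines weighted inputs plus a bias

# The 4x8 weight matrix and the 4-element bias vector are the learnable parameters.
print(sum(p.numel() for p in layer.parameters()))  # -> 36
```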
- the parameters of the network may be varied during training of the network.
- Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
- a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known.
- the initial parameters of the neural network are randomized and the first training input is provided to the network.
- the output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced.
- This process is then repeated for a plurality of training inputs to train the network.
- the difference between the output of the network and the expected output may be defined by a loss function.
- the result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function.
- Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the parameters θ.
- a plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network.
- a system of this type, in which simultaneous training is performed with back-propagation through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression.
- an end-to-end learned system learns for itself during training which combination of parameters best achieves the goal of minimising the loss function.
- training means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
- the training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm.
- the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
- the training data may be divided into smaller subsets known as batches.
- the size of a batch, referred to as the batch size, may influence the training process.
- a smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources.
- a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
- the learnable parameters are updated by a specified amount each time, determined by the learning rate.
- the learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
- the training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data.
- Regularization techniques such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
- training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters.
- the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise.
- the point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend on, for example, a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately. In the case of AI based image or video compression, the loss function may be defined by the rate-distortion equation.
- the multiplier λ in the rate-distortion equation may be referred to as a Lagrange multiplier.
- the Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
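- For concreteness, the rate-distortion loss referred to above is often written in the form sketched below; the symbols are illustrative (R is the rate term estimated from the quantised latent ŷ, D is the distortion between the input image x and the output image x̂, and λ is the Lagrange multiplier), and the exact formulation used in any given embodiment may differ.

```latex
\mathcal{L} = R(\hat{y}) + \lambda \, D(x, \hat{x})
```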
- a training set of input images may be used.
- An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/~xypan/research/snr/Kodak.html).
- An example training set of input images is the IMAX image set.
- An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download).
- An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
- An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1.
- an input image 5 is provided.
- the input image 5 is provided to a trained neural network 110 characterized by a function f_θ acting as an encoder.
- the encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5.
- the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent.
- the quantisation process transforms the continuous latent representation into a discrete quantized latent.
- An example of a quantization process is a rounding function.
- the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130.
- the entropy encoding process may be for example, range or arithmetic encoding.
- the bitstream 130 may be transmitted across a communication network.
- the bitstream is entropy decoded in an entropy decoding process 160.
- the quantized latent is provided to another trained neural network 120 characterized by a function g_θ acting as a decoder, which decodes the quantized latent.
- the trained neural network 120 produces an output based on the quantized latent.
- the output may be the output image of the AI based compression process 100.
- the encoder-decoder system may be referred to as an autoencoder.
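- In simplified form, the encode/decode flow described above might be sketched as follows (PyTorch assumed); entropy encoding and decoding of the bitstream are abstracted away, and rounding stands in for the quantisation process.

```python
import torch

def compress(encoder, x):
    # Encoder side: produce and quantise the latent representation. The quantised
    # latent would then be entropy encoded (e.g. range or arithmetic coding)
    # into a bitstream for transmission.
    y = encoder(x)
    y_hat = torch.round(y)
    return y_hat

def decompress(decoder, y_hat):
    # Decoder side: the quantised latent recovered by entropy decoding the
    # bitstream is passed through the decoder to give the output image.
    return decoder(y_hat)
```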
- Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data.
- one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions.
- Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream.
- this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference. Something which is closely linked to the entropy encoding of the latent distribution, and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches, is the quantisation step.
- while a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back-propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system.
- a straight-through estimator (STE) quantisation model or other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
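- The two quantisation surrogates mentioned above can be sketched as follows (PyTorch assumed); these are standard formulations rather than the specific models used in any particular embodiment.

```python
import torch

def quantise_noise(y):
    # Additive-uniform-noise surrogate: differentiable everywhere, suitable for
    # evaluating the rate/entropy term during training.
    return y + (torch.rand_like(y) - 0.5)

def quantise_ste(y):
    # Straight-through estimator: rounds on the forward pass, but the gradient
    # flows through unchanged because the rounding offset is detached.
    return y + (torch.round(y) - y).detach()
```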
- end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function.
- the systems described above may be distributed across multiple locations and/or devices.
- the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server.
- the decoder 120 may be located on a separate device which may be referred to as a recipient device.
- the system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
- the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process.
- the hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder f_θ^h and a trained neural network 125 acting as a hyper-decoder g_θ^h.
- An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above.
- the neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110.
- the hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation.
- the hyper-latent is then quantized in a quantization process 145 characterised by Q_h to produce a quantized hyper-latent.
- the quantization process 145 characterised by Q_h may be the same as the quantisation process 140 characterised by Q discussed above.
- the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135.
- the bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent.
- the quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder.
- the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115.
- the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100.
- the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
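- A sketch of how the hyper-network might supply such entropy parameters is given below (PyTorch assumed); splitting the hyper-decoder output into a mean and a scale, and rounding as the quantisation, are illustrative assumptions.

```python
import torch

def entropy_parameters(hyper_encoder, hyper_decoder, y):
    # The hyper-latent is quantised and (conceptually) entropy encoded into a
    # side bitstream; the hyper-decoder output parameterises the probability
    # model used to entropy code the latent y.
    z = hyper_encoder(y)                  # hyper-latent representation
    z_hat = torch.round(z)                # quantised hyper-latent
    params = hyper_decoder(z_hat)
    mean, scale = params.chunk(2, dim=1)  # e.g. mean and standard deviation
    residual = y - mean                   # residual latent, optionally normalised by scale
    return mean, scale, residual
```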
- the residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent.
- the residual values may also be normalised.
- a training set of input images may be used as described above.
- the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step.
- the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
- the training process may further include a generative adversarial network (GAN).
- an additional neural network acting as a discriminator is included in the system.
- the discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake.
- the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input.
- a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
- the output image 6 may be provided to the discriminator.
- the output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
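- The adversarial part of such a training setup could be sketched as below (PyTorch assumed); a logit-producing discriminator and the standard binary cross-entropy formulation are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, x, x_hat):
    # Discriminator loss: score the ground-truth input highly and the output image low.
    real = discriminator(x)
    fake = discriminator(x_hat.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    # Compression-pipeline loss term: encourage the output image to be scored as
    # ground truth; this term can be added to the distortion part of the loss.
    g_score = discriminator(x_hat)
    g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
    return d_loss, g_loss
```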
- Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
- after training, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination.
- Hallucination is the process of adding information in the output image 6 that was not present in the input image 5.
- hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120.
- the hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
- the frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process.
- frames which are decoded using no information from other frames may be referred to as I-frames.
- Frames which are decoded using information from past frames may be referred to as P-frames.
- Frames which are decoded using information from past frames and future frames may be referred to as B-frames.
- Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
- the images represented by each frame of a video may be related.
- a number of frames in a video may show the same scene.
- a number of different parts of the scene may be shown in more than one of the frames.
- objects or people in a scene may be shown in more than one of the frames.
- the background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame.
- the transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation.
- the flow may be represented by a vector.
- One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
- An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3.
- the process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
- I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2.
- an input image for the I-frame is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204.
- the bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image which in this case is an I-frame.
- the decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed I-frame to be available for later use by components of both the encoding and decoding sides of the pipeline.
- P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images from an earlier frame t-1.
- the previously reconstructed image is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
- previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames.
- both the image being compressed and the previously reconstructed image from an earlier frame are passed into a flow module part 206 of the pipeline.
- the flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from the image being compressed and the previously reconstructed image, which is indicative of inter-frame movement of pixels or pixel groups between the two frames.
- the latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208.
- the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map.
- the reconstructed flow map is applied to the previously reconstructed image to generate a warped image.
- any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used.
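- As an illustration only, a minimal sketch of bilinear warping of a previously reconstructed image by a dense flow map is given below; the tensor layout, the use of grid_sample and the normalisation details are assumptions for the example rather than features required by the method.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Bilinearly warp `image` (B, C, H, W) by a dense flow map `flow` (B, 2, H, W).

    flow[:, 0] and flow[:, 1] hold per-pixel horizontal and vertical
    displacements in pixels.
    """
    b, _, h, w = image.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(image.device)  # (1, 2, H, W)
    coords = grid + flow
    # Normalise coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, mode="bilinear", align_corners=True)
```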
- the warped image is a prediction of how the previously reconstructed image might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of the image being compressed and the previously reconstructed image.
- the reconstructed flow map and corresponding warped image may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
- both the image being compressed and the warped image are passed into a residual module part 210 of the pipeline.
- the residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image and the warped image.
- the latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted.
- the bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map from the decoded latent representation.
- a residual map may first be pre-calculated between the input image and the warped image, and the pre-calculated residual map may be passed into an autoencoder for compression only.
- the residual map is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image, which is a reconstruction of the input image and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image can then be used to process the next frame. That is, it can be used to compress, transmit and decompress the frame at position t+1, and so on until an entire video stream or chunk of a video stream has been processed.
- the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image.
- any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2.
- the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable.
- the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689).
- the above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. an I-frame) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
- Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400.
- the system 400 comprises a first device 401 and a second device 402.
- the first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices.
- the system 400 of Figure 4 performs inference on a CPU (or NPU or TPU as applicable) of the first and second devices respectively. That is, the compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402.
- the CPU of first and second devices 401, 402 may comprise, for example, a Qualcomm Snapdragon CPU.
- the first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene.
- the video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames.
- the blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401, for example via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
- the second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
- the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU (or NPU or TPU) compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
- execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods.
- Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time.
- efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
- FIG. 5 illustrates an example of an image or video compression, transmission and decompres- sion pipeline 500.
- the pipeline illustrates a method of the present disclosure that corresponds to that described in relation to Figures 1 to 4. Like-numbered features correspond to those in Figures 1 to 4.
- the pipeline is further wrapped in a super-resolution wrapper. That is, the encoder is preceded by a downsampler 501, and the decoder is followed by an upsampler 502.
- the super-resolution wrapper 501, 502 is applied around the pipeline 500 during training, which comprises making the evaluated loss function be based on one or more terms from the list of: a difference between the output image and the input image, a difference between the output image and the downsampled input image, a difference between the upsampled output image and the input image, and/or a difference between the upsampled output image and the downsampled input image. Modifying the loss function in this way allows the pipeline to be "super-resolution" aware.
- the loss function comprises a term that is not just based on differences between the input to the encoder and the output of the decoder, but also or alternatively on the input and output of the downsampler 501 and/or the input and output of the upsampler 502.
- the weights of the neural networks of the pipeline 500, e.g. the encoder, the decoder, and/or the corresponding hyper-encoder and hyper-decoders, will be optimised to allow the encoder to produce a latent representation with a distribution that is optimally entropy encodable to hit low target bit rates while at the same time allowing the decoder to output images with distributions that can be optimally upsampled by the upsampler 502 to produce upsampled output images that are as close to the original input images as possible.
- making the neural compression pipeline 500 super-resolution aware in this way during training results in trained networks of the pipeline 500 that, when wrapped in the super-resolution down- and/or up-samplers 501, 502 during inference, produce output images that are closer to the input images than a network or networks that were not trained in a super-resolution aware manner.
- the method comprises receiving an input image at a first computer system, downsampling the input image with a downsampler 501 to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image (e.g. of the downsampled input image which in turn is an approximation of the original input image), upsampling the output image with an upsampler 502 to produce an upsampled output image, and evaluating a function (i.e. a loss function) based on one or more of the differences described above.
- the upsampler 502 may comprise a third neural network, and the method further comprises updating the parameters of the third neural network based on the evaluated function.
- the upsampler 602 is shown as a neural network.
- the upsampler neural network is responsible for upscaling the output image obtained from the second neural network back to its original size or resolution.
- the upsampler neural network may comprise a convolutional neural network architecture.
- the upsampler may be implemented using transposed convolutions or deconvolution layers. These layers perform the inverse operation of regular convolutions and can be used to increase the spatial resolution of an image.
- the upsampling process can be further enhanced using various techniques such as skip connections or residual connections.
- Skip connections allow for direct transmission of information from earlier layers in the network to later layers, bypassing some of the intermediate layers and thereby allowing the model to leverage detailed information present in the initial stages of processing. Residual connections add the output of a layer directly to the input of another layer, effectively performing addition or subtraction operations within the network.
- These techniques can improve the accuracy and stability of the neural network-based upsampler by allowing it to better capture fine details in the image. More specifically, a neural network based upsampler such as ⁇ can be trained together, e.g. in an end-to-end manner with the trainable neural networks of the neural compression pipeline 600, making the entire pipeline super-resolution aware. This is an important distinction compared to simply applying a super resolution wrapper to a neural or traditional compression pipeline.
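- As a non-limiting illustration of such an upsampler, a minimal sketch using a transposed convolution and a bilinear skip connection is given below; the 2x scale factor, the layer widths and the use of a bilinear skip path are assumptions for the example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerNet(nn.Module):
    """Example 2x upsampler: a learned transposed-convolution path plus a
    bilinear skip connection carrying the low-frequency content."""

    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.head = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose2d(hidden, hidden, kernel_size=4, stride=2, padding=1)
        self.tail = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x):
        skip = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        h = F.relu(self.head(x))
        h = F.relu(self.up(h))          # spatial dimensions doubled
        return self.tail(h) + skip      # skip/residual connection
```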
- the downsampler may be a traditional downsampler, for example it is envisaged that the downsampler may comprise a bilinear or bicubic downsampler.
- Bilinear and bicubic downsampling are exemplary methods used for image resizing. They involve reducing the resolution of an input image, for example by a factor of 2x2 (e.g., from 100x100 to 50x50). Further exemplary details of bilinear and bicubic downsampling are provided below.
- Bilinear Downsampling In bilinear downsampling, the algorithm approximates each output pixel value from the intensity values of the surrounding pixels in the input image. This method assumes that the pixel intensities vary smoothly across the image. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Set the output pixel value to be an average of the input pixels. 3. Repeat for each 2x2 block of the input image.
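- A minimal sketch of the 2x2 averaging operation described above is shown below; the equal weighting of the four input pixels and the even image dimensions are assumptions for the example.

```python
import numpy as np

def downsample_2x2_average(image):
    """Average each non-overlapping 2x2 block of a (H, W) or (H, W, C) image.

    H and W are assumed to be even for simplicity.
    """
    h, w = image.shape[:2]
    blocks = image.reshape(h // 2, 2, w // 2, 2, *image.shape[2:])
    # Each output pixel is the mean of its four input pixels A, B, C and D.
    return blocks.mean(axis=(1, 3))
```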
- Bicubic Downsampling In bicubic downsampling, the algorithm uses a third-order polynomial function to approximate the original pixel values based on the intensities of a neighborhood surrounding the current pixel. This method takes into account more details than bilinear interpolation but requires more computational resources.
- the basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Calculate the coefficients of the third-order polynomial function. 3. Calculate the output pixel values using the coefficients obtained in step 2. In the context of the compression pipeline 600 shown in Figure 6, either bilinear or bicubic downsampling methods can be used.
- the evaluated function (i.e. the loss function) may further be based on said differences comprising a structural similarity index measure (SSIM).
- SSIM is a quality metric that compares two images in terms of their structure and contrast. For example, when used here it evaluates the similarity between an output image and its corresponding input image or downsampled input image, upsampled output image, and/or any combination thereof.
- the method aims to optimize the neural networks for preserving the structural information in the images during the encoding and decoding process, thus improving the overall quality of the generated output images.
- since the human visual system is often able to perceive this higher level structural information, using SSIM allows the network to learn to optimise for this type of difference rather than a simpler mean squared error (MSE) loss.
- alternatively, MSE may be used as it is quicker and simpler to calculate and can accordingly speed up training times.
- the downsampling can be performed using a Gaussian blur filter. That is, it is envisaged that downsampling can be achieved through an implementation wherein the input image is filtered with a Gaussian blur.
- the Gaussian blur filter is a type of low-pass filter that smoothes out the image by reducing high-frequency details while preserving lower frequency information. This helps to reduce the complexity of the input image and makes it easier for the first, second and third neural networks to learn the underlying patterns in the data and may help the loss to converge during training.
- the downsampled input image produced is a smoother representation of the original input image, which can be used as an input for encoding using the first neural network. It is envisaged that end-to-end training is preferable as it makes the pipeline fully super-resolution aware. However, it can be challenging to get the loss to converge during training and/or for training to be stable when all neural networks are being optimised simultaneously.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. That is, the underlying compression pipeline 600 is trained first.
- the parameters of the first and second neural networks are then frozen, and the system proceeds with a secondary training phase in which it updates the third neural network's parameters.
- this split training approach can mitigate training instability.
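- For illustration only, a minimal sketch of such a split training schedule is given below; the optimiser, the step counts, the simple distortion loss and the omission of quantisation and entropy coding are assumptions made to keep the example short.

```python
import torch

def train_split(encoder, decoder, downsampler, upsampler, data,
                first_steps=100_000, second_steps=50_000, lr=1e-4):
    """Phase 1: train only the base compression pipeline.
    Phase 2: freeze it and train only the super-resolution wrapper."""
    base_params = list(encoder.parameters()) + list(decoder.parameters())
    opt1 = torch.optim.Adam(base_params, lr=lr)
    for _, x in zip(range(first_steps), data):
        x_hat = decoder(encoder(x))                   # quantisation/entropy coding omitted
        loss = torch.nn.functional.mse_loss(x_hat, x)
        opt1.zero_grad(); loss.backward(); opt1.step()

    for p in base_params:                             # freeze the base pipeline
        p.requires_grad_(False)

    opt2 = torch.optim.Adam(upsampler.parameters(), lr=lr)
    for _, x in zip(range(second_steps), data):
        x_hat = decoder(encoder(downsampler(x)))
        x_up = upsampler(x_hat)
        loss = torch.nn.functional.mse_loss(x_up, x)  # compare against the original input
        opt2.zero_grad(); loss.backward(); opt2.step()
```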
- the downsampler 701 may consist of a fourth neural network. Further, the training method also comprises updating the parameters of this fourth neural network based on the evaluated function.
- this approach provides the end-to-end system with an extra degree of freedom: it may produce downsampled input images that are not in any way visually pleasing or accurate representations of the input image, but which can be optimally encoded by the encoder into a latent representation that is distributed in a way that is efficiently entropy encodable and which can be decoded and subsequently upsampled into more accurate reconstructions of the input image than would otherwise be possible with a pipeline that is not super-resolution aware (in this case downsampler aware).
- the neural network downsampler can produce whatever output is needed by the neural compression pipeline to help achieve a target bitrate while maintaining final output image accuracy.
- One exemplary downsampler having a neural network architecture is a network comprising a plurality of convolutional layers with decreasing filter sizes and increasing strides, as this approach can effectively reduce the spatial dimensions of the input image while maintaining its overall structure.
- the CNN architecture typically consists of multiple layers, each layer being composed of several filters applied across the spatial dimensions of the input image.
- filters are learnable parameters that enable the network to extract various features from the image and recognize complex patterns or shapes within it.
- for a downsampler, we can start with an initial convolutional layer having a large filter size (e.g., 7x7) and a small stride (e.g., 2x2). This combination results in a significant reduction in spatial dimensions while still allowing the network to capture essential information from the input image.
- Subsequent layers can then use smaller filters (e.g., 3x3, 5x5) with larger strides (e.g., 2x2, 4x4), further reducing the size of the feature maps while also encouraging more localized receptive fields within the network.
- the choice of downsampler, and filter sizes and strides in a downsampler controls the balance between preserving important image details and efficiently processing the data.
- the downsampler may comprise multiple convolutional layers with decreasing filter sizes and increasing strides, followed by one or more max-pooling layers to further reduce the spatial dimensions of the input image. Reducing the number of layers and using small filters or kernels helps to speed up run time.
- a further illustrative downsampler may comprise a network architecture with a number of layers with a stride greater than 1. Every such layer will downsample by a factor of the stride.
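- As a non-limiting illustration of such an architecture, a sketch using stride-2 convolutions (each halving the spatial resolution) is given below; the specific kernel sizes, channel counts and overall 4x reduction factor are assumptions for the example.

```python
import torch.nn as nn

class ConvDownsampler(nn.Module):
    """Example 4x downsampler: two stride-2 convolutional layers, each
    reducing the spatial dimensions by a factor of 2."""

    def __init__(self, in_channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=7, stride=2, padding=3),  # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),       # H/4 x W/4
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, kernel_size=3, stride=1, padding=1),  # back to image channels
        )

    def forward(self, x):
        return self.net(x)
```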
- the upsampler may comprise a bilinear or bicubic upsampler.
- Bilinear upsampling is a method for upsampling an image. It involves estimating output pixel values by linearly interpolating between the neighboring pixels in the input (lower resolution) image.
- An example implementation algorithm for bilinear upsampling may be as follows: 1. For each pixel in the output image, find its corresponding position in the input image. 2. Identify the four nearest input pixels around this position. 3. Set the output pixel value to a weighted average of these four pixels, weighted by their distances to the position. 4. Repeat steps 1-3 for every pixel in the output image.
- An example implementation algorithm for bicubic upsampling may be as follows: 1. For each pixel in the output image, find its corresponding pixel in the input image. 2. Find 16 nearby pixel coordinates around this central coordinate of the input image. These are typically referred to by their compass directions, for example northeast (NE), north-northeast (NNE), northwest (NW), southwest (SW), southeast (SE), south-southeast (SSE), south (SO), and south-southwest (SSW). 3. Fit a bicubic function with coefficients that comprise weighted sums of the values of the input pixels. 4. Repeat steps 1-3 for every pixel in the output image.
- bilinear and bicubic interpolation can produce reasonably good results when upsampling images, but the choice between them will depend on the specific use case and desired level of detail preservation.
- bilinear upsampling is envisaged to be preferred as it is faster and can reduce runtime while working in a pipeline 700 with a (typically slower) neural network based downsampler on the encode side.
- training a neural network upsampler in a fully end-to-end manner may introduce training instability and slow convergence. To address this, the training may be split into phases.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps.
- training a neural network based downsampler in an end-to-end manner is further complicated because it is not straightforward what the downsampler's training objective should be, i.e. what its loss ought to be based on.
- the loss terms that include the output of the downsampler may compare the input image and the immediate output of the downsampler, or may be based on a previously downsampled version of the input image (e.g. created by a traditional downsampling method) to try to teach the downsampler to mimic a traditional downsampler, or on some other difference.
- both the upsampler and downsampler may be neural networks. This is shown in Figure 8 which corresponds to Figure 5 but where the upsamplers and downsamplers comprise neural networks. Like-numbered references indicate like features which are not repeated here for brevity.
- FIG. 8 illustratively shows a pipeline 800 comprising a neural network downsampler 801 and a neural network upsampler 802 wrapped around a neural compression pipeline. It is envisaged that these may be trained in an end-to-end fashion. More specifically, by making the loss function be based on comparisons between not just the input image and the final upsampled output image, but also between the outputs of the downsampler and the various other outputs of the pipeline, as well as optionally a previously downsampled image, the network learns to become super-resolution aware and outperforms networks where the training of the neural compression pipeline is not connected to the down- and up-samplers either through training or through the calculated comparisons of the terms of the loss function.
- the methods described above may comprise entropy encoding the latent representation into a bitstream with a specified length, wherein the function used in the method is also dependent on said bitstream length. That is, the loss function further comprises a rate term. Including the rate term in the loss function allows the networks to learn to optimise for bit rates (e.g. in bits per pixel) simultaneously with image reconstruction accuracy. Some or all of the loss terms may be scaled or weighted with respect to each other to focus the learning on any of the objectives as defined by the different loss terms.
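- An illustrative sketch of how distortion terms and a rate term might be weighted within a single loss is shown below; the particular terms, the weights and the bits-per-pixel rate proxy are assumptions, and in practice any subset of the differences listed above may be used.

```python
import torch.nn.functional as F

def total_loss(x, x_down, x_hat, x_up, bitstream_length_bits, num_pixels,
               w_full=1.0, w_low=0.5, w_rate=0.05):
    """Weighted rate-distortion loss for a super-resolution aware pipeline.

    x:      original input image
    x_down: downsampled input image
    x_hat:  decoder output (low resolution)
    x_up:   upsampled output image
    """
    distortion_low = F.mse_loss(x_hat, x_down)    # output vs downsampled input
    distortion_full = F.mse_loss(x_up, x)         # upsampled output vs original input
    rate = bitstream_length_bits / num_pixels     # bits per pixel (proxy for the rate term)
    return w_low * distortion_low + w_full * distortion_full + w_rate * rate
```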
- the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image may be determined based on the output of a fifth neural network acting as a discriminator.
- the output of the discriminator may be the differentiation between a ground truth image and a "fake" (i.e. compressed) reconstructed image during training.
- the discriminator loss term used in the training of the encoder/decoders of the AI compression pipeline is only a function of the compressed image.
- the training tries to encourage the neural networks to change in such a way that their output will be more realistic and like the ground truth image.
- Faithfulness to the ground truth image is taken care of by the distortion loss term (e.g. mean squared error) or other loss.
- the evaluation of the function based on these differences guides the process of updating the parameters of the first neural network (encoder) and the second neural network (decoder), leading to improved performance and better results over time. This approach can help to improve the overall quality of the generated output images.
- the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
- MSE measures the average squared difference between two images
- SSIM computes structural similarity based on luminance, contrast, and structure.
- the function may further comprise a term defining a visual perceptual metric that models how a human visual system may perceive differences.
- the term defining a visual perceptual metric may comprise an MS-SSIM loss. This loss function serves to gauge how effectively the network is approximating the input image with the output image. By iteratively minimizing this loss function through parameter updates in the neural networks, the trained neural network improves its ability to generate an output image that closely resembles the input image.
- the above described methods may be used in the context of any pre-trained neural compression network, and accordingly the present disclosure envisages a method where only the weights of the upsampler and/or downsampler are updated during training. Accordingly, such a method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image. Next, the output image is upsampled with an upsampler comprising a third neural network.
- the difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image is evaluated using a function, parameters of the third neural network are updated based on the evaluated function. These steps are then repeated using a first set of input images to produce a first trained neural network and a second trained neural network.
- the present disclosure envisages a method comprising receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image that approximates the input image, evaluating a function based on differences between the output image and other images, updating the parameters of the fourth neural network based on the evaluated function, and repeating these steps with a first set of input images to create trained versions of the first and second neural networks.
- producing the previously downsampled input image may be performed by either bilinear or bicubic downsampling techniques.
- the present disclosure also proposes using a network trained in accordance with the above-described methods. That is: receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
- a bitrate ladder refers to a set of predefined bitrates applied within an encoded file.
- bitrate ladder refers to a series of different bitrates that can be chosen to achieve the desired trade-off between video quality and file size.
- a balance is struck between two conflicting goals: achieving high visual quality while minimizing the file size.
- the process involves converting the raw data into a compressed format that requires less storage space. To accomplish this task in traditional compression, a number of known algorithms are used, often dictated by compression codec standards.
- H.264/AVC Advanced Video Coding
- the H.264 standard includes various profiles and levels that define the maximum bitrate and other parameters for a given video stream. Implementers of the standard typically target these profiles and levels to ensure their implementation is standard compliant. More specifically, when using these standards, implementers can choose from different preset bitrates. These predefined bitrates are often referred to as a "ladder" because they represent a series of steps or options available when choosing the optimal encoding settings for a given video file.
- bitrate ladders make use of the idea that the encode resolution can be varied a priori whereby the streaming provider has already pre-encoded its content at a plurality of different resolutions, which in turn facilitates giving an end user a choice of what quality setting to apply given some particular bitrate budget.
- the bitrate ladder approach works by progressively decreasing the bitrate from one level to another, allowing a given balance between quality and file size to be found. For example, starting at a higher-than-desired bitrate, you can gradually reduce it until you reach an acceptable level of visual degradation without sacrificing too much detail or clarity.
- the neural networks are typically trained to perform optimally at a given bitrate.
- covering all the predefined bitrates of a given bitrate ladder may require training separate neural networks for each predefined bitrate which can be burdensome and may result in the final codec library memory footprint being potentially very large.
- the above-described approaches may be used to make the base neural compression networks super-resolution aware, which allows a given base neural compression pipeline to be used not only for compression at its targeted bitrate, but also at other bitrates in the bitrate ladder by applying the super-resolution wrapper around the base models when desired.
- a research-stage approach that works well and fast on a GPU such as an NVIDIA 4090, A10 or A100 card is very unlikely to achieve the same performance on resource-constrained mobile device platforms such as laptops, tablets and smartphones.
- One area of AI-based video compression where this is particularly problematic is in the implementation of downsampling and upsampling algorithms. More particularly, one common component of such upsampling and downsampling algorithms is a process known as PixelShuffle and PixelUnshuffle. Both operations manipulate the arrangement of data in tensors (multi-dimensional arrays) that represent images. PixelShuffle is often used in super-resolution models. That is, in general terms, PixelShuffle increases the resolution of an input image by rearranging the elements of a tensor.
- the pseudocode below illustrates a PixelShuffle operation: Input Tensor: shape [batch, C·r², H, W], where: C is the number of channels (e.g., 3 for an RGB image), r is the upscale factor, H and W are the height and width of the tensor.
- Rearrangement of Data PixelShuffle rearranges elements in this tensor to form a new tensor of shape [batch, C, H·r, W·r]. Essentially, it "shuffles" the data from the channel dimension into the spatial dimensions (height and width). Upscaling: This operation effectively upscales the image by a factor of r, increasing the resolution.
- each pixel in the original image is rearranged to form a 2x2 block in the output image.
- PixelShuffle is used in the latter stages to upscale the low-resolution input to a high-resolution output. It’s a part of the sub-pixel convolution technique where the model first increases the number of channels with additional convolutions and then uses PixelShuffle to upscale the image spatially.
- PixelUnshuffle is the reverse operation of PixelShuffle. It’s used to decrease the spatial resolution of an image while increasing the number of channels.
- PixelUnshuffle operation Input Tensor: shape [batch, C, H, W].
- Rearrangement of Data PixelUnshuffle rearranges elements to form a new tensor of shape [batch, C·r², H/r, W/r]. It does so by taking spatial blocks of size r x r and stacking them depth-wise in the channel dimension.
- PixelUnshuffle can be used in tasks like feature extraction, where reducing spatial resolution while retaining information in the channel dimension might be beneficial. It’s also useful in certain generative models or autoencoders where manipulating spatial resolution at different stages of the network is required. More generally, PixelShuffle is used for upscaling an image by rearranging the channel data into the spatial dimensions, whereas PixelUnshuffle does the opposite, downscaling an image by rearranging spatial data into the channel dimension. PixelShuffle and PixelUnshuffle are specific implementations of "depth-to-space" and "space-to-depth" operations. These can be explained in their generalised form as follows.
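- The shape bookkeeping of these operations can be checked with the short sketch below, using the PixelShuffle and PixelUnshuffle modules available in common deep learning frameworks; the tensor sizes are arbitrary examples.

```python
import torch
import torch.nn as nn

r = 2                                  # upscale factor
x = torch.randn(1, 3 * r * r, 8, 8)    # [batch, C*r^2, H, W]

shuffle = nn.PixelShuffle(r)
unshuffle = nn.PixelUnshuffle(r)

y = shuffle(x)                         # depth-to-space: [1, 3, 16, 16]
z = unshuffle(y)                       # space-to-depth: back to [1, 12, 8, 8]

assert y.shape == (1, 3, 16, 16)
assert z.shape == (1, 12, 8, 8)
```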
- the pseudocode below illustrates a depth-to-space operation:
- Input Tensor The operation takes an input tensor of shape [batch, C·r², H, W], where C is the number of channels, r is the upscale factor, and H and W are the height and width.
- Rearrangement of Data It rearranges the elements of this tensor to form a new tensor of shape [batch, C, H·r, W·r]. This rearrangement involves redistributing the elements from the depth (channels) into the spatial dimensions (height and width).
- Upscaling Effect The result is an upscaling of the image or feature map by a factor of r, with each pixel in the original tensor contributing to a block of pixels in the output tensor.
- the pseudocode below illustrates a space-to-depth operation: Input Tensor: the input of a space-to-depth operation is a tensor of shape [batch, C, H, W].
- Rearrangement of Data Space-to-Depth rearranges elements to produce a new tensor of shape [batch, C·r², H/r, W/r]. It does this by taking blocks of pixels from the spatial dimensions and stacking them in the channel dimension.
- the above improvement is envisaged to be used in the context of, for example, super-resolution such as that described in concept 1, such as in the upsamplers and/or downsamplers in Figures 5 to 8. It is generally applicable to any instances where upsampling or downsampling might be performed in a neural compression pipeline.
- one or more layers in the first or second neural networks of Figures 1, 2, 11 or 14 may be configured to downsample or upsample an input.
- the flow module 206 in Figure 3 may comprise one or more layers or modules configured to downsample the input. This is one way to speed up runtime, as the flow often need not be estimated at as high a resolution as the input image: the quality of reconstructed images created with a low resolution flow can be similar to that of reconstructed images created with a high resolution flow. This downsampling, if performed using traditional space-to-depth, would be a bottleneck.
- the corresponding inverse upsampling may then be applied at the output of the flow module 206, again using convolutional operations rather than depth-to-space.
- a corresponding set of operations may be performed with the residual module of Figure 3, in any of the modules of the hypernetwork in Figure 2, and/or in the compression pipeline of Figure 1.
- An exemplary implementation of mimicking space-to-depth may be as follows. CONVOLUTIONAL LAYER SETUP: Kernel Size: The kernel size is envisaged to match the block size that is to be mimicked.
- the corresponding convolution kernel size would be r x r (2x2 in the above example).
- Stride The stride size is envisaged to equal the block size (r). This ensures that the convolutional filters move across the image in steps equal to the block size, effectively capturing the spatial blocks.
- Number of Filters It is envisaged that the number of filters is set to C·r², where C is the original number of channels.
- each filter produces an output that corresponds to one depth level in the Space-to-Depth transformation
- SEQUENTIAL CONVOLUTION LAYERS To fully replicate Space-to-Depth, it is envisaged that a number of convolutions may be used sequentially, e.g. by using a series of convolutional layers. This allows for the handling of cases where the channel increase (to C·r²) is significant. Each layer progressively accumulates more spatial information into the depth dimension. For completeness, it is also possible to replicate space-to-depth with a single strided convolution. Splitting it into multiple convolutions with activations between them provides additional expressive power, but isn’t needed to just replicate the functionality of space-to-depth.
- ACTIVATION FUNCTIONS It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially.
- An exemplary activation function may be ReLU, which introduces a non-linearity and helps in learning spatial patterns.
- CHANNEL REARRANGEMENT Finally, after these convolutional operations, the output channels may optionally be rearranged to match the order that a space-to-depth operation produces. Alternatively, this can be done at any point in the process, and need not be done "on the fly”.
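- A minimal sketch of replacing a space-to-depth operation with a single strided convolution is given below; the fixed kernel initialisation that exactly reproduces the rearrangement is one possible choice for illustration, and in practice the kernel would typically be left learnable.

```python
import torch
import torch.nn as nn

def space_to_depth_conv(channels, r):
    """A stride-r convolution with C*r^2 filters that reproduces a
    space-to-depth (PixelUnshuffle) rearrangement."""
    conv = nn.Conv2d(channels, channels * r * r, kernel_size=r, stride=r, bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        # Each output filter copies one (channel, dy, dx) position of the r x r block.
        for c in range(channels):
            for dy in range(r):
                for dx in range(r):
                    out_idx = c * r * r + dy * r + dx
                    conv.weight[out_idx, c, dy, dx] = 1.0
    return conv

x = torch.randn(1, 3, 8, 8)
conv = space_to_depth_conv(3, 2)
assert torch.allclose(conv(x), nn.PixelUnshuffle(2)(x))
```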
- An exemplary implementation of mimicking depth-to-space, i.e. upsampling, may be as follows.
- CONVOLUTIONAL LAYER SETUP Kernel Size: A kernel size is envisaged that aligns with the desired spatial expansion. For example, if the upscale factor is r, a larger kernel size (like 3x3 or larger) can be more effective in spreading out the information across a larger spatial area.
- the depth-to-space operation can be implemented with a single transposed convolution with stride equal to the upsampling factor. It is also possible to make a more expressive process by using larger kernel sizes and/or splitting a more extensive upsample into multiple stages. Stride: It is envisaged that the stride may be set to 1, ensuring a uniform spread of information.
- stride may be equal to the upsampling factor, as described above.
- Number of Filters This may be less than the original number of channels to reduce the channel dimension gradually. The exact number can vary depending on the architecture and desired output.
- SEQUENTIAL CONVOLUTION LAYERS Multiple convolutional layers may be advantageous, especially if the change from depth to spatial dimensions is significant. Each layer can gradually increase the spatial dimensions and reduce the depth.
- ACTIVATION FUNCTIONS It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially.
- An exemplary activation function may be ReLU, which can introduce non-linearity and help in learning spatial patterns.
- UPSAMPLING LAYERS Alongside the convolutional layers, upsampling layers (like nearest neighbor or bilinear upsampling) can be used to increase the spatial dimensions. These can be alternated with convolutional layers to progressively achieve the desired spatial expansion.
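- Similarly, a minimal sketch of replacing depth-to-space with a single transposed convolution whose stride equals the upsampling factor is given below; the identity-style kernel initialisation is again purely illustrative and the kernel would normally be learned.

```python
import torch
import torch.nn as nn

def depth_to_space_conv(channels, r):
    """A stride-r transposed convolution with C output filters that
    reproduces a depth-to-space (PixelShuffle) rearrangement."""
    conv = nn.ConvTranspose2d(channels * r * r, channels, kernel_size=r, stride=r, bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        for c in range(channels):
            for dy in range(r):
                for dx in range(r):
                    in_idx = c * r * r + dy * r + dx
                    conv.weight[in_idx, c, dy, dx] = 1.0
    return conv

x = torch.randn(1, 12, 8, 8)           # C*r^2 = 3*2*2 channels
conv = depth_to_space_conv(3, 2)
assert torch.allclose(conv(x), nn.PixelShuffle(2)(x))
```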
- any suitable kernel size, stride and filter numbers are envisaged, as are the number of optional sequential convolution layers, activation functions and other steps.
- Figure 9 shows an example sequence of layers of a neural network which takes an input image and downsamples it.
- the sequence of layers comprises a 3x3 Conv layer, a ReLU activation function, a space-to-depth (2x) operation, a 1x1 conv layer, another ReLU activation layer and finally a depth-to-space (3x) operation.
- This approach uses space-to-depth and depth-to-space.
- Figure 10 illustratively shows an example sequence of layers of a neural network or compression pipeline, for example one or more of the neural networks or compression pipelines shown in any of Figures 1 to 8 but where the space-to-depth and depth-to-space operations have been replaced by convolution operations.
- the depth-to-space replacement comprises a strided transposed convolution and the space-to-depth replacement comprises a strided convolution.
- This implementation substantially reduces the bottlenecks when running in resource-constrained environments such as on a CPU.
- Concept 3 Regional Kernel Absolute Deviation (RKADe) for Flow
- In image processing, Mean Absolute Difference (MAD) is a technique for detecting and numerically estimating differences between pixels and/or pixel patches, that is, differences between the values of one or more pixels in one image and the values of one or more pixels in another image.
- FIG. 11 illustrates an example of a flow module, in this case a network 1100, configured to estimate information indicative of a difference between an image at frame position t-1 and an image at frame position t, e.g. flow information.
- Figure 11 is provided as a non-limiting example of how such flow information may be calculated between two images.
- the flow module may be used in or together with the flow module part 206 ( Figure 3) of the compression pipeline.
- the network 1100 in Figure 11 comprises a set of layers 1101a, 1101b respectively for an image from frame position t-1 and an image from frame position t of a sequence of frames.
- the set of layers 1101a, 1101b may define one or more convolution operations and/or nonlinear activations (for example as described in concept 2 above) to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 x w/2 downsampling in a first layer, h/4 x w/4 downsampling in a second layer, h/8 x w/8 downsampling in a third layer, h/16 x w/16 downsampling in a fourth layer, h/32 x w/32 downsampling in a fifth layer, h/64 x w/64 downsampling in a sixth layer, and so on.
- a first cost volume 1102 is calculated at the coarsest level between the feature map pixels of the first image and the corresponding feature map pixels of the second image.
- Cost volumes define the matching cost of matching the pixels in one image with the pixels in a second image (which may be later in time, or earlier in time, for example due to the order in which B-frame processing typically occurs, which is not necessarily the chronological order of the frames).
- the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of similarity such as a vector or dot product, a cosine similarity, a mean absolute difference, or some other measure of similarity.
- This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the closeness metric is being calculated. This process is computationally expensive in floating point space but can be implemented efficiently in integer or fixed point space.
- a first flow 1103 can be estimated from the first cost volume 1102. This may be achieved using, for example a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person.
- the same process is then repeated for the other levels of feature map coarseness to calculate a second cost volume 1104 and second flow 1105, and so on for the cost volumes and flows associated with each of the levels of coarseness until they have all been calculated, up to the final cost volume 1106 and flow 1107.
- the weights and/or biases of any activation layers in network 1100 are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline.
- the trainable nature of these parameters provides the network 1100 with flexibility to produce feature maps at each level of spatial resolution (i.e. pyramid feature maps) and/or at the flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on).
- this allows the network 1100 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module.
- the flow of the previous level or levels of coarseness or resolution may be used to warp 1108, 1109, the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t - 1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details.
- the flow estimation output may be upsampled 1110, 1111 (for example using the methods of concept 2, or using any other upsampling method) first to match the coarseness resolution of the feature map to which the flow is being applied in the warping process.
- the outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
- the flow or representation thereof may then be transmitted in a bitstream and decoded by a flow decoder, the output of which may in turn be used to warp a previously decoded image for use in the residual encoder/decoder 1110 arrangement as shown in e.g. Figure 3.
- the cost volumes may be used to compute local (translational) alignment through patch-wise comparisons. For example, let X, Y ∈ R^(c×H×W) be two tensors each with c ∈ N channels and spatial dimensions (H, W) ∈ N².
- a cost volume may then be constructed as C(p, d) = ⟨X(p), Y(p+d)⟩ for a pixel position p and local offset d, where ⟨·,·⟩ is the inner product over the channel dimension; that is, the cost volume is constructed by computing the channel-wise correlation. MAD may be used in the calculation of cost volumes as follows.
- let P : R^(c×H×W) → R^(c×(2r+1)²×H×W) be a patch operator that associates to each pixel the (2r+1) x (2r+1) patch of values centred at that pixel, for an integer radius r ≥ 0. A MAD-based cost volume may then be formed by taking the mean absolute difference between corresponding patches of the two tensors.
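- For illustration only, a straightforward (non-optimised) sketch of such a MAD-based local cost volume is given below; as a simplification, each pixel of one feature map is compared channel-wise to the pixels of the other feature map within a local search window (comparing full patches is a straightforward extension), and the text below describes why naive implementations of this kind are inefficient on CPU.

```python
import torch
import torch.nn.functional as F

def mad_cost_volume(x, y, radius=3):
    """Mean absolute difference between each pixel of x and the pixels of y
    within a (2*radius+1)^2 search window.

    x, y: tensors of shape (B, C, H, W).
    Returns a cost volume of shape (B, (2*radius+1)^2, H, W).
    """
    b, c, h, w = x.shape
    y_pad = F.pad(y, (radius, radius, radius, radius))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            y_shift = y_pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((x - y_shift).abs().mean(dim=1))  # mean over channels
    return torch.stack(costs, dim=1)
```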
- the MAD-based cost volume estimation is more computationally efficient than other known cost volume estimation methods and accordingly synergistically helps to reduce run times of the flow estimation of an AI-based compression pipeline.
- the present inventors have found that implementing the operations used to perform MAD calculations at the machine level has a number of downsides, particularly when the MAD calculations are implemented using convolutions. Specifically, an input tensor of arbitrary dimensions is typically stored in non-contiguous blocks of memory.
- the values of the spatial dimensions H and W for one channel of an input tensor may be stored in a first block of memory while the values for another channel may be stored in a second block of memory and so on.
- when a MAD value is estimated that involves this multi-channel input tensor, the values of the elements for one channel are accessed from one of the memory blocks, the values of the elements for another channel are accessed from another of the memory blocks, and so on. This means the number of memory access operations can be very high even for relatively simple operations.
- the strided sum operation comprises depth or channel-wise grouped convolutions (i.e. convolutions applied in depth-wise groups, each group taking as inputs the values of the spatial dimensions H and W stored in non-contiguous memory blocks). That is, the stride is in the depth (i.e. channel) dimension, necessitating the access and retrieval of the values of the elements stored in separate memory blocks.
- if the grouping of the convolutions is instead in the spatial dimensions rather than the depth dimension, the non-contiguous memory block problem still arises but now in the depth dimension.
- FIG. 12a A schematic of this strided sum based implementation of a MAD cost volume calculation is shown in Figures 12a, 12b and 12c. Illustrated in Figure 12a is a toy representation of input tensors 1200a, 1200b respectively associated with a first image and a second image.
- Each input tensor has 3 channels: channel 1 (Ch1), channel 2 (Ch2) and channel 3 (Ch3).
- a first step repeat interleaving 1201 is performed on the first image input tensor 1200a to produce a first intermediate output.
- an unfold convolution operation 1202 is performed on the second image input tensor 1200b to produce a second intermediate output.
- the unfold convolution operations are performed as group convolutions and accordingly each group is assigned its own memory block.
- an absolute difference is estimated between the intermediate outputs of the first step and the second step to produce a third intermediate output.
- the elements of the third intermediate output are stored in said respective, different memory blocks - in this case memory blocks 1, 2 and 3 (Mb1, Mb2, Mb3) to match the number of outputs.
- a strided sum 1204 is performed on the estimated absolute differences stored in the respective memory blocks 1, 2, and 3 to produce the MAD output cost volume tensor. Again by virtue of the group convolutions of the strided sum 1204 and by virtue of the memory block locations, this operation requires non-contiguous memory blocks to be accessed for each of the convolutions of the strided sum 1204, as is illustrated in Figure 12c.
- the goal of the cost volume is to construct a spatial comparison operator that encodes some notion of how two patches are related. For a given pixel position p, the cost volume thus has a measure of comparison between X(p) and Y(p+d) for a collection of local offsets d.
- the present inventors have realised that any effective encoding of this information can suffice in AI-compression pipelines because the neural networks that make up the flow encoder and/or decoder and residual encoder and/or decoder, and indeed any other neural networks that make up the AI-compression pipelines of Figures 1-14, are able to learn to accommodate the encoding of this information, regardless of how it is represented.
- this may be the comparison of a pixel in a first image to pixels in a radius around a corresponding pixel coordinate in a second image.
- a radius ⁇ local cost volume one must compare a pixel to (2 ⁇ + 1)2 reference pixels (e.g., for radius 3 there are 49 reference pixels in a 7x7 block).
- a compressive encoding of the cost volume c may look like: c ↦ A(c), where A is an appropriately chosen random matrix with A ∈ R^(m×N) satisfying m ∈ O(log N). From this, the present inventors have realised that it is possible to build a learned mapping that replaces the above-described classical, naive patch-wise comparison approach of estimating local cost volumes, which requires a large number of operations.
- the learned mapping effectively bypasses the direct computation of the local cost volume and instead computes the lower-dimensional compressively encoded cost volume A(c) directly.
- This compressively encoded cost volume, or a representation thereof produced by a post-processing step, contains substantially the same information as a traditional cost volume but this information is provided in a lower-dimensional representation that can still be passed to any subsequent, downstream components of the AI-compression pipeline that rely on cost volumes in the usual way. Examples of downstream components may include layers in the flow modules, the final estimation of flow, the warping in or after the flow module, and so on.
- the compressively encoded cost volume facilitates the estimation of image differences far more efficiently than traditional cost volume calculation methods.
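As a minimal sketch of the compressive encoding idea, assuming a fixed Gaussian random projection and an arbitrary choice of target dimension m (both assumptions, not values prescribed by the description), a dense cost volume could be projected channel-wise as follows:

```python
# Illustrative only: compressively encoding a dense local cost volume with a
# fixed random projection so that downstream layers operate on m << d channels.
import math
import torch

def compress_cost_volume(cost, seed=0):
    """cost: (B, d, H, W) dense cost volume with d = (2r+1)^2 offsets."""
    b, d, h, w = cost.shape
    m = max(1, 2 * math.ceil(math.log2(d)))            # assumed target dimension ~ O(log d)
    g = torch.Generator().manual_seed(seed)
    # Random projection, scaled so norms are preserved in expectation.
    A = torch.randn(m, d, generator=g) / math.sqrt(m)   # (m, d)
    # Applying A along the offset dimension is equivalent to a 1x1 convolution.
    return torch.einsum('md,bdhw->bmhw', A, cost)        # (B, m, H, W)
```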
- This approach is described hereinafter as Regional Kernel Absolute Deviation (RKADe) and allows for any MAD operations in the AI-compression pipeline to be substituted by RKADe operations which mitigate the memory interleaving issue described above in connection with using MAD and naive patch-wise comparisons for cost volume estimation, and generally provide a way to more efficiently estimate differences between images.
- consider a mapping, e.g. an input, an operation applied to that input, and an output, where the mapping can be almost identically represented in fewer dimensions.
- cost volumes are typically sparse in a transform of the spatial domain.
- cost volumes representing the difference between two images can be substantially simplified. For example, if a single pixel in one image is being compared to a pixel patch within a pixel radius of 3 around the corresponding coordinate in a second image, this would entail 49 pixel comparison operations to obtain the cost volume associated with that pixel using traditional methods. It turns out that the vast majority of these pixel operations are related to redundant information when the input and/or output are sparse as there is little or no difference.
- the cost volume can more efficiently be estimated by applying a fixed or learned map to the input to produce an identical or almost identical cost volume.
- Applying a map, e.g. a linear map implemented as a number of convolution operations, is computationally efficient and fast, and provides a lower-dimensional representation of otherwise the same information that would have been provided in a cost volume estimated using traditional methods.
- the compressive cost volumes that are computed using the RKADe approach come with an additional, substantial run-time saving, because any downstream tensor operations (TOPs) that take place are in the lower dimensions of the compressive cost volume compared to the higher dimensions of the traditional cost volume; and none of the steps comprising RKADe require grouped convolutions, slow memory access or array operations like de-interleaving.
- FIG 13 illustratively shows the above steps of RKADe.
- an illustrative RKADe workflow on a toy example comprises four elements: a feature map F, a region map R, a transform map T, and an absolute difference operation.
- the feature map F defined by one or more layers comprising one or more filters defined by a plurality of weights, operates as a linear embedding of the input tensor into a feature space. It is a local embedding in the sense that it may comprise a 1x1 convolution. Efficient implementations of 1x1 convolutions on a wide variety of commercial CPUs, GPUs and NPUs are known to the skilled person.
- the feature map operation filters may comprise random weights, and do not need to be trained (although it is envisaged that they may be trained in some circumstances).
- the feature map operation weights may be randomly distributed weights, and the operation relies on the favourable properties of high-dimensional random embeddings to preserve local geometry.
- the feature map F operation filters are instantiated using random weights with a normalisation that makes it an isometry in expectation. That is, the feature map operation filters comprise weights that apply a transformation that, on average, preserves geometrical distances of the distribution it is applied to. In other words, the feature map operation preserves a norm of the inputs in expectation.
- when a convolution defining a feature map is an isometry in expectation, the norm of the input to which the convolution is applied is preserved in the output (in a probabilistic sense).
- These fixed, random weights of F and the isometry in expectation property of F effectively mean that, during inference, estimating differences between two images comprises applying a convolution with the random weights that correspond to the random weights the convolution was initialised with, rather than weights modified in some way during subsequent training.
- An example of a suitable random distribution of weights of F is any suitably initialised sub-gaussian distribution.
- initialised means initialised based on the shape of the input and/or output tensors.
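A minimal sketch, assuming PyTorch, of instantiating such a fixed, random 1x1 feature map; the Gaussian distribution and the 1/sqrt(c_out) scaling are one standard way to obtain an isometry in expectation and are assumptions rather than the specific normalisation used in the description.

```python
# Hypothetical instantiation of the feature map F as a fixed random 1x1 convolution.
import math
import torch
import torch.nn as nn

def make_feature_map(c_in, c_out):
    f = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
    with torch.no_grad():
        # For a 1x1 kernel, E[||W x||^2] = ||x||^2 when each weight ~ N(0, 1/c_out),
        # i.e. the map preserves the input norm in expectation.
        f.weight.normal_(mean=0.0, std=1.0 / math.sqrt(c_out))
    f.weight.requires_grad_(False)   # kept fixed, per the description above
    return f
```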
- the region map R operation comprises a composition of 3x3 convolutions, optionally with intermediate non-linearities such as one or more ReLU maps or activations.
- the purpose of the region map R is to entangle, in an output pixel, the information present in a local patch about the same pixel in the input (reference) image.
- entangling information means combining information.
- the "radius" of the local patch is determined by the number of convolutions in R (e.g., 3 convolutions gives a patch radius of 7). Because R is comprised of 3x3 convolutions and optionally simple non-linearities, it is possible to use known, efficient 3x3 convolution implementations to run efficiently on a wide variety of commercial CPUs, GPUs and NPUs.
- R is instantiated using a weight normalisation that makes it an isometry in expectation.
- the transform map T operation serves as a post-embedding or post-processing of the feature embedding and the entangled patch information, permitting the effective comparison of the two.
- This transform map T operation can be a simple 1x1 convolution, thereby being local, linear, fast, and efficient to implement across a wide variety of commercially available CPUs GPUs and NPUs.
- T may be instantiated using a weight normalisation that makes it an isometry in expectation.
- the weights of the transform map T may correspond to those of the feature map F. Further, the number of input channels for F can naively be set to any positive integer.
- the number of output channels of F may match that of R and hence the number of input and output channels of R must be the same.
- the number of input channels of T does not need to match its number of output channels. Note for completeness that the number of input channels of F can be set naively because, if the mathematical relationships of RKADe are to hold in practice, it is envisaged that there is a sufficient dimensional relationship between the pixel radius (encoded by the number of layers of R) and the number of output channels of F. This ensures the shapes of the objects being compared match when the absolute difference is subsequently calculated.
- the feature map F is applied to a first input tensor representation 1300a of a first image, for example a current frame x_t of a sequence of images, and a second input tensor representation 1300b of a second image, for example a previous frame x_ref of the sequence, which may contain movement relative to the first image.
- for each pixel coordinate of the first input tensor 1300a, a comparison will be made with pixels at coordinates within a predetermined radius of that coordinate in the second input tensor 1300b.
- An illustrative toy example of a 1 pixel radius around a center pixel coordinate is indicated with the dotted borders in Figure 7.
- the feature map convolution operation F is applied to the pixels of the first input tensor 1300a and the associated patches of the predetermined radius in the second input tensor 1300b.
- the region map R convolution operation is then applied to the output of the feature map convolution operation on the second input tensor 1300b.
- a transform map T convolution operation is then optionally applied to the intermediate outputs, before an absolute difference is estimated, resulting in the RKADe cost volume tensor. It will be appreciated from Figure 13 that none of the feature map F convolution, region map R convolution or transform map T operation require grouped convolutions.
- the intermediate outputs may be easily stored in contiguous memory blocks without requiring a large number of memory read and write operations to interleave or de-interleave the data in memory.
- cost volume estimations in the RKADe approach are substantially sped up compared to traditional, naive patch-wise comparison approaches.
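The following PyTorch sketch illustrates an RKADe-style module built from the F, R and T maps and a final absolute difference, as described above. The channel counts, the depth of R, the presence of ReLU non-linearities and the decision to apply T to both branches are illustrative assumptions, not details prescribed by the description.

```python
# Hedged sketch of a compressive, RKADe-style cost volume module.
import torch
import torch.nn as nn

class RKADe(nn.Module):
    def __init__(self, c_in=3, c_feat=32, radius=3):
        super().__init__()
        # F: local (1x1) linear embedding into the feature space.
        self.F = nn.Conv2d(c_in, c_feat, kernel_size=1, bias=False)
        # R: composition of 3x3 convolutions with intermediate non-linearities;
        # `radius` 3x3 layers give a (2*radius+1) x (2*radius+1) receptive field.
        layers = []
        for _ in range(radius):
            layers += [nn.Conv2d(c_feat, c_feat, kernel_size=3, padding=1, bias=False),
                       nn.ReLU()]
        self.R = nn.Sequential(*layers[:-1])   # drop the final non-linearity
        # T: optional 1x1 post-embedding applied before the comparison.
        self.T = nn.Conv2d(c_feat, c_feat, kernel_size=1, bias=False)

    def forward(self, cur, ref):
        # Embed both frames, entangle the reference patch information with R,
        # post-process with T, then take an absolute difference. No grouped
        # convolutions or de-interleaving are required.
        a = self.T(self.F(cur))
        b = self.T(self.R(self.F(ref)))
        return (a - b).abs()                    # compressive cost volume
```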
- F, R and/or T may be kept fixed (e.g. F’s weights may be fixed and randomly distributed between minimum and maximum values, as described above), or may be trained. Keeping the maps fixed facilitates straightforward deployment by substituting any naive patch-wise comparisons in AI-compression pipelines, (e.g. a large number of MAD-based operations).
- F is a linear map operating pixel-wise (i.e. a 1x1 convolution, as described above).
- RKADe runs on hardware without the requirement of grouped convolutions, and thus solves the slow memory access or array operations like de-interleaving of cost volume approaches such as naive patch-wise MAD.
- F, R and/or T may be individually or separately trainable, for example by setting a "requires_grad_" PyTorch or similar flag in their respective code-level implementations to permit backpropagation through them.
- F is generally kept fixed while R and/or T are trainable.
- they may be trained in an end-to-end manner with the rest of the pipeline, whereby the weight values are simply additional parameters, on top of the other parameters of the pipeline, that may be updated during back-propagation. More specifically, it has been found that end-to-end training requires no special auxiliary loss terms to guarantee stability during training.
- the F, R and T maps are "plug-and-play" onto the rest of the AI compression pipeline during training and subsequent inference.
- student-teacher training of the weights of the R and T maps (and optionally F) is also effective and achieves good training stability out of the box without difficulty.
- Multiple approaches to student-teacher training are envisaged.
- the teacher may be set up to push the training towards a fine-grained level that represents similar features to classical cost volumes, or towards a less granular level where the teacher comprises a flow network with MAD cost volumes and the student comprises a flow network with RKADe cost volumes, with the loss based on the difference between these two.
- the teacher may be set up to push the training towards some other objective and may be set up accordingly.
- the weights of F, R and/or T may be frozen at different times during these stages by setting the "requires_grad_" flag appropriately at different training steps.
- R and T may be frozen with initialisation weights during the initial pre-training phase of the compression pipeline before being unfrozen and trained during the main training phase. This approach ensures that the weights of R and T are being updated based on a stronger training signal from the rest of the neural networks of the compression pipeline in order to decrease overall convergence time, thereby speeding up training.
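As an illustration of such a schedule, assuming PyTorch modules named rkade.F, rkade.R and rkade.T (the names used in the earlier sketch) and an arbitrary step threshold, freezing and unfreezing could be driven by the training step counter:

```python
# Hypothetical freeze/unfreeze switch for the RKADe maps; module names and the
# step threshold are assumptions, not values taken from the description.
def set_rkade_trainable(rkade, step, unfreeze_step=50_000):
    trainable = step >= unfreeze_step               # frozen during pre-training
    for module in (rkade.R, rkade.T):
        for p in module.parameters():
            p.requires_grad_(trainable)
    for p in rkade.F.parameters():                   # F is generally kept fixed
        p.requires_grad_(False)
```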
- the initialisation weights of F, R and/or T may be random, based on some predetermined distribution, or based on prior knowledge obtained from experimentation to provide a warm-started signal from the rest of the model at the point when one or more of F, R and/or T unfreeze and become trainable.
- the initialisation weights may be initialised using any appropriately normalised sub-gaussian distribution producing a map that is an isometry in expectation (for example, a Gaussian distribution, a truncated Weibull distribution, or a uniform distribution).
- the property of being an isometry in expectation is maintained as the weights of F, R and/or T as applicable are adjusted.
- This property may be enforced using Jacobian regularisation, such as that described in concept 5 below.
- the isometry preserving property may only be retained upon initialisation, and training will gradually eliminate that property as the weights converge to some final values.
- some of the weights of F, R, and/or T may be kept fixed, while others may be trained, for example to avoid significant departure from the isometry preserving property during training.
- the weights of F, R and/or T may be initialised by generating randomly distributed values uniformly distributed between a minimum value and a maximum value, whereby the minimum and maximum value are based on a number of input or output channels on which F, R and/or T are applied, based on a kernel size of F, R and/or T and/or on a pixel radius across which the RKADe cost volume is to be calculated (e.g. a 3 pixel radius resulting in a comparison against a 7x7 pixel patch centered on a given pixel coordinate).
- the minimum and maximum values are based on the dimensions of the input tensor.
- a filter comprising weights distributed as above preserves norm in expectation of the input to which it is applied.
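One possible realisation of such an initialisation is sketched below, assuming the bounds are derived from the filter's fan-in (input channels times kernel area); the exact bound formula is not specified by the description, and other radius-dependent normalisations would fit it equally well.

```python
# Hedged sketch: uniform initialisation whose bounds depend on the filter shape,
# chosen so that per-output variance is preserved in expectation (a standard
# fan-in scaling). The specific bound formula is an assumption.
import math
import torch

def init_uniform_(weight):
    """weight: convolution filter of shape (c_out, c_in, k, k)."""
    c_out, c_in, k, _ = weight.shape
    fan_in = c_in * k * k
    bound = math.sqrt(3.0 / fan_in)   # Var(U[-b, b]) = b^2 / 3 = 1 / fan_in
    with torch.no_grad():
        weight.uniform_(-bound, bound)
```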
- Jacobian regularisation may be applied during training to ensure the isometry in expectation is preserved even as the weights are updated. Alternatively, the weights may be free to lose this property if the training results in them doing so.
- the training regime may be implemented with one or more switches that specify exactly when during training to freeze or unfreeze the trainable parameters of F, R and/or T based on some predetermined one or more conditions (such as a number of iterations or a loss threshold, or other conditions).
- naive patch-wise comparisons are slow on typical commercial hardware.
- the patch-wise comparisons introduce a bottleneck to run time. This is particularly problematic for technical domains where low-latency and fast run times are critical to functionality. Accordingly run-time advantages are realised for any use case that is presently implemented with patch-wise comparisons (using any measure of similarity), for example any use cases currently implemented as a MAD-based patch-wise comparison.
- RKADe is faster than naive patch-wise comparison operations, in any computer vision and/or image processing task because it is a compressive approximation of such a patch-wise comparison.
- a first example use case where run-time improvements may be realised with RKADe is in the generation of bounding boxes around image patches where movement is to be detected. Consider a first image at a first time and a second image temporally separated from the first image. The objects in the second image have moved relative to their positions in the first image. In computer vision tasks such as surveillance, satellite image comparisons, drone navigation, and others, detecting such movement is a common task as it facilitates tracking of objects across different views and across time.
- One approach to such detection is to generate a bounding box around objects whose pixels differ between the first and second images.
- One approach to generating such bounding boxes is to divide the first and second images into grids, and to compare the pixel values in the first image to individual pixels or groups of pixels in the second image.
- One way to make this comparison is to use MAD implemented as convolutions. If the MAD for a given pixel or patch of pixels exceeds a threshold, that pixel or patch of pixels may be identified as a movement-containing patch (whereas any that don’t exceed the threshold may be identified as static patches). The boundaries of one or more bounding boxes may then be generated that encompass some or all of these movement-containing patches and used to identify the moving object across frames.
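A minimal sketch of the grid-based movement detection described above, written in Python/NumPy; the patch size, threshold and the single-box output are illustrative assumptions.

```python
# Hedged sketch: patches whose mean absolute difference (MAD) exceeds a
# threshold are marked as movement-containing, and one bounding box that
# encompasses all of them is returned.
import numpy as np

def movement_bounding_box(img_a, img_b, patch=16, threshold=8.0):
    """img_a, img_b: greyscale arrays of shape (H, W). Returns (y0, x0, y1, x1) or None."""
    h, w = img_a.shape
    moving = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            a = img_a[y:y + patch, x:x + patch].astype(np.float32)
            b = img_b[y:y + patch, x:x + patch].astype(np.float32)
            if np.abs(a - b).mean() > threshold:      # MAD over the patch
                moving.append((y, x))
    if not moving:
        return None                                    # all patches are static
    ys, xs = zip(*moving)
    return min(ys), min(xs), max(ys) + patch, max(xs) + patch
```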
- the bounding box generation approach may also be used in image and video compression pipelines including both traditional and AI-based compression pipelines. In this case, it may be used to facilitate partial frame-skipping to further reduce the amount of data that needs to be sent to reconstruct an image or image sequence. For example, if objects in two images have hardly changed except for a small number of pixels or pixel patches, the above-described bounding box generation approach may be used to identify and extract those movement-containing patches. Only these movement-containing patches then need to be compressed and transmitted in order for the full image sequence to be accurately reconstructed. That is, on the decode side, the original image sequence can be constructed efficiently by stitching together previously decoded static image patches with the newly received movement-containing patches.
- applying RKADe to estimate differences between pixel patches instead of a naive patch-wise comparison facilitates a substantial run-time improvement.
- Other non-limiting use cases where applying RKADe instead of naive patch-wise comparison results in run time improvement include: Image Registration: In medical imaging or remote sensing, images taken at different times or from different sensors need to be aligned or "registered" to each other.
- the alignment or registering of images to each other with naive patch-wise comparison (for example where MAD is used as a similarity metric to align these images accurately by finding the transformation that minimizes the average absolute intensity differences between them) can be replaced by the present RKADe approach for run-time improvements.
- Stereo Vision and Depth Estimation When calculating depth from stereo images, naive patch-wise comparisons using MAD are typically used to compare corresponding patches in the left and right images. The disparity (difference in horizontal position) that minimizes the MAD is often chosen as the correct match, which is then used to compute depth information. Here, the replacement of the naive, MAD-based patch-wise comparison results in run-time improvements in calculating depth.
- Template Matching In object detection and computer vision, template matching involves sliding a template image over a target image to find the region that best matches the template. A naive, patch-wise comparison, MAD-based is typically used as a measure to find the location where the template and the target image have the least absolute difference, indicating a potential match.
- Noise Reduction In image denoising, MAD can be used to compare the local neighborhood of pixels. Filters like the median filter or adaptive filters use MAD to determine the level of noise in a local patch and to adjust the filtering strength accordingly to reduce noise while preserving details. This initial determination of noise levels in a local patch can be achieved faster by applying the RKADe approach.
- Quality Assessment For quality control in manufacturing, naive patch-wise comparisons are used to compare images of a product against a standard reference image. Differences beyond a certain threshold can indicate defects or deviations from the desired quality. Again, this may instead be implemented with RKADe to provide run time speed ups and the facilitation of running quality control algorithms on edge devices that do not have significant computing power.
- Change Detection In satellite imagery analysis, naive patch-wise comparison can be used to detect changes over time by comparing pixel intensities of the same location across different dates. This is useful in monitoring urban development, deforestation, or the effects of natural disasters.
- the use of RKADe instead of a naive patch-wise comparison facilitates the running of such change detection algorithms more quickly, thus allowing real-time change detection on resource-constrained devices.
- Photogrammetry In reconstructing 3D models from 2D images, naive patch-wise comparisons can be used to ensure that the matching of pixels across multiple images is accurate, which is crucial for generating a reliable 3D representation.
- Some non-limiting factors that influence training stability include the choice of the optimization algorithm (e.g. stochastic gradient descent, momentum, adagrad, adam, and so on), the learning rate, the initialization of network weights, the network architecture, the quality and pre-processing of the input data and so on.
- a Jacobian regularisation term or penalty in the context of training neural networks is a method used to control or influence the behavior of the model by regularizing its sensitivity to input changes.
- the Jacobian matrix represents the partial derivatives of the model’s outputs with respect to its inputs, effectively capturing how changes in the input affect changes in the output.
- these partial derivatives are the network’s Jacobian matrix.
- a norm of the Jacobian matrix is calculated and added to the loss function.
- Present concept 5 is directed to solving this problem by introducing a special type of Jacobian penalty term into the loss function.
- let n ≥ 1 be an integer, S^(n-1) denote the (n-1)-sphere, and [n] denote the ordered set {1, 2, ..., n}.
- let f = (f^(1), ..., f^(m)) : R^n → R^m be an almost-everywhere continuously differentiable function whose domain X ⊂ R^n is compact.
- x_t may be a frame at time t
- x_{t-1} may be a frame at time t-1
- x̂_{t-1} may be a previously reconstructed frame from time t-1
- f may be the function corresponding to the neural networks of the AI-based compression pipeline that we are training.
- the training objective becomes the task of learning weights that produce a fixed point set for mappings that act on low-motion input frames, effectively producing a network that acts like an identity operator on a previously decoded frame when the current frame is substantially the same as or similar to the previously decoded frame.
- the desired fixed point manifold can be well approximated by a deep-learning based temporal sequence model permitting stable long-term temporal dependence modelling with recurrent implementations.
- an L-Lipschitz function f that is differentiable at a point x_0 satisfies, for a unit direction vector u: |∂_u f^(i)(x_0)| ≤ L.
- it is assumed that the domain X is compact and that f is everywhere continuous and almost everywhere continuously differentiable on X.
- Our goal is to encourage the residual decoder to learn weights that perfectly reconstruct a previously decoded frame when the current frame is substantially identical (i.e. no motion between the frames of the sequence).
- the residual decoder of the actual pipeline receives as input a latent tensor and an optical flow information tensor (e.g. in the form of a warped, previously decoded image), and produces as output the reconstructed frame.
- the input space does not match the output space (i.e. the number of variables, the forms, shapes and dimensions of the inputs and outputs are different because the output is only the reconstructed frame and not the latent tensor).
- if we calculate a Jacobian penalty based on this (i.e. based on or approximated from a matrix of partial derivatives of how the output changes with respect to changes in the input), and add this Jacobian penalty to the loss function, the weights will not generally converge to values that produce our goal of a residual decoder that perfectly reconstructs the previously decoded frame when the current frame is substantially identical.
- instead, an auxiliary function is constructed that: (i) is based on the residual decoder, i.e. a function that operates identically to the residual decoder during a forward pass and accordingly similarly receives as inputs a latent tensor and an optical flow information tensor in the form of a warped input image, but (ii) not only returns the reconstructed image, but also returns as output the original input latent tensor.
- the input space of this auxiliary function comprises the latent tensor and the warped input image, and the output space comprises the latent tensor and the reconstructed image, so the input and output spaces match.
- When we subsequently calculate the Jacobian penalty from this function, the latent tensor will appear as both a variable considered an input and a variable considered an output when the partial derivatives of the Jacobian matrix are being estimated or approximated. In layman's terms, the effect of this is the moderation of significant changes to the latent tensor as sequences progress. In more formal terms, the above-described mathematical relationships apply and the weights converge to values during training that, during inference, exhibit the behaviour of perfectly reconstructing a previously decoded frame when the current frame is substantially identical.
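A minimal sketch, assuming PyTorch, of such an auxiliary mapping: it runs the same forward pass as the residual decoder but additionally returns the input latent, so that the input space (latent, warped frame) matches the output space (latent, reconstructed frame). The function and argument names are assumptions.

```python
# Hypothetical auxiliary mapping around the residual decoder.
import torch

def residual_decoder_aux(residual_decoder, latent, warped_prev):
    """residual_decoder: the pipeline's residual decoder network (assumed interface)."""
    reconstruction = residual_decoder(latent, warped_prev)
    # The latent is passed straight through as an additional output of the mapping.
    return latent, reconstruction
```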
- Algorithm 1: AI-based compression network training with residual decoder auxiliary mapping Jacobian penalty. Inputs: training dataset X, learning rate η, regularization parameter λ, number of epochs E, network architecture f_θ.
- Compute loss: Loss ← D + R + λ·J_aux; backward pass to compute gradients ∇_θ Loss.
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J_aux(y, x̂_{t-1} ↦ y, x̂_t) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss can be calculated and the training loop is repeated until the predetermined number of steps or epochs E have been completed, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- the Jacobian penalty J_aux(y, x̂_{t-1} ↦ y, x̂_t) is based on an auxiliary mapping function that includes the latent tensor y on both sides of the mapping so that the input space matches the output space, thus resulting in convergence to a set of weights that exhibit the desired temporal stability for frame sequences.
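An illustrative training step following the loop described above is sketched below. The model interface, the distortion and rate stand-ins and the lambda value are assumptions; residual_decoder_aux is the auxiliary mapping sketched earlier, and the jacobian_penalty helper is sketched further below in the finite-difference discussion.

```python
# Hedged sketch of one training step with the auxiliary-mapping Jacobian penalty.
import torch
import torch.nn.functional as F

def train_step(model, optimiser, x_t, x_prev_warped, lam=0.1):
    latent, x_hat = model(x_t, x_prev_warped)            # forward pass of the pipeline (assumed interface)
    D = F.mse_loss(x_hat, x_t)                            # distortion term (placeholder metric)
    R = latent.abs().mean()                               # stand-in for the true rate term
    aux = lambda z, w: residual_decoder_aux(model.residual_decoder, z, w)
    J_aux = jacobian_penalty(aux, latent, x_prev_warped)  # auxiliary-mapping Jacobian penalty
    loss = D + R + lam * J_aux
    optimiser.zero_grad()
    loss.backward()                                       # backward pass
    optimiser.step()                                      # e.g. SGD or Adam
    return loss.detach()
```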
- while the Jacobian penalty term is shown to be based on the mapping (y, x̂_{t-1} ↦ y, x̂_t), it is envisaged that each of these variables may be processed in some way without losing the general applicability of the above-described mathematical property of the input space matching the output space.
- for example, where the previously reconstructed image is warped, the mapping may be based on this warped image, e.g. x̂_{t-1,warped}, and so on.
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J(a, b ↦ a, c), where a, b and c are pipeline variables as discussed below, is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty.
- some or all of the terms may be regularised by some constant λ_1 and/or λ_2.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss may then optionally be calculated and the training loop is repeated until the predetermined number of steps or epochs E have been completed, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- a, b and c can be any variables or sets of variables from an AI-based compression pipeline including, but not limited to, one or more of: a current input image x_t, a reference input image x_{t-1} or x_{t+1}, a reference previously reconstructed image (warped or unwarped) x̂_{t-1}, x̂_t or x̂_{t+1}, x̂_{t-1,warped}, x̂_{t,warped} or x̂_{t+1,warped}, a flow f_t, and/or any latent representation thereof, including using any of these variables in any combination in upsampled and downsampled spaces as applicable.
- an auxiliary mapping from which to estimate a Jacobian penalty may be constructed by using, for example: J(ŷ_flow, x̂_{t-1} ↦ ŷ_flow, x̂_t). That is, the reconstructed flow latent is provided on both sides of the auxiliary mapping from which the Jacobian penalty is calculated.
- the network weights converge to a set of values where flow is encouraged to be reconstructed in a way that varies little across a sequence of frames. Effectively the networks are learning the set of weights that have a fixed point for the desired auxiliary mapping behaviour.
- a similar approach can be taken for any and all of the example variables referred to above to encourage the networks of the AI-based compression pipeline to learn how to behave in a temporally consistent way for frame sequences. It is also envisaged that multiple different Jacobian penalties based on such mappings may be introduced into the loss term, for example J_1, J_2, ..., J_k as applicable, to encourage the learning of temporally consistent behaviour by any number of components of the AI-based compression pipeline.
- any weighted norm of any such Jacobian penalty or penalties may be incorporated into the AI-based compression pipeline.
- a weighted norm of the Jacobian multiplies the components of the matrix using weights, computed in some manner relevant to the task at hand, such that the resulting "weight and norm" operation still satisfies the mathematical definition of being a (quasi) norm.
- weights may be computed for example as a function of the amount of motion information that is present in the image; or according to metrics that define the presence of occlusion between two frames.
- the motion information may be captured indirectly in the flow information estimated by the flow module of the pipeline, or by some other measure such as, for example, a direct pixel difference calculation between the pixels of two or more images.
- the above-described approach can increase network training times significantly, for example by 30% or more.
- the presently described Jacobian penalties may be estimated using the following approach. It will be understood that the Frobenius norm of a square matrix A ∈ R^(n×n) also controls its operator norm.
- for a direction v and a small step size ε, define the finite difference (f(x + εv) − f(x)) / ε and observe that this provides a one-sample estimate of the Jacobian-vector product of a function f.
- the Jacobian penalty may be calculated by estimating the partial derivatives of the network’s outputs with respect to its inputs, which together form a Jacobian matrix.
- the penalty may be estimated by calculating the norm of this matrix (or the trace of a related matrix - as is known in the art).
- calculating the norm (or trace, as applicable) of the Jacobian matrix is computationally very expensive. Instead we use an approach based on Hutchinson's trace estimator and finite differences. In the specific case of AI-based compression pipelines, it turns out that we can get a good trace estimate by making a 1-sample approximation.
- the 1-sample approximation using finite differences facilitates the estimation of a Jacobian penalty in a way that is significantly faster than traditional methods and facilitates training with multiple Jacobian penalties without significantly increasing training times. That is, the time attributed to estimating the Jacobian penalty is an insignificant fraction of the overall training time. This in turn enables training with Jacobian penalties applied to multiple components of the pipeline to encourage temporally stable behaviour while keeping overall training times substantially the same thereby reducing overall cost per training run.
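A hedged sketch of this 1-sample finite-difference estimate: a random unit direction is drawn, the Jacobian-vector product of the auxiliary mapping is approximated by a finite difference, and its squared norm is used as a Hutchinson-style estimate of a squared norm of the Jacobian (up to a constant scale factor). The epsilon value and the aux_fn interface are assumptions.

```python
# Hypothetical 1-sample finite-difference Jacobian penalty.
import torch

def jacobian_penalty(aux_fn, latent, warped_prev, eps=1e-2):
    """aux_fn(latent, warped) returns a tuple of output tensors, e.g. (latent, reconstruction)."""
    # Random direction over the flattened input space, normalised to unit length.
    v_lat, v_warp = torch.randn_like(latent), torch.randn_like(warped_prev)
    norm = torch.sqrt(v_lat.pow(2).sum() + v_warp.pow(2).sum())
    v_lat, v_warp = v_lat / norm, v_warp / norm
    # Finite-difference approximation of the Jacobian-vector product J v.
    out0 = torch.cat([t.flatten() for t in aux_fn(latent, warped_prev)])
    out1 = torch.cat([t.flatten() for t in aux_fn(latent + eps * v_lat,
                                                  warped_prev + eps * v_warp)])
    jvp = (out1 - out0) / eps
    return jvp.pow(2).sum()     # single-sample estimate added to the loss
```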
- the Jacobian penalty described above can itself be regularised based on a property of the frames of the sequence being trained on. That is, the Jacobian penalty may be made "motion-aware" by weighting it according to a property of the frames of the sequence being trained on, such as how much movement there is between frames. This movement may be captured indirectly in the Jacobian matrix from which the Jacobian penalty is calculated, and accordingly the penalty may comprise a weighted norm w(J_aux), where the norm may comprise e.g. a Frobenius norm, a spectral norm, or any other norm.
- the weighting may scale the Jacobian penalty based on the combined strength of all the partial derivatives it contains, which will be higher in high motion frame sequences.
- the movement may be captured directly and the mapping that encodes the weighting w of the Jacobian may be based on e.g. pixel differences, MSE, or some other measure of differences between the frames at time t and some other time t-1.
- the idea is that if the motion between two frames x_t and x_{t-1} is large, then we want the Jacobian penalty described above to be dampened to a lower value.
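A minimal sketch of this motion-aware damping: the penalty is scaled down when the direct pixel difference (here MSE) between frames is large. The 1/(1 + alpha·mse) form and the alpha value are assumptions; any monotonically decreasing weighting would fit the description.

```python
# Hypothetical motion-aware weighting for the Jacobian penalty.
import torch

def motion_weight(x_t, x_prev, alpha=10.0):
    mse = (x_t - x_prev).float().pow(2).mean()
    return 1.0 / (1.0 + alpha * mse)          # large motion -> small weight

# Usage sketch: loss = D + R + lam * motion_weight(x_t, x_prev) * J_aux
```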
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty that in this case is weighted by w.
- the loss term may additionally be regularised using e.g. some constant λ_1 and/or λ_2 as described above.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss may be calculated and the training loop is repeated until the predetermined number of steps or epochs ⁇ have been calculated, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- the Jacobian penalty here is weighted by a function w, where w is defined to act element-wise on the object returned by a Jacobian-vector product between the Jacobian matrix and some random vector (in practice a tensor) such that the weighted norm induced by w satisfies the mathematical definitions of a quasi-norm or pseudo-norm.
- a further difficulty that arises when training using the above-described Jacobian penalty is its interaction with different phases of training and different training schedules, for example the specific number of frames in a given sequence or "group of pictures" (GOP) used during training.
- a GOP is typically considered to be an I-frame and a number of P- or B-frames.
- a training schedule that starts training on short GOPs of 5-6 frames, and then after a predetermined number of steps switches to training on longer GOPs of 7 or more frames e.g. 8, 9, 10, 20, 30, 40, 50 frames or more.
- the Jacobian penalty described herein is ideally suited for encouraging the networks to be performant on said longer GOPs but can in some cases act as noise when performing the initial training on the shorter GOPs. This may be, for example, because the temporal consistency is less of a problem for shorter GOPs and so the Jacobian penalty actually weakens the strength of the training signal.
- Jacobian regularisation serves as an inductive bias to improve the network’s generalisation performance to sequences with GOP sizes that are significantly greater than that seen in training. Indeed, if the network is only evaluated on sequences with the same GOP seen in training, for example short GOP sequences, then temporal stability can be less problematic. However, it follows that, for temporal stability arising solely from training on different GOP sizes to be present in the wild, the training data necessary to achieve this would have to contain equal numbers of samples of all GOP sizes distributed equally across all video sequences and so on - something which is burdensome to obtain in real world settings. The presently described approach accordingly facilitates obtaining the same effect but without needing as complete a training data set.
- the present disclosure facilitates the generalisation of performance in the sense of temporal stability for large GOP sizes, specifically those that are significantly larger than what is seen during training, irrespective of what GOP sizes are in the training data.
- the presently described Jacobian penalty term may be introduced only at a predetermined time or times (e.g. in terms of number of training steps or consequential to changing training frame sequence length) during a training schedule.
- FIG 14 illustrates a further example of a flow-residual compression pipeline, such as that of Figure 3, whereby the representation of flow information that the residual decoder receives as input comprises a warped previously decoded image.
- this architecture corresponds generally to the flow-residual compression pipeline shown in Figure 3 and accordingly uses the same reference numbers for corresponding features.
- the flow module 1400 is shown to comprise a flow encoder 1401 that produces a latent representation of optical flow information y_flow which is decoded by a flow decoder 1402 into a reconstructed flow f̂_t, which in turn is used to warp a previously reconstructed image x̂_{t-1} to produce x̂_{t-1,warped}, which in turn is fed into the residual decoder 1413 as the representation of optical flow information.
- the residual decoder neural network 1413 uses a latent representation y of the current frame x_t, and the warped previously reconstructed image x̂_{t-1,warped}, to produce the reconstructed current frame x̂_t.
- the Jacobian penalty term described above may be implemented by constructing an auxiliary function that copies the operations of the residual encoder 1411 and/or residual decoder 1413 of the residual part 1410, including using the same inputs as the residual decoder 1413, but the auxiliary function also returns as an output the latent representation y that was one of its inputs, effectively simply passing the latent representation y 1412 directly through the function.
- the Jacobian penalty based on this auxiliary function with the latent being both an input and an output has the desired mathematical properties to encourage convergence to a set of weights that produce a network that behaves in a temporally stable manner for sequences of frames.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the computer storage medium is not, however, a propagated signal.
- data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program include, by way of example, general purpose or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices, that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image.
Description
Method and data processing system for lossy image or video encoding, transmission and decoding BACKGROUND This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding. There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted. To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that
are not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files. In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations. A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest are inferred from information derived from temporal correlations. In the realm of lossy video compression in particular, the redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality. I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video. P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality. B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames
encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted. An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference. An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference.
A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. Figure 3 of which shows an architecture that calculates optical flow with a flow model, UFlow, and encodes the calculated optical flow with a flow encoder, Eflow. SUMMARY According to an aspect of the present disclosure, there is provided a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and
repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the method described above may comprise a third neural network for upsampling, and wherein the method may include updating the parameters of said third neural network based on the evaluated function. Optionally, the method described above may comprise a downsampler configured for either bilinear or bicubic downsampling. Optionally, the method described above may comprise a Gaussian blur filter in the downsampler. Optionally, the method according to any one described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. Optionally, the method described above may comprise a fourth neural network in the downsampler, and may further include updating the parameters of said fourth neural network based on the evaluated function.
Optionally, the method described above may comprise an upsampler configured for either bilinear or bicubic upsampling. Optionally, the method as described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. Optionally, the method described above may comprise entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length. Optionally, the method described above may comprise determining the difference between one or more of the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image based on the output of a fifth neural network acting as a discriminator. Optionally, the method of any one described above may comprise calculating the difference between one or more of: the output image and the
downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image. The difference may be expressed in terms of a mean squared error (MSE) and/or a structural similarity index measure (SSIM). Optionally, the method described above may comprise a term defining a visual perceptual metric. Optionally, the method described above may comprise a visual perceptual metric, wherein the term defining the metric comprises an MS-SSIM metric. According to an aspect of the present disclosure, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image, upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network, evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image, updating the parameters of the third neural network based on the evaluated function, and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
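By way of illustration only, the following is a minimal sketch of how training of a learned upsampler of the kind described in the preceding aspect might look, assuming a PyTorch implementation and placeholder architectures for the first, second and third neural networks; the placeholder modules, tensor sizes and the 2x upsampling factor are assumptions made for the sketch and are not features of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder first (encoder), second (decoder) and third (upsampler) networks.
encoder = nn.Sequential(nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
                        nn.Conv2d(64, 8, 5, stride=2, padding=2))
decoder = nn.Sequential(nn.ConvTranspose2d(8, 64, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))
upsampler = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                          nn.Conv2d(3, 3, 3, padding=1))

# Freeze the first and second networks so that only the upsampler is trained.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad = False

optimiser = torch.optim.Adam(upsampler.parameters(), lr=1e-4)
training_images = [torch.rand(1, 3, 128, 128) for _ in range(4)]  # stand-in training set

for x in training_images:                 # first set of input images
    y = encoder(x)                        # latent representation
    x_hat = decoder(y)                    # output image, an approximation of x
    x_up = upsampler(x_hat)               # upsampled output image
    # Evaluate a function based on a difference involving the upsampled output;
    # here the target is simply a 2x-interpolated copy of the input image.
    target = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    loss = F.mse_loss(x_up, target)
    optimiser.zero_grad()
    loss.backward()                       # gradients reach only the third (upsampler) network
    optimiser.step()
```

In this sketch only the parameters of the upsampler receive gradient updates, which corresponds to the optional staged training in which the first and second neural networks are frozen after an initial number of steps.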
According to an aspect of the present disclosure, there is provided a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image approximating the input image. Further steps involve evaluating a function based on differences between one or more of the images described above and updating the parameters of the fourth neural network based on the evaluated function. This process is repeated using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the method described above may comprise producing the downsampled input image by performing bilinear or bicubic downsampling on the input image. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding. The method comprises the steps of receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
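For orientation, the encode/decode method of the preceding aspect might be sketched as follows, with quantisation, entropy coding and transmission of the bitstream omitted; the bilinear resampling, the 2x factor and the passed-in encoder and decoder modules are assumptions made for the sketch, not requirements of the disclosure.

```python
import torch.nn.functional as F

def compress_decompress(x, encoder, decoder, factor=2):
    """Sketch of: downsample -> encode -> (transmit) -> decode -> upsample."""
    x_ds = F.interpolate(x, scale_factor=1.0 / factor, mode="bilinear")  # downsampler
    y = encoder(x_ds)          # latent representation (first trained neural network)
    # ... quantisation, entropy encoding, transmission and entropy decoding omitted ...
    x_hat = decoder(y)         # output image (second trained neural network)
    return F.interpolate(x_hat, scale_factor=factor, mode="bilinear")    # upsampler
```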
According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation. Optionally, the method described above may comprise performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. The method as described above may optionally comprise performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. The method as described above may optionally comprise a downsampling operation that includes applying one or more convolutional layers with a kernel size based on a downsampling factor. These convolutional layers are configured to sequentially reduce the spatial dimensions of an input while increasing the depth or channel dimension of the input.
Optionally, in the method described above the input may comprise the input image. Optionally, the input may comprise a tensor representation of the input image. Optionally, the method described above may comprise a downsampling operation performed by applying one or more convolutional layers configured with a stride equal to the downsampling factor. The number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor. Optionally, the method may include performing a first convolution and a second deconvolution. This may further involve performing additional upsampling steps and utilizing additional layers such as max-pooling (Maxpool) and ReLU layers. Optionally, the method described above may comprise the input including a latent representation. Optionally, the method may comprise a tensor representation of the latent representation or the output image as part of its input. Optionally, the method as described may include upsampling layers having strides determined by an upsampling factor. Optionally, the method described above may further comprise applying an activation function after each convolutional layer in the upsampling operation.
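As an illustrative sketch only (in PyTorch, with the particular relationship between the downsampling factor, kernel size and filter count chosen for the example rather than mandated by the disclosure), convolution-only resampling of this kind, with no space-to-depth or depth-to-space rearrangement, might be built as follows.

```python
import torch
import torch.nn as nn

def conv_downsampler(in_channels: int, factor: int) -> nn.Module:
    # Strided convolution: spatial size is divided by the factor while the
    # channel (depth) dimension grows; kernel size and stride equal the factor.
    out_channels = in_channels * factor * factor
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),  # activation after the convolutional layer
    )

def conv_upsampler(in_channels: int, factor: int) -> nn.Module:
    # Transposed convolution with stride equal to the upsampling factor.
    out_channels = max(in_channels // (factor * factor), 1)
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),
    )

x = torch.rand(1, 3, 64, 64)
y = conv_downsampler(3, 2)(x)   # -> shape (1, 12, 32, 32)
z = conv_upsampler(12, 2)(y)    # -> shape (1, 3, 64, 64)
```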
Optionally, the method described above may include the upsampling layers being selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, alternated with the convolutional layers. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding, comprising the steps of receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image. The estimating of optical flow information further comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. Optionally, the method comprises estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation.
Optionally, the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image. Optionally, a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image. Optionally, a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image comprise sparse distributions in a spatial domain. Optionally, the method described above may comprise assigning weights with values distributed according to a sub-Gaussian distribution. Optionally, the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and/or pixel radius across which said differences are estimated. Optionally, the method described above may comprise performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation.
Optionally, the method described above may comprise estimating a difference between an output of the second convolution operation and an output of the first convolution operation. Optionally, the difference may comprise an absolute difference. Optionally, the difference defines a cost volume. Optionally, the method described above may comprise using the optical flow information to warp a representation of the second image. Optionally, the method may involve estimating a difference between the warped second image and the input image in order to create a residual representation of the input image relative to the warped second image. Optionally, the method described above may comprise: (i) using a third neural network to encode the residual representation of the input image; (ii) transmitting the encoded residual representation of the input image to the second computer system; (iii) using a fourth neural network to decode the residual representation of the input image; and (iv) using the decoded residual representation of the input image to produce said output image. Optionally, the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
Optionally, a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. Optionally, the first convolution operation is defined by a 1x1 kernel. Optionally, the second convolution operation is defined by a 3x3 kernel. Optionally, the third convolution operation is defined by a 1x1 kernel. Optionally, the method described above may comprise performing the second convolution operation to entangle information associated with respective pixels of the representation of the input image with information associated with pixels adjacent to corresponding pixels in the representation of the second image. Optionally, the method described above may comprise a first, a second and, where present, a third convolution operation, wherein these operations are performed without group convolutions. Optionally, one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. Optionally, a distribution of pixel values of the input image and of the second image are sparse and incoherent in a spatial domain and/or a transform of a spatial domain.
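Purely as an illustration of how such a sequence of convolutions could be wired together (the exact arrangement, channel counts and weight-initialisation bound below are assumptions made for the sketch, not the arrangement required by the disclosure), a compressively encoded cost volume might be formed as follows.

```python
import torch
import torch.nn as nn

def random_projection_conv(in_ch: int, out_ch: int, radius: int) -> nn.Conv2d:
    # First convolution: 1x1 kernel, weights drawn uniformly between a minimum
    # and a maximum value. Deriving the bound from the channel count and the
    # pixel radius is one plausible choice, not necessarily the scaling used
    # in the method described above.
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
    bound = 1.0 / (in_ch * (2 * radius + 1)) ** 0.5
    nn.init.uniform_(conv.weight, -bound, bound)
    return conv

# No group convolutions are used in any of the three operations.
proj = random_projection_conv(in_ch=3, out_ch=16, radius=3)         # first convolution (1x1)
entangle = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False)  # second convolution (3x3)
mix = nn.Conv2d(16, 16, kernel_size=1, bias=False)                  # optional third convolution (1x1)

x_in = torch.rand(1, 3, 64, 64)    # representation of the input image
x_ref = torch.rand(1, 3, 64, 64)   # representation of the second image

f_in = proj(x_in)                         # compressive projection of the input image
f_ref = entangle(proj(x_ref))             # neighbouring pixels of the second image entangled
cost_volume = mix((f_in - f_ref).abs())   # absolute difference forms the cost volume
```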
According to an aspect of the present disclosure, there is provided a system configured to perform any of the above methods. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission. The method includes receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system. In this method, estimating the optical flow information comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. According to an aspect of the present disclosure, there is provided a method for lossy image or video decoding, comprising receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image based on a compressively encoded cost volume; receiving optical flow information at a second computer system, wherein the optical flow information is indicative of a difference between a
representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, which approximates the input image. According to an aspect of the present disclosure, there is provided an apparatus configured to perform any of the above methods. According to an aspect of the present disclosure, there is provided a method for estimating a difference between a first image and a second image. The method comprises performing a first convolution operation on respective pixels of a representation of the first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by estimating a compressively encoded cost volume indicative of said differences. Optionally, the method may include performing a second convolution operation on an output of the first convolution operation, and estimating a difference between an output of the second convolution operation and an output of the first convolution operation. Optionally, the method described above may comprise performing a second convolution operation that entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent to corresponding pixels in the representation of the second image.
Optionally, the method described above may include a first convolution operation where one or more filters are applied with weights having values randomly distributed between a minimum value and a maximum value. Optionally, the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and pixel radius across which said differences are estimated. Optionally, the method described above may comprise a difference comprising an absolute difference. Optionally, the method described above may comprise defining a cost volume based on the difference. Optionally, the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. Optionally, the method described above may involve adjusting a kernel size of the second convolution operation to be larger than that of the first. Optionally, the method described above may comprise a first convolution operation defined by a 1x1 kernel.
Optionally, the method described above may include the step whereby the second convolution operation is defined by a 3x3 kernel. Optionally, the method described above may comprise the third convolution operation defined by a 1x1 kernel. Optionally, the method described above may comprise storing a plurality of respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. Optionally, the method described above may comprise using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches. According to an aspect of the present disclosure, there is provided a data processing apparatus configured to perform any of the above described methods. According to an aspect of the present disclosure, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the
second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and/or third trained neural networks. Optionally, the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks. Optionally, at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable. Optionally, the method comprises producing the second variables from the first variables by mapping the first variables to the second variables. Optionally, the mapping is defined by an auxiliary function.
Optionally, the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function. Optionally, at least one input of said inputs to the auxiliary function is also an output of the auxiliary function. Optionally, the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space. Optionally, the input space matches the output space. Optionally, the auxiliary function is based on the third neural network. Optionally, the third neural network comprises a residual decoder neural network. Optionally, the at least one input to the auxiliary function that is also an output of the auxiliary function comprises said latent representation of the first image. Optionally, the method comprises weighting the Jacobian penalty term. Optionally, said weighting is based on a difference between the first image and the second image.
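As context for the norm-approximation options discussed below, a Jacobian penalty of this kind is often estimated stochastically rather than by forming the full Jacobian. The following is a minimal sketch (in PyTorch) of a single-sample estimate of the squared Frobenius norm of the Jacobian of an auxiliary function; the function name, the Gaussian probe distribution and the choice of the squared Frobenius norm are assumptions made for the sketch, not details taken from the disclosure.

```python
import torch

def jacobian_penalty(aux_fn, x):
    # Single-sample (Hutchinson-style) estimate of the squared Frobenius norm of
    # the Jacobian of aux_fn at x: E[||J^T v||^2] = ||J||_F^2 for v ~ N(0, I).
    x = x.detach().requires_grad_(True)
    y = aux_fn(x)
    v = torch.randn_like(y)                       # single random probe vector
    (jtv,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    return (jtv ** 2).sum()

# Illustrative use: torch.tanh stands in for the auxiliary function, and the
# penalty would typically be weighted into an existing training loss.
x = torch.rand(2, 8)
penalty = jacobian_penalty(torch.tanh, x)
```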
Optionally, said weighting is defined by a weighted norm based on a matrix associated with said rate of change. Optionally, the method comprises estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change. Optionally, approximating the norm of the matrix comprises making a single sample approximation. Optionally, the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps. Optionally, said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system;
with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second
image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video decoding, the method comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a data processing apparatus configured to perform any of the above methods.
According to an aspect of the present disclosure, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the above methods. According to an aspect of the present disclosure, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of the above methods. BRIEF DESCRIPTION OF THE DRAWINGS Aspects of the invention will now be described by way of examples, with reference to the following figures in which: Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network. Figure 3 illustrates an example of a video compression, transmission and decompression pipeline. Figure 4 illustrates an example of a video compression, transmission and decompression system.
Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 6 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 7 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 8 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 9 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline. Figure 10 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline. Figure 11 illustrates an example of how optical flow information may be calculated between two images. Figure 12a illustrates steps of a MAD cost volume calculation. Figure 12b illustrates steps of a MAD cost volume calculation.
Figure 12c illustrates steps of a MAD cost volume calculation. Figure 13 illustrates steps of an RKADe cost volume calculation. Figure 14 illustrates an example of an image or video compression, transmission and decompression pipeline. DETAILED DESCRIPTION OF THE DRAWINGS Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the
number of channels of the image. Each H × W data point of the image represents a pixel value of the image at the corresponding location. Each channel of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video. The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network. Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an
increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation. AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network. Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer. Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix
defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network. To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of
the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function, where L is the loss and θ denotes the parameters of the network. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, where simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features. It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the rate-distortion loss function described herein. The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the
model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data. Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution. The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge. The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
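Purely as a schematic illustration of the epoch, batch-size and learning-rate concepts above (the model, data and numerical values are placeholders, not components of the disclosure), a training loop typically has the following shape.

```python
import torch

# Placeholder model, data and loss; only the loop structure is of interest here.
model = torch.nn.Linear(8, 8)
dataset = torch.utils.data.TensorDataset(torch.rand(256, 8))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)   # batch size
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)                    # learning rate

for epoch in range(10):                  # one epoch = one full pass over the training data
    for (batch,) in loader:              # one parameter update per batch
        loss = ((model(batch) - batch) ** 2).mean()
        optimiser.zero_grad()
        loss.backward()                  # back-propagate gradients of the loss
        optimiser.step()                 # adjust learnable parameters by the learning rate
```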
It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend, for example, on a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately. In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ·R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network. In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example
at www.cs.albany.edu/ xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the ImageNet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/). An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function. In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be, for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network. In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120
produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder. Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close as possible to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
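To make the pipeline of Figure 1 and the rate-distortion trade-off concrete, the following is a heavily simplified sketch of one training step, assuming PyTorch, single-layer placeholder networks in place of the encoder 110 and decoder 120, a uniform-noise stand-in for quantisation (discussed further below), and a crude surrogate for the rate term in place of a learned entropy model; none of these placeholders reflect the actual architectures of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the encoder 110 and decoder 120; real architectures, entropy
# model and range/arithmetic coder are not reproduced here.
encoder = nn.Conv2d(3, 8, 5, stride=4, padding=2)
decoder = nn.ConvTranspose2d(8, 3, 4, stride=4)
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
lam = 0.01                                    # Lagrange multiplier weighting rate against distortion

x = torch.rand(4, 3, 64, 64)                  # batch of training images
y = encoder(x)                                # latent representation
y_q = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # additive-noise proxy for rounding (training only)
x_hat = decoder(y_q)                          # output image
rate = y_q.abs().mean()                       # crude surrogate for bits estimated by an entropy model
distortion = F.mse_loss(x_hat, x)             # distortion D, here the MSE
loss = distortion + lam * rate                # Loss = D + lambda * R
optimiser.zero_grad()
loss.backward()
optimiser.step()
```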
Something which is closely linked to the entropy encoding of the latent distribution and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term. In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned
which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving an optimal rate and distortion trade-off. Returning to the compression pipeline more generally, the systems described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline. As described above in the context of quantisation, the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder fh and a trained neural network 125 acting as a hyper-decoder gh. An example of such a
hyper-network is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised
by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above. In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to the trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150. Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150 or 155
is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised. To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between a ground truth input and a fake input. When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the
compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video. The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be
referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time. The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map. An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in x0 indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline. In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂t−1 from an earlier frame t−1. As described above, the previously reconstructed image x̂t−1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
At the encode side, previously reconstructed images may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed x_t and the previously reconstructed image from an earlier frame x̂_t−1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂_t−1 and x_t, which is indicative of inter-frame movement of pixels or pixel groups between x̂_t−1 and x_t. The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f̂. The reconstructed flow map f̂ is applied to the previously reconstructed image x̂_t−1 to generate a warped image x̂_t−1,t. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image x̂_t−1,t is a prediction of how the previously reconstructed image x̂_t−1 might have changed between frame positions t−1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs x_t and x̂_t−1.
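By way of non-limiting illustration, the warping step described above may be implemented with, for example, bilinear sampling. The sketch below assumes a PyTorch environment and a flow map expressed as per-pixel displacements in pixel units; the function name warp and the tensor layouts are illustrative assumptions rather than a prescribed implementation.

import torch
import torch.nn.functional as F

def warp(prev_recon, flow):
    # prev_recon: [B, C, H, W], the previously reconstructed frame (e.g. x̂_t−1).
    # flow: [B, 2, H, W], per-pixel displacements (dx, dy) in pixel units.
    b, _, h, w = prev_recon.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_recon.device)  # [2, H, W]
    coords = base.unsqueeze(0) + flow                                  # [B, 2, H, W]
    # Normalise coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                       # [B, H, W, 2]
    # Bilinear warp of the previous reconstruction towards frame t.
    return F.grid_sample(prev_recon, grid, mode="bilinear", align_corners=True)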
As with the I-frame, the reconstructed flow map f̂ and corresponding warped image x̂_t−1,t may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides. In the example of Figure 3, both the image being compressed x_t and the warped image x̂_t−1,t are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image x_t and the warped image x̂_t−1,t. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r̂ from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between x_t and the warped image x̂_t−1,t, and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function. Finally, on the decode side, the residual map r̂ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂_t which is a reconstruction of image x_t and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂_t can then be
used to process the next frame. That is, it can be used to compress, transmit and decompress x_t+1, and so on until an entire video stream or chunk of a video stream has been processed. Thus, for a block of video frames comprising an I-frame and N subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable. Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689). The above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂_t−1) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
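For illustration only, the decode-side processing of a single P-frame described above might be sketched as follows, assuming additive residuals. The helper names (entropy_decode, flow_decoder, residual_decoder, warp) are placeholders standing in for the entropy coder and trained decoder networks of the pipeline, not a prescribed API.

def decode_p_frame(prev_recon, flow_bitstream, residual_bitstream,
                   entropy_decode, flow_decoder, residual_decoder, warp):
    # Reconstruct the flow map from its entropy coded latent and warp the
    # previous reconstruction to obtain a prediction of frame t.
    flow = flow_decoder(entropy_decode(flow_bitstream))
    warped = warp(prev_recon, flow)
    # Reconstruct the residual map and combine it with the warped prediction.
    residual = residual_decoder(entropy_decode(residual_bitstream))
    return warped + residual   # reconstructed P-frame, e.g. x̂_t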
Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU (or NPU or TPU as applicable) of the first and second devices respectively. That is, compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise, for example, a Qualcomm Snapdragon CPU. The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images of a scene, referred to hereafter as a video stream 404. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402. It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which requires the cumulative latency of the encode and decode sides to be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU (or NPU or TPU) compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature. For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations. A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is
described separately, one or more of the concepts described below may be applied in an AI based compression process as described above. Concept 1: Training super resolution with learned image and video compression Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline 500. The pipeline illustrates a method of the present disclosure that corresponds to that described in relation to Figures 1 to 4. Like-numbered features correspond to those in Figures 1 to 4. However, in Figure 5 the pipeline is further wrapped in a super-resolution wrapper. That is, the encoder E is preceded by a downsampler 501, and the decoder D is followed by an upsampler 502. We first introduce the super-resolution wrapper 501, 502 around the pipeline 500 during training, which comprises making the evaluated loss function be based on one or more terms from the list of: a difference between the output image x̂ and the input image x, a difference between the output image x̂ and the downsampled input image x_d, a difference between the upsampled output image x̂_up and the input image x, and/or a difference between the upsampled output image x̂_up and the downsampled input image x_d. Modifying the loss function in this way allows the pipeline to be "super-resolution" aware. That is, the loss function comprises a term that is not just based on differences between the input to the encoder and the output of the decoder, but also or alternatively on the input x and the output x_d of the downsampler 501 and/or the input x̂ and output x̂_up of the upsampler 502. In this way, the weights of the neural networks of the pipeline 500 (e.g. the encoder E, the decoder D, and/or the corresponding hyper encoder and hyper decoders) will be optimised to allow the encoder E to produce a
latent representation that has a distribution that is optimally entropy encodable to hit low target bit rates while at the same time allowing the decoder D to output images that have distributions that can be optimally upsampled by the upsampler 502 to produce upsampled output images x̂_up that are as close to the original input images x as possible. More generally, making the neural compression pipeline 500 super-resolution aware in this way during training results in trained networks of the pipeline 500 that, when wrapped in the super-resolution down- and/or up-samplers 501, 502 during inference, produce output images x̂_up that are closer to the input images x than a network or networks that were not trained in a super-resolution aware manner. Accordingly, the method comprises receiving an input image x at a first computer system, downsampling the input image with a downsampler 501 to produce a downsampled input image x_d, encoding the downsampled input image x_d using a first neural network E to produce a latent representation, decoding the latent representation using a second neural network D to produce an output image x̂, wherein the output image x̂ is an approximation of the input image (e.g. of the downsampled input image x_d which in turn is an approximation of the original input image x), upsampling the output image x̂ with an upsampler 502 to produce an upsampled output image x̂_up, evaluating a function (i.e. a loss function) based on a difference between one or more of: the output image x̂ and the input image x, the output image x̂ and the downsampled input image x_d, the upsampled output image x̂_up and the input image x, and/or the upsampled output image x̂_up and the downsampled input image x_d, updating the parameters of the first neural network E and the second neural network D based on the evaluated function,
and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. In the above-described method, it is envisaged that the upsampler 502 may comprise a third neural network, and the method further comprises updating the parameters of the third neural network based on the evaluated function. This is illustrated in Figure 6, showing a pipeline 600 corresponding to that of Figure 5 but where the upsampler 602 is shown as a neural network U. The upsampler neural network U is responsible for upscaling the output image obtained from the second neural network back to its original size or resolution. The upsampler neural network U may comprise a convolutional neural network architecture. For example, the upsampler may be implemented using transposed convolutions or deconvolution layers. These layers perform the inverse operation of regular convolutions and can be used to increase the spatial resolution of an image. The upsampling process can be further enhanced using various techniques such as skip connections or residual connections. Skip connections allow for direct transmission of information from earlier layers in the network to later layers, bypassing some of the intermediate layers and thereby allowing the model to leverage detailed information present in the initial stages of processing. Residual connections add the output of a layer directly to the input of another layer, effectively performing addition or subtraction operations within the network.
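By way of a non-limiting sketch, an upsampler neural network of the kind described above might combine a transposed convolution with a residual connection as follows; the class name, channel counts and kernel sizes are illustrative assumptions rather than prescribed values.

import torch.nn as nn

class Upsampler(nn.Module):
    # Illustrative 2x upsampler: a transposed convolution doubles the spatial
    # resolution and a small residual branch refines the result.
    def __init__(self, channels=3, features=32):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, features, kernel_size=4, stride=2, padding=1)
        self.refine = nn.Sequential(
            nn.Conv2d(features, features, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(features, features, kernel_size=3, padding=1),
        )
        self.out = nn.Conv2d(features, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = self.up(x)           # [B, features, 2H, 2W]
        h = h + self.refine(h)   # residual connection
        return self.out(h)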
These techniques can improve the accuracy and stability of the neural network-based upsampler by allowing it to better capture fine details in the image. More specifically, a neural network based upsampler such as U can be trained together, e.g. in an end-to-end manner, with the trainable neural networks of the neural compression pipeline 600, making the entire pipeline super-resolution aware. This is an important distinction compared to simply applying a super resolution wrapper to a neural or traditional compression pipeline. This is because it affords the networks of the pipeline the freedom to learn to produce latent representations that compress optimally and that get decoded into output images x̂ that may not be visually pleasing or indeed look to the human visual system like an accurate reconstruction of the original input images x or their downsampled versions x_d, but which nonetheless have pixel value distributions that the upsampler neural network U can take better advantage of to produce upsampled output images x̂_up that are more accurate approximations of the original input images x than an upsampler that is acting on the output of a compression pipeline that is not super-resolution aware (in this case upsampler aware). In the example of Figure 6, the downsampler may be a traditional downsampler, for example it is envisaged that the downsampler may comprise a bilinear or bicubic downsampler. Bilinear and bicubic downsampling are exemplary methods used for image resizing. They involve reducing the resolution of an input image, for example by a factor of 2x2 (e.g., from 100x100 to 50x50). Further exemplary details of bilinear and bicubic downsampling are provided below.
Bilinear Downsampling: In bilinear downsampling, the algorithm approximates the original pixel values based on the average intensity values of the surrounding pixels in the scaled image. This method assumes that the pixel intensities are uniformly distributed across the image. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Set the output pixel value to be an average of the input pixels. 3. Replace the output pixel positions with their respective calculated values from step 2. Bicubic Downsampling: In bicubic downsampling, the algorithm uses a third-order polynomial function to approximate the original pixel values based on the intensities of a neighborhood surrounding the current pixel. This method takes into account more details than bilinear interpolation but requires more computational resources. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Calculate the coefficients of the third-order polynomial function. 3. Calculate the output pixel values: x and y, using the coefficients obtained in step 2. In the context of the compression pipeline 600 shown in Figure 6, either bilinear or bicubic downsampling can be used. The choice between these two methods depends on the desired tradeoff between computation complexity and visual quality of the output images
whereby bilinear is envisaged to be quicker and accordingly contributes to faster run times on the encode side of the pipeline, while the neural network based upsampler U is retained to emphasise reconstruction accuracy. Optionally, the evaluated function (i.e. the loss function) may further be based on said differences comprising a structural similarity index measure (SSIM). The SSIM is a quality metric that compares two images in terms of their structure and contrast. For example, when used here it evaluates the similarity between an output image and its corresponding input image or downsampled input image, upsampled output image, and/or any combination thereof. By using the SSIM as the evaluation function, the method aims to optimize the neural networks for preserving the structural information in the images during the encoding and decoding process, thus improving the overall quality of the generated output images. As the human visual system is often able to perceive this higher level structural information, using the SSIM allows the network to learn to optimise for this type of difference rather than for a simpler mean square error (MSE) loss. Alternatively, MSE may be used as it is quicker and simpler to calculate and can accordingly speed up training times. In the above-described method, the downsampling can be performed using a Gaussian blur filter. That is, it is envisaged that downsampling can be achieved through an implementation wherein the input image x is filtered with a Gaussian blur. The Gaussian blur filter is a type of low-pass filter that smoothes out the image by reducing high-frequency details while preserving lower frequency information. This helps to reduce the complexity of the input image and makes it easier for the first, second and third neural networks to learn the underlying
patterns in the data and may help the loss to converge during training. In this implementation, the downsampled input image x_d produced is a smoother representation of the original input image, which can be used as an input for encoding using the first neural network E. It is envisaged that end-to-end training is preferable as it makes the pipeline fully super-resolution aware. However, it can be challenging to get the loss to converge during training and/or for training to be stable when all neural networks are being optimised simultaneously. This training instability and slow convergence can be mitigated by splitting the training into multiple phases. For example, the method may comprise (i) updating the parameters of the first neural network E and the second neural network D based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network U, (ii) freezing the parameters of the first and second neural networks E, D after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network U, for a second number of said steps. That is, the underlying compression pipeline 600 is trained first. After this initial training phase, the parameters of the first and second neural networks E, D are frozen, and then the system proceeds with a secondary training phase in which it updates the parameters of the third neural network U. As described above, this split training approach can mitigate training instability.
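A minimal sketch of this two-phase schedule, assuming a PyTorch training loop, is given below. The names E, D and U denote the first (encoder), second (decoder) and third (upsampler) neural networks, and loss_fn, downsample, loader and the learning rates are illustrative placeholders rather than prescribed components.

import torch

def train_two_phase(E, D, U, downsample, loader, loss_fn, n1, n2):
    # Phase 1: train the compression networks without the wrapper.
    opt1 = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-4)
    for _, x in zip(range(n1), loader):
        loss = loss_fn(D(E(x)), x)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Phase 2: freeze E and D, activate the wrapper and train only U.
    for p in list(E.parameters()) + list(D.parameters()):
        p.requires_grad = False
    opt2 = torch.optim.Adam(U.parameters(), lr=1e-4)
    for _, x in zip(range(n2), loader):
        x_up = U(D(E(downsample(x))))
        loss = loss_fn(x_up, x)
        opt2.zero_grad(); loss.backward(); opt2.step()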
Moving now to Figure 7, which shows a compression pipeline 700 similar to that of Figure 5, except that the downsampler now comprises a neural network 701. The features that correspond to those of Figure 5 are not repeated here for brevity. More specifically, it is envisaged that the downsampler 701 may consist of a fourth neural network G. Further, the training method also comprises updating the parameters of this fourth neural network G based on the evaluated function. The fourth neural network G, referred to as the downsampler 701, is responsible for reducing the spatial resolution of the input image while preserving important features and details in order to process them efficiently during encoding by the first neural network E. In a similar manner as described in connection with the upsampler in Figure 6, the downsampler neural network G may be trained in an end-to-end manner with the first and second neural networks E, D of the compression pipeline 700. This approach provides the end-to-end system with an extra degree of freedom to produce downsampled input images x_d that may not be in any way visually pleasing or accurate representations of x, but which can be optimally encoded by E into a latent representation that is distributed in a way that is efficiently entropy encodable and which can be decoded and subsequently upsampled into more accurate reconstructions of x than would otherwise be possible with a pipeline that is not super-resolution aware (in this case down-sampler aware). That is, unlike in traditional super resolution approaches where the downsampling attempts to produce representations of the input image that are as accurate or visually pleasing as possible, the neural network downsampler G can produce whatever output is needed by the neural
compression pipeline to help achieve a target bitrate while maintaining final output image accuracy. One exemplary downsampler having a neural network architecture is a network comprising a plurality of convolutional layers with decreasing filter sizes and increasing strides, as this approach can effectively reduce the spatial dimensions of the input image while maintaining its overall structure. In more detail, let's consider a simple example of a downsampling process using a convolutional neural network (CNN). The CNN architecture typically consists of multiple layers, each layer being composed of several filters applied across the spatial dimensions of the input image. These filters are learnable parameters that enable the network to extract various features from the image and recognize complex patterns or shapes within it. In the case of a downsampler, we can start with an initial convolutional layer having a large filter size (e.g., 7x7) and small strides (e.g., 2x2). This combination results in a significant reduction in spatial dimensions while still allowing the network to capture essential information from the input image. Subsequent layers can then use smaller filters (e.g., 3x3, 5x5) with larger strides (e.g., 2x2, 4x4), further reducing the size of the feature maps while also encouraging more localized receptive fields within the network. One drawback of large kernel sizes is that they are more computationally expensive than smaller kernels, even if they are strictly speaking more expressive.
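A non-limiting sketch of such a downsampler, following the large-kernel-first pattern described above, could be expressed as follows; the channel counts and the number of layers are illustrative assumptions.

import torch.nn as nn

# Illustrative downsampler: a large first kernel with stride 2, followed by a
# smaller kernel with stride 2, reducing the input to a quarter of its original
# height and width while returning to three output channels.
downsampler = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3),   # H/2 x W/2
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),  # H/4 x W/4
    nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1),   # 3-channel output
)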
The choice of downsampler, and of the filter sizes and strides within it, controls the balance between preserving important image details and efficiently processing the data. In practice, the downsampler may comprise multiple convolutional layers with decreasing filter sizes and increasing strides, followed by one or more max-pooling layers to further reduce the spatial dimensions of the input image. Reducing the number of layers and using small filters or kernels helps to speed up run time. A further illustrative downsampler may comprise a network architecture with a number of layers with a stride greater than 1. Every such layer will downsample by a factor of the stride. While the filter or kernel sizes can affect the spatial dimensions of the input (i.e. a bigger kernel resulting in greater downsampling), zero padding may be applied in such a way that the output remains the same size as the input if the stride equals 1. This type of downsampler structure is typically fast and accordingly works well in the context of real time or near real time compression. As alluded to above, in Figure 7 it is envisaged that the upsampler may comprise a bilinear or bicubic upsampler. Bilinear upsampling (or interpolation) is a method for upsampling an image. It involves estimating pixel values by linearly interpolating between the neighboring pixels in the original and downsampled images. An example implementation algorithm may be as follows:
1. For each pixel in the output image, find its corresponding location or pixel-coordinate in the input image. 2. Find four nearby pixel coordinates around this central coordinate of the input image. These are typically referred to as northeast (NE), southeast (SE), northwest (NW), and southwest (SW). 3. Compute a weighted average of these four pixels, where the weights depend on their distances from the desired output pixel location. The weights are usually determined by a bilinear function. 4. Repeat steps 1-3 for every pixel in the output image. Bicubic upsampling (or interpolation) is a technique to upscale an image. It is similar to bilinear interpolation but uses a bicubic function instead of a linear one. The algorithm is slightly more complex, and the resulting images tend to have smoother edges than those produced by bilinear interpolation. An example implementation algorithm may be as follows: 1. For each pixel in the output image, find its corresponding pixel in the input image. 2. Find 16 nearby pixel coordinates around this central coordinate of the input image. These are typically referred to as northeast (NE), north-northeast (NNE), northwest (NW), southwest (SW), southeast (SE), south-southeast (SSE), south (SO), and south-southwest (SSW). 3. Fit a bicubic function with coefficients that comprise weighted sums of the values of the input pixels.
4. Repeat steps 1-3 for every pixel in the output image. Both bilinear and bicubic interpolation can produce reasonably good results when upsampling images, but the choice between them will depend on the specific use case and desired level of detail preservation. In the present case, bilinear is envisaged to be preferred as it is faster and can reduce runtime while working in a pipeline 700 with a (typically slower) neural network based downsampler such as G on the encode side. As was the case with the neural network upsampler U, training in a fully end-to-end manner may introduce training instability and slow convergence. To address this, the training may be split into phases. For example, the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. More generally, training a neural network based downsampler G in an end-to-end manner is further complicated by it not being straightforward what the downsampler's training objective should be, i.e. what its loss ought to be based on. For example, should the loss terms that include the output of the downsampler G compare the input image x with the immediate output of the downsampler G, i.e. x_d, or with a
previously downsampled image (e.g. one created by a traditional downsampling method) so as to teach the downsampler to mimic a traditional downsampler, or should they be based on some other difference. The present inventors have found that as long as the loss function includes a term based on comparing the output of G with something that is not just the original input image x but also some other output of the pipeline and/or a downsampled image produced by a traditional method of downsampling, the loss converges more quickly, indicating that the neural network downsampler is learning to become super resolution aware. It is also envisaged that both the upsampler and downsampler may be neural networks. This is shown in Figure 8, which corresponds to Figure 5 but where the upsampler and downsampler comprise neural networks. Like-numbered references indicate like features which are not repeated here for brevity. Specifically, Figure 8 illustratively shows a pipeline 800 comprising a neural network downsampler G 801 and a neural network upsampler U 802 wrapped around a neural compression pipeline. It is envisaged that these may be trained in an end-to-end fashion. More specifically, by making the loss function be based on comparisons between not just the input image x and the final upsampled output image x̂_up, but also between the output of the downsampler x_d and the various other outputs of the pipeline, as well as optionally a previously downsampled image, the network learns to become super-resolution aware and outperforms networks where the training of the neural compression pipeline is not connected to the up- and down-samplers, either through training or through the comparisons calculated in the terms of the loss function. Considering all of the above approaches more generally now, it is envisaged that the methods described above may comprise entropy encoding the latent representation into a bitstream
with a specified length, wherein the function used in the method is also dependent on said bitstream length. That is, the loss function further comprises a rate term. Including the rate term in the loss function allows the networks to learn to optimise for bit rates (e.g. in bits per pixel) simultaneously with image reconstruction accuracy. Some or all of the loss terms may be scaled or weighted with respect to each other to focus the learning on any of the objectives as defined by the different loss terms. Additionally and/or alternatively, the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image may be determined based on the output of a fifth neural network acting as a discriminator. It will be understood that the output of the discriminator may be the differentiation between a ground truth image and a "fake" (i.e. compressed) reconstructed image during training. The discriminator loss term used in the training of the encoder/decoders of the AI compression pipeline is only a function of the compressed image. In essence, the training tries to encourage the neural networks to change in such a way that their output will be more realistic and like the ground truth image. Faithfulness to the ground truth image is taken care of by the distortion loss term (e.g. mean squared error) or other loss. The evaluation of the function based on these differences guides the process of updating the parameters of the first neural network (encoder) and the second neural network (decoder), leading to improved performance and
better results over time. This approach can help to improve the overall quality of the generated output images. It is accordingly also envisaged that the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM). These correspond to the distortion term of the loss function. MSE measures the average squared difference between two images, while SSIM computes structural similarity based on luminance, contrast, and structure. By using these metrics, the method is able to optimize the parameters of the neural networks to better approximate the input image in subsequent iterations, resulting in improved image reconstruction performance over multiple runs with different sets of input images. In the method as described above, the function may further comprise a term defining a visual perceptual metric that models how a human visual system may perceive differences. In the above-described method, it is envisaged that the term defining a visual perceptual metric may comprise an MS-SSIM loss. This loss function serves to gauge how effectively the network is approximating the input image with the output image. By iteratively minimizing this loss function through parameter updates in the neural networks, the trained neural network improves its ability to generate an output image that closely resembles the input image.
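Purely by way of example, a loss function combining a rate term with MSE- and SSIM-based distortion terms of the kind discussed above might be sketched as follows; the weighting factors, the ssim callable and the argument names are illustrative assumptions rather than the specific loss used by the pipeline.

import torch.nn.functional as F

def loss_fn(x, x_d, x_hat, x_up, rate, ssim, w_rate=1.0, w_mse=1.0, w_ssim=0.1):
    # x: input image, x_d: downsampled input, x_hat: decoder output,
    # x_up: upsampled output, rate: estimated bits per pixel from the entropy model.
    distortion = F.mse_loss(x_hat, x_d) + F.mse_loss(x_up, x)
    structural = 1.0 - ssim(x_up, x)   # any differentiable SSIM/MS-SSIM implementation
    return w_rate * rate + w_mse * distortion + w_ssim * structural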
It is further noted that the above described methods may be used in the context of any pre-trained neural compression network, and accordingly the present disclosure envisages a method where only the weights of the upsampler and/or downsampler are updated during training. Accordingly, such a method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image. Next, the output image is upsampled with an upsampler comprising a third neural network. Thereafter, the difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image is evaluated using a function, and the parameters of the third neural network are updated based on the evaluated function. These steps are then repeated using a first set of input images to produce a first trained neural network and a second trained neural network. Or, in the case where only the downsampler weights are trained, the present disclosure envisages a method comprising receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image that approximates the input image, evaluating a function based on differences between the output image and other images, updating the parameters of the fourth neural network based on the evaluated function, and repeating these steps with a first set of input images to create trained versions of the first and second neural networks.
As described above, it is envisaged that producing the previously downsampled input image may be performed by either bilinear or bicubic downsampling techniques. Finally, the present disclosure also proposes using a network trained in accordance with the above-described methods. That is: receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image. A specific example use case of the above-described super resolution approaches will now be described: more flexible bitrate ladders. A bitrate ladder refers to a set of predefined bitrates applied within an encoded file. In the context of video compression, it refers to a series of different bitrates that can be chosen to achieve the desired trade-off between video quality and file size. In general, when encoding video files, a balance is struck between two conflicting goals: achieving high visual quality while minimizing the file size. The process involves converting the raw data into a compressed format that requires less storage space.
To accomplish this task in traditional compression, a number of known algorithms are used, often dictated by compression codec standards. One such standard is H.264/AVC (Advanced Video Coding), which is widely adopted due to its balance between encoding complexity and image quality. The H.264 standard includes various profiles and levels that define the maximum bitrate and other parameters for a given video stream. Implementers of the standard typically target these profiles and levels to ensure their implementation is standard compliant. More specifically, when using these standards, implementers can choose from different preset bitrates. These predefined bitrates are often referred to as a "ladder" because they represent a series of steps or options available when choosing the optimal encoding settings for a given video file. Typically, bitrate ladders make use of the idea that the encode resolution can be varied a priori, whereby the streaming provider has already pre-encoded its content at a plurality of different resolutions, which in turn facilitates giving an end user a choice of what quality setting to apply given some particular bitrate budget. The bitrate ladder approach works by progressively decreasing the bitrate from one level to another, allowing a given balance between quality and file size to be found. For example, starting at a higher-than-desired bitrate, you can gradually reduce it until you reach an acceptable level of visual degradation without sacrificing too much detail or clarity. In a neural compression pipeline, the neural networks are typically trained to perform optimally at a given bitrate. Accordingly, covering all the predefined bitrates of a given bitrate ladder may require training separate neural networks for each predefined bitrate, which can be burdensome and may result in the final codec library memory footprint being potentially very large.
To overcome this issue, the above-described approaches may be used to make the base neural compression networks super-resolution aware, which allows a given base neural compression pipeline to be used not only for compression to its targeted bitrate, but also for other bitrates in the bitrate ladder by applying the super-resolution wrapper around the base models when desired. This in turn makes it more viable to use neural compression pipelines in the implementation of a bitrate ladder to comply with a given codec standard, or to aim for an entirely different bitrate ladder that provides even better rate-quality trade-offs than traditional codec bitrate ladders, given how significantly better neural compression pipelines can compress images and video compared to known codecs. Concept 2: Lightweight convolutional downsampling and upsampling As has been explained above in the general section, there is a wide gap between the ideation of high level ideas, the research-stage implementations of super resolution architectures, and the production level implementation of such architectures. This is particularly the case where the implementation is intended to be performed in real time or near real time on resource-constrained hardware, such as edge devices. For example, a research-stage approach that works well and fast on a GPU such as an NVIDIA 4090, A10 or A100 card is very unlikely to achieve the same performance on resource-constrained mobile device platforms such as laptops, tablets and smartphones. One area of AI-based video compression where this is particularly problematic is in the implementation of downsampling and upsampling algorithms.
More particularly, one common component of such upsampling and downsampling algorithms is a process known as PixelShuffle and PixelUnshuffle. Both operations manipulate the arrangement of data in tensors (multi-dimensional arrays) that represent images. PixelShuffle is often used in super-resolution models. That is, in general terms, PixelShuffle increases the resolution of an input image by rearranging the elements of a tensor. The following outlines a PixelShuffle operation: Input Tensor: shape [batch_size, C * r^2, H, W], where C is the number of channels (e.g., 3 for an RGB image), r is the upscale factor, and H and W are the height and width of the tensor. Rearrangement of Data: PixelShuffle rearranges elements in this tensor to form a new tensor of shape [batch_size, C, H * r, W * r]. Essentially, it "shuffles" the data from the channel dimension into the spatial dimensions (height and width). Upscaling: This operation effectively upscales the image by a factor of r, increasing the resolution. For example, if r = 2, each pixel in the original image is rearranged to form a 2x2 block in the output image. Application in Super-Resolution: In super-resolution models like EDSR or SRGAN, PixelShuffle is used in the latter stages to upscale the low-resolution input to a high-resolution output. It is a part of the sub-pixel convolution technique where the model first increases the number of channels with additional convolutions and then uses PixelShuffle to upscale the image spatially.
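For illustration, the shape behaviour described above can be checked directly with the PixelShuffle module available in PyTorch; the tensor sizes below are arbitrary example values.

import torch
import torch.nn as nn

x = torch.randn(1, 3 * 2 ** 2, 16, 16)    # [batch_size, C * r^2, H, W] with C = 3, r = 2
y = nn.PixelShuffle(upscale_factor=2)(x)
print(y.shape)                             # torch.Size([1, 3, 32, 32])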
PixelUnshuffle is the reverse operation of PixelShuffle. It is used to decrease the spatial resolution of an image while increasing the number of channels. The following outlines a PixelUnshuffle operation: Input Tensor: shape [batch_size, C, H, W]. Rearrangement of Data: PixelUnshuffle rearranges elements to form a new tensor of shape [batch_size, C * r^2, H/r, W/r]. It does so by taking spatial blocks of size r x r and stacking them depth-wise in the channel dimension. Downscaling: This process effectively downscales the image by a factor of r, reducing its spatial dimensions. For example, if r = 2, a 2x2 block of pixels in the input image is rearranged into a single pixel in the output, with the depth (channels) increased by a factor of 4. Application: PixelUnshuffle can be used in tasks like feature extraction, where reducing spatial resolution while retaining information in the channel dimension might be beneficial. It is also useful in certain generative models or autoencoders where manipulating spatial resolution at different stages of the network is required. More generally, PixelShuffle is used for upscaling an image by rearranging the channel data into the spatial dimensions, whereas PixelUnshuffle does the opposite, downscaling an image by rearranging spatial data into the channel dimension.
PixelShuffle and PixelUnshuffle are specific implementations of "depth-to-space" and "space-to-depth" operations. These can be explained in their generalised form as follows. The following outlines a depth-to-space operation: Input Tensor: The operation takes an input tensor of shape [batch_size, C * r^2, H, W], where C is the number of channels, r is the upscale factor, and H and W are the height and width. Rearrangement of Data: It rearranges the elements of this tensor to form a new tensor of shape [batch_size, C, H * r, W * r]. This rearrangement involves redistributing the elements from the depth (channels) into the spatial dimensions (height and width). Upscaling Effect: The result is an upscaling of the image or feature map by a factor of r, with each pixel in the original tensor contributing to a block of pixels in the output tensor. The following outlines a space-to-depth operation: Input Tensor: the input of a space-to-depth operation is a tensor of shape [batch_size, C, H, W]. Rearrangement of Data: Space-to-Depth rearranges elements to produce a new tensor of shape [batch_size, C * r^2, H/r, W/r]. It does this by taking blocks of pixels from the spatial dimensions and stacking them in the channel dimension. Downscaling Effect: This leads to a reduction in the spatial resolution by a factor of r, while increasing the depth (channels) of the tensor.
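By way of illustration, the generalised rearrangements above can be written purely as reshape and permute operations; the exact channel ordering below is one possible convention and may differ from that of PixelShuffle/PixelUnshuffle.

import torch

def space_to_depth(x, r):
    # [B, C, H, W] -> [B, C*r*r, H/r, W/r]
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // r, r, w // r, r)
    x = x.permute(0, 1, 3, 5, 2, 4)
    return x.reshape(b, c * r * r, h // r, w // r)

def depth_to_space(x, r):
    # [B, C*r*r, H, W] -> [B, C, H*r, W*r]
    b, c, h, w = x.shape
    x = x.reshape(b, c // (r * r), r, r, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3)
    return x.reshape(b, c // (r * r), h * r, w * r)

Written this way, the operations consist of many small memory rearrangements, which is precisely the behaviour discussed next.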
A problem with all of the above approaches is that they involve a very large number of small parallelisable operations. These can be performed very efficiently in parallel on GPUs but cause bottlenecks and dramatic decreases in performance on CPUs or other more resource-constrained hardware. The present inventors have realised that both depth-to-space and space-to-depth operations used in upsampling and downsampling can be replaced wholly with convolutional operations. This is made possible by virtue of the realisation that the change in dimensions from the key "rearrangement of data" steps of depth-to-space and space-to-depth can be achieved by one or more convolutional operations, for example performed by corresponding one or more convolutional layers in a neural network. Convolutional operations are typically already optimised at the chip level in commonly available commercial hardware chips on laptops, tablets and smartphones (such as the M2 and M3 Apple chips, the Qualcomm Snapdragon chips, the Meteor Lake Intel chips, and others). Accordingly, replacing depth-to-space and space-to-depth with convolutions allows upsampling and downsampling to be performed far more efficiently than with traditional depth-to-space and space-to-depth operations. Whilst the above improvement is envisaged to be used in the context of, for example, super-resolution such as that described in concept 1, for instance in the upsamplers and/or downsamplers of Figures 5 to 8, it is generally applicable to any instance where upsampling or downsampling might be performed in a neural compression pipeline. For example, one or more layers in the first or second neural networks of Figures 1, 2, 11 or 14 may be configured to downsample or upsample an input. Or there may be an intermediate layer within these networks, or
modules (not shown in the Figures) positioned throughout the pipeline that may perform downsampling or upsampling. In each of these cases it is envisaged that these downsampling or upsampling operations may be performed without depth-to-space or space-to-depth approaches, but with convolutional operations instead. Alternatively, there may be some combination of depth-to-space or space-to-depth in some instances but convolutional operations in others. For example, the flow module 206 in Figure 3 may comprise one or more layers or modules configured to downsample the input. This is one way to speed up runtime, as the flow often need not be estimated at as high a resolution as the input image: the quality of reconstructed images created with a high resolution flow can be similar to that of reconstructed images created with a low resolution flow. This downsampling, if performed using traditional space-to-depth, would be a bottleneck. However, replacing space-to-depth with one or more convolutional operations removes this bottleneck due to the convolutional operations being faster and better optimised for hardware-constrained platforms. The corresponding inverse upsampling may then be applied at the output of the flow module 206, again using convolutional operations rather than depth-to-space. A corresponding set of operations may be performed with the residual module of Figure 3, in any of the modules of the hypernetwork in Figure 2, and/or in the compression pipeline of Figure 1. An exemplary implementation of mimicking space-to-depth (i.e. downsampling) may be as follows.
CONVOLUTIONAL LAYER SETUP: Kernel Size: The kernel size is envisaged to match the block size that is to be mimicked. For example, if the block size for space-to-depth is r (say, 2 for a 2x2 block), the corresponding convolution kernel size would be r x r (2x2 in the above example). Stride: The stride size is envisaged to equal the block size (r). This ensures that the convolutional filters move across the image in steps equal to the block size, effectively capturing the spatial blocks. Number of Filters: It is envisaged that the number of filters is set to C * r^2, where C is the original number of channels. This ensures that each filter produces an output that corresponds to one depth level in the space-to-depth transformation. SEQUENTIAL CONVOLUTION LAYERS: To fully replicate space-to-depth, it is envisaged that a number of convolutions may be used sequentially, e.g. by using a series of convolutional layers. This allows for the handling of cases where the channel increase (to C * r^2) is significant. Each layer progressively accumulates more spatial information into the depth dimension. For completeness, it is also possible to replicate space-to-depth with a single strided convolution. Splitting it into multiple convolutions with activations between them provides additional expressive power, but is not needed to merely replicate the functionality of space-to-depth. ACTIVATION FUNCTIONS:
It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially. An exemplary activation function may be ReLU, which introduces a non-linearity and helps in learning spatial patterns. CHANNEL REARRANGEMENT: Finally, after these convolutional operations, the output channels may optionally be rearranged to match the order that a space-to-depth operation produces. Alternatively, this can be done at any point in the process, and need not be done "on the fly". An exemplary implementation of mimicking depth-to-space (i.e. upsampling) may be as follows: CONVOLUTIONAL LAYER SETUP: Kernel Size: A kernel size is envisaged that aligns with the desired spatial expansion. For example, if the upscale factor is r, a larger kernel size (like 3x3 or larger) can be more effective in spreading out the information across a larger spatial area. Alternatively, the depth-to-space operation can be implemented with a single transposed convolution with stride equal to the upsampling factor. It is also possible to make a more expressive process by using larger kernel sizes and/or splitting a more extensive upsample into multiple stages. Stride: It is envisaged that the stride may be set to 1, ensuring a uniform spread of information. Or the stride may be equal to the upsampling factor, as described above.
Number of Filters: This may be less than the original number of channels to reduce the channel dimension gradually. The exact number can vary depending on the architecture and desired output. SEQUENTIAL CONVOLUTION LAYERS: Multiple convolutional layers may be advantageous, especially if the change from depth to spatial dimensions is significant. Each layer can gradually increase the spatial dimensions and reduce the depth. ACTIVATION FUNCTIONS: It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially. An exemplary activation function may be ReLU, which can introduce non-linearity and help in learning spatial patterns. UPSAMPLING LAYERS: Alongside the convolutional layers, upsampling layers (like nearest neighbor or bilinear upsampling) can be used to increase the spatial dimensions. These can be alternated with convolutional layers to progressively achieve the desired spatial expansion.
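As a minimal sketch of the replacement described above, a strided convolution may stand in for space-to-depth and a strided transposed convolution for depth-to-space; the channel counts follow the C to C * r^2 pattern for r = 2 and are otherwise illustrative assumptions.

import torch.nn as nn

r, C = 2, 3
# Space-to-depth replacement: stride r reduces H and W by a factor of r while
# the filter count expands the channels to C * r^2.
space_to_depth_conv = nn.Conv2d(C, C * r * r, kernel_size=r, stride=r)
# Depth-to-space replacement: a transposed convolution with stride r restores
# the spatial dimensions while reducing the channels back to C.
depth_to_space_conv = nn.ConvTranspose2d(C * r * r, C, kernel_size=r, stride=r)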
The above exemplary implementation is illustrative only and is not intended to be limiting. For example, any suitable kernel size, stride and filter numbers are envisaged, as are the number of optional sequential convolution layers, activation functions and other steps. By way of illustration, Figure 9 shows an example sequence of layers of a neural network which takes an input image and downsamples it. The sequence of layers comprises a 3x3 conv layer, a ReLU activation function, a space-to-depth (2x) operation, a 1x1 conv layer, another ReLU activation layer and finally a depth-to-space (3x) operation. This approach uses space-to-depth and depth-to-space. Implementing this sequence of layers in resource constrained environments, e.g. on a CPU, results in bottlenecks. In contrast, Figure 10 illustratively shows an example sequence of layers of a neural network or compression pipeline, for example one or more of the neural networks or compression pipelines shown in any of Figures 1 to 8, but where the space-to-depth and depth-to-space operations have been replaced by convolution operations. For example, the depth-to-space replacement comprises a strided transposed convolution and the space-to-depth replacement comprises a strided convolution. This implementation substantially reduces the bottlenecks when running in resource-constrained environments such as on a CPU. Concept 3: Regional Kernel Absolute Deviation (RKADe) for Flow In image processing, Mean Absolute Difference (MAD) is a technique for detecting and numerically estimating differences between pixels and/or pixel patches, that is, differences
between the values of one or more pixels in one image and the values of one or more pixels in another image. In the context of AI-based compression pipelines, MAD may be used in the estimation of cost volumes in order to estimate flow as part of a flow-residual compression pipeline such as that shown in Figures 3, 11 and 14. Figure 3 has already been described above. Figure 11 illustrates an example of a flow module, in this case a network 1100, configured to estimate information indicative of a difference between an image x_t−1 and an image x_t, e.g. flow information. Figure 11 is provided as a non-limiting example of how such flow information may be calculated between two images. The flow module may be used in or together with the flow module part 206 (Figure 3) of the compression pipeline. An alternative approach to estimating flow is described in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. The network 1100 in Figure 11 comprises a set of layers 1101a, 1101b respectively for an image x_t−1 and an image x_t from respective times or positions t−1 and t of a sequence of frames. The set of layers 1101a, 1101b may define one or more convolution operations and/or nonlinear activations (for example as described in concept 2 above) to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 x w/2 downsampling
in a first layer, h/4 x w/4 downsampling in a second layer, h/8 x w/8 downsampling in a third layer, h/16 x w/16 downsampling in a fourth layer, h/32 x w/32 downsampling in a fifth layer, h/64 x w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other layers, operations and levels are also envisaged. For example, other operations, not only those from concept 2 above, may be used. With the downsampling operations performed and the corresponding pyramid of feature maps generated, a first cost volume 1102 is calculated at the most coarse level between the feature map pixels of the first image x_t−1 and the corresponding feature map pixels of the second image x_t. Cost volumes define the matching cost of matching the pixels in one image with the pixels in a second image (which may be later in time, or earlier in time, for example due to the order in which B-frame processing typically occurs, which is not necessarily the chronological order of the frames). That is, the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of similarity such as a vector or dot product, a cosine similarity, a mean absolute difference, or some other measure of similarity. This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the closeness metric is being calculated. This process is computationally expensive in floating point space but can be implemented efficiently in integer or fixed point space.
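A non-limiting sketch of such a local cost volume, using a mean absolute difference over a fixed search radius, is given below; the feature map layout and the radius value are illustrative assumptions.

import torch
import torch.nn.functional as F

def cost_volume(feat_a, feat_b, radius=3):
    # feat_a, feat_b: [B, C, H, W] feature maps for frames t-1 and t.
    # Returns [B, (2*radius+1)^2, H, W]: one MAD value per candidate displacement.
    b, c, h, w = feat_a.shape
    padded = F.pad(feat_b, (radius, radius, radius, radius))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat_a - shifted).abs().mean(dim=1))  # MAD over channels
    return torch.stack(costs, dim=1)

Computed in this form over small radii and coarse feature maps, the cost volume remains inexpensive; larger radii at fine resolutions quickly become costly, which is one reason the pyramid structure described above is used.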
Once the first cost volume 1102 at the coarsest level is estimated, a first flow 1103 can be estimated from the first cost volume 1102. This may be achieved using, for example, a flow extractor network, which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person. The same process is then repeated for the other levels of feature map coarseness to calculate a second cost volume 1104 and second flow 1105, and so on for the cost volumes and flows associated with each of the levels of coarseness until they have all been calculated, up to the final cost volume 1106 and flow 1107. The weights and/or biases of any activation layers in network 1100 (e.g. optionally in the downsampling convolution layers and/or in a flow extractor network that produces flow maps from the cost volumes) are trainable parameters and can accordingly be updated during training either alone, or in an end-to-end manner with the rest of the compression pipeline. The trainable nature of these parameters provides the network 1100 with flexibility to produce feature maps at each level of spatial resolution (i.e. pyramid feature maps) and/or flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on). For example, it allows the network 1100 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have
been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline. Optionally, for each level of coarseness or resolution, the flow of the previous level or levels of coarseness or resolution may be used to warp 1108, 1109 the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t-1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details. Optionally, removing warping entirely, or in some levels of coarseness or resolution, can substantially reduce the run-time of flow calculation while maintaining good levels of flow accuracy. As the warping process uses inputs from different levels of coarseness or spatial resolution, the flow estimation output may be upsampled 1110, 1111 (for example using the methods of concept 2, or using any other upsampling method) first to match the coarseness or spatial resolution of the feature map to which the flow is being applied in the warping process. The outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof. The flow or representation thereof may then be transmitted in a bitstream and decoded by a flow decoder, the output of which may in turn be used to warp a previously decoded image for use in the residual encoder/decoder arrangement as shown in e.g. Figure 3.
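By way of a non-limiting, illustrative sketch only, the coarse-to-fine flow estimation described above might be realised as follows, assuming PyTorch. The class and helper names (PyramidFlowSketch, correlation_cost_volume, warp), the channel counts and the number of levels are assumptions for illustration and are not the specific architecture of network 1100.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def correlation_cost_volume(a, b, radius=3):
    # Channel-wise correlation between a and every shift of b within the radius.
    h, w = a.shape[-2:]
    b_pad = F.pad(b, (radius, radius, radius, radius))
    vols = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = b_pad[..., dy:dy + h, dx:dx + w]
            vols.append((a * shifted).mean(dim=1, keepdim=True))
    return torch.cat(vols, dim=1)  # (N, (2r+1)^2, H, W)


def warp(x, flow):
    # Bilinearly sample x at locations displaced by a per-pixel flow (in pixels).
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)    # (1, 2, H, W)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)


class PyramidFlowSketch(nn.Module):
    def __init__(self, channels=32, levels=4, radius=3):
        super().__init__()
        self.radius = radius
        # Strided convolutions build the h/2, h/4, ... feature pyramid.
        self.downs = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else channels, channels, 3, stride=2, padding=1)
             for i in range(levels)])
        # A small per-level "flow extractor" maps a cost volume to a 2-channel flow.
        self.extractors = nn.ModuleList(
            [nn.Conv2d((2 * radius + 1) ** 2, 2, 3, padding=1) for _ in range(levels)])

    def pyramid(self, x):
        feats = []
        for down in self.downs:
            x = F.relu(down(x))
            feats.append(x)
        return feats  # fine -> coarse

    def forward(self, x_prev, x_cur):
        f_prev, f_cur = self.pyramid(x_prev), self.pyramid(x_cur)
        flow = None
        for lvl in reversed(range(len(f_prev))):   # start at the coarsest level
            a, b = f_prev[lvl], f_cur[lvl]
            if flow is not None:
                # Upsample the coarser flow and warp the reference features with it.
                flow = 2.0 * F.interpolate(flow, size=a.shape[-2:], mode="bilinear",
                                           align_corners=False)
                a = warp(a, flow)
            cv = correlation_cost_volume(a, b, self.radius)
            delta = self.extractors[lvl](cv)
            flow = delta if flow is None else flow + delta
        return flow
```

As with the network 1100 described above, the per-level extractor weights in such a sketch are trainable and could be updated end to end with the rest of a compression pipeline.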
Returning now to the estimation of cost volumes, the cost volumes may be used to compute local (translational) alignment through patch-wise comparisons. For example, let x, y ∈ R^{C×H×W} be two tensors each with C ∈ N channels and spatial dimensions H × W ∈ N². A standard cost volume to compute is

CV(x, y)_{i,o} := ⟨x_{(:,i)}, y_{(:,i+o)}⟩,

where the subscripts denote i ∈ [H] × [W] and o = (o_1, o_2) ∈ {−r, ..., r}², with r ∈ N being the radius of the cost volume. The function ⟨·, ·⟩ above is the inner product, meaning that the cost volume is constructed by computing the channel-wise correlation. MAD may be used in the calculation of cost volumes as follows. Let P : R^{C×H×W} → R^{C×(2s+1)²×H×W} be a patch operator that associates to each pixel i ∈ [H] × [W] the (2s+1) × (2s+1) patch centered at pixel i, for an integer s ≥ 0. Then the MAD cost volume is defined by

cv_mad(x, y)_{i,o} := ∥P(x)_i − P(y)_{i+o}∥_1,   i ∈ [H] × [W],   o ∈ {−r, ..., r}².
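A minimal, illustrative sketch of a direct (naive) computation of such a MAD cost volume is given below, assuming PyTorch and, for brevity, the s = 0 case in which each pixel is represented by its channel vector rather than a larger patch. The function name and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def mad_cost_volume(x, y, radius=3):
    """x, y: tensors of shape (C, H, W). Returns a ((2r+1)^2, H, W) MAD cost volume."""
    c, h, w = x.shape
    # Pad the reference so every offset o in {-r, ..., r}^2 is defined.
    y_pad = F.pad(y, (radius, radius, radius, radius))
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = y_pad[:, radius + dy: radius + dy + h,
                               radius + dx: radius + dx + w]
            # L1 norm of the channel-wise difference at each pixel.
            costs.append((x - shifted).abs().sum(dim=0))
    return torch.stack(costs, dim=0)
```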
Exemplary r values that may be used include r = 1 (giving comparisons of 3x3 patches), r = 2 (giving comparisons of 5x5 patches), r = 3 (giving comparisons of 7x7 patches), and so on. The MAD-based cost volume estimation is more computationally efficient than other known cost volume estimation methods and accordingly synergistically helps to reduce run times of the flow estimation of an AI-based compression pipeline. However, the present inventors have
found that implementing the operations used to perform MAD calculations at the machine level has a number of downsides, particularly when the MAD calculations are implemented using convolutions. Specifically, an input tensor of arbitrary C × H × W dimensions is typically stored in non-contiguous blocks of memory. For example, in one example, the values of the elements in the spatial dimensions h and w for one channel c of an input tensor may be stored in a first block of memory, while the values of the elements in the spatial dimensions h and w for another channel c′ may be stored in a second block of memory, and so on. When a MAD value is estimated that involves this multi-channel input tensor, the values of the elements of h and w are accessed from one of the memory blocks, the values of the elements of h and w are accessed from another of the memory blocks, and so on. This means the number of memory access operations can be very high even for relatively simple operations. Consider for example the following pseudo code for implementing the above-described MAD approach:
1) Receive input tensor input_1 representing a first image, and input tensor input_2 representing a second image.
2) Apply repeated interleaving of input_1 to obtain tensor x of desired shape: x = repeat_interleave(input_1).
3) Apply flattening and unfolding of input_2 to obtain tensor y of desired shape: y = flatten_unfold(input_2).
4) Calculate absolute difference between tensors x and y: absolute_differences = absolute_difference_calculation(x, y).
5) Sum the absolute differences with a strided sum operation: mad_output = strided_sum(absolute_differences).
In the above example, the strided sum operation comprises depth or channel-wise grouped convolutions (i.e. convolutions applied in depth-wise groups, each group taking as inputs the values of the spatial dimensions h and w stored in non-contiguous memory blocks). That is, the stride is in the depth (i.e. channel) dimension, necessitating the access and retrieval of the values of the elements of h and w stored in separate memory blocks. Alternatively, if the grouping of the convolutions is in the spatial dimensions rather than the depth dimension, the non-contiguous memory block problem still arises but now in the depth dimension. In other words, the use of grouped convolutions results in a convolution-based MAD approach that has a memory access bottleneck during run time. Accordingly, implementing MAD-based, naive, patch-wise comparisons using convolutional operations on most commercial hardware CPUs, GPUs and NPUs (i.e. neural accelerators) is slow due to the interleaving of data in memory that results from the order in which the operations of the convolutions are performed. A schematic of this strided sum based implementation of a MAD cost volume calculation is shown in Figures 12a, 12b and 12c. Illustrated in Figure 12a is a toy representation of input tensors 1200a, 1200b respectively associated with a first image and a second image. Each
input tensor has 3 channels: channel 1 (Ch1), channel 2 (Ch2) and channel 3 (Ch3). In a first step, repeat interleaving 1201 is performed on the first image input tensor 1200a to produce a first intermediate output. In a second step, an unfold convolution operation 1202 is performed on the second image input tensor 1200b to produce a second intermediate output. The unfold convolution operations are performed as group convolutions and accordingly each group is assigned its own memory block. In a third step 1203, an absolute difference is estimated between the intermediate outputs of the first step and the second step to produce a third intermediate output. As is shown in Figure 12b, the elements of the third intermediate output are stored in said respective, different memory blocks - in this case memory blocks 1, 2 and 3 (Mb1, Mb2, Mb3) to match the number of outputs. Finally, a strided sum 1204 is performed on the estimated absolute differences stored in the respective memory blocks 1, 2, and 3 to produce the MAD output cost volume tensor. Again, by virtue of the group convolutions of the strided sum 1204 and by virtue of the memory block locations, this operation requires non-contiguous memory blocks to be accessed for each of the convolutions of the strided sum 1204, as is illustrated in Figure 12c. Whilst it would in principle be possible to introduce a shuffling of the layers of the input tensor earlier in the flow, for example before the unfold convolutions and/or before the strided sum convolutions, this introduces an additional shuffling operation which is also slow given the number of memory read and write operations required to complete a full interleaving or de-interleaving process, and can also further complicate any other parts of the AI compression pipeline that rely on this information in unshuffled form. The additional shuffling operation to resolve the downstream issue is accordingly undesirable and does not result in appreciable run-time improvements.
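An illustrative sketch of the unfold-based pseudocode above (steps 1 to 5) is given below, assuming PyTorch and the s = 0 per-pixel case; the function name is an assumption. It reproduces the channel layout that forces the strided (grouped) reduction discussed above.

```python
import torch
import torch.nn.functional as F


def mad_cost_volume_unfold(input_1, input_2, radius=3):
    """input_1, input_2: (N, C, H, W). Returns (N, (2r+1)^2, H, W)."""
    n, c, h, w = input_1.shape
    k = 2 * radius + 1

    # Step 2: repeat-interleave the first input so each channel appears once per
    # offset position: channel layout [c0, c0, ..., c1, c1, ...].
    x = input_1.repeat_interleave(k * k, dim=1)

    # Step 3: unfold the second input into the same (C * k^2) channel layout, one
    # shifted copy of each channel per offset position.
    y = F.unfold(input_2, kernel_size=k, padding=radius).view(n, c * k * k, h, w)

    # Step 4: element-wise absolute differences.
    absolute_differences = (x - y).abs()

    # Step 5: "strided sum" over channels for each offset. In a purely convolutional
    # deployment this reduction is a grouped 1x1 convolution, which is the
    # non-contiguous memory access bottleneck discussed above.
    mad_output = absolute_differences.view(n, c, k * k, h, w).sum(dim=1)
    return mad_output
```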
To address these and other problems, it is helpful to first consider in more detail the purpose of the cost volume in the presently described AI-compression pipeline. Specifically, the goal of the cost volume is to construct a spatial comparison operator that encodes some notion of how two patches are related. For a given pixel i, the cost volume thus has a measure of comparison between x_i and y_{i+o} for a collection of local offsets o. The present inventors have realised that any effective encoding of this information can suffice in AI-compression pipelines because the neural networks that make up the flow encoder and/or decoder and residual encoder and/or decoder, and indeed any other neural networks that make up the AI-compression pipelines of Figures 1-14, are able to learn to accommodate the encoding of this information, regardless of how it is represented. It cannot be overstated how significant an advantage this is for end-to-end AI-based compression pipelines over traditional compression methods. In particular, this facilitates the use of approaches to estimating cost volumes that simply are not viable in traditional compression pipelines. Presently described concept 3 is directed to such approaches, which are described herein as compressive approaches. That is, the use of a compressive encoding of the measure of comparison between x_i and y_{i+o} for a collection of local offsets o is made possible, and in particular a compressive cost volume estimation (i.e. a compressively encoded cost volume) is made possible. In traditional, local cost volume approaches, a pixel is compared to pixels in a given reference image that lie within a given radius. In the context of flow estimation, this may be the comparison of a pixel in a first image to pixels in a radius around a corresponding pixel coordinate in a second image. For example, in a radius r local cost volume, one must compare
a pixel to (2r + 1)² reference pixels (e.g., for radius 3 there are 49 reference pixels in a 7x7 block). In classical, local cost volume approaches a version of this mapping might be:

Φ_i(o) = |x_i − y_{i+o}|.

This approach effectively comprises a naive patch-wise comparison of the two images. An inductive bias for structured data suggests that the above map is approximately sparse in a basis, meaning that the data can be equivalently represented in a low-dimensional subspace approximately logarithmic in the dimension of the ambient space. Thus, a compressive encoding of the cost volume may look like:

A · Φ_i(o),

where A ∈ R^{m×n} is an appropriately chosen random matrix satisfying m ∼ O(log n).
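A toy, illustrative sketch of such a compressive encoding is given below, assuming PyTorch; the choice of compressed dimension m and the normalisation of the random matrix are assumptions for illustration only.

```python
import torch

radius = 3
n = (2 * radius + 1) ** 2              # 49 local offsets per pixel
m = 8                                   # compressed dimension, m << n

phi = torch.rand(n, 64, 64)             # toy per-pixel local cost volume Phi_i(o)
A = torch.randn(m, n) / n ** 0.5        # fixed random projection, roughly norm-preserving

# Apply A pixel-wise: (m, n) x (n, H*W) -> (m, H*W), reshaped back to (m, H, W).
compressed = (A @ phi.view(n, -1)).view(m, 64, 64)
```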
From this, the present inventors have realised that it is possible to build a learned mapping that replaces the above-described classical, naive patch-wise comparison approach of estimating local cost volumes, which requires a large number of operations. The learned mapping effectively bypasses the direct computation of the local cost volume and instead computes the lower dimensional compressively encoded cost volume A · Φ_i(o) directly. This compressively encoded cost volume, or a representation thereof produced by a post-processing step, contains substantially the same information as a traditional cost volume, but this information is provided in a lower dimensional representation that can still be passed to any subsequent, downstream components of the AI-compression pipeline that rely on cost volumes in the usual way. Examples of downstream components may include layers in the flow modules, the final
estimation of flow, the warping in or after the flow module, and so on. Given that these downstream components are agnostic as to how they receive the cost volume information, since end-to-end training allows them to adapt to whatever form the information is received in, the compressively encoded cost volume facilitates the estimation of image differences far more efficiently than traditional cost volume calculation methods. This approach is described hereinafter as Regional Kernel Absolute Deviation (RKADe) and allows for any MAD operations in the AI-compression pipeline to be substituted by RKADe operations, which mitigate the memory interleaving issue described above in connection with using MAD and naive patch-wise comparisons for cost volume estimation, and generally provide a way to more efficiently estimate differences between images. The inventors have found that run-time speed ups with very little additional optimisation were observed when RKADe was tested across a number of widely available commercial CPUs, such as the Mac M2 chip and the Qualcomm Snapdragon chip. For example, in some instances, RKADe results in a greater than 50% runtime reduction on widely-used standard Intel processors, as compared to a purpose-built custom MAD cost volume implementation. At a general level, a realisation of the present inventors is that, where a mapping (e.g. an input, an operation applied to that input, and an output) is sparse in a basis (e.g. in the case of a representation of an image such as a flow or residual between two images represented by mostly zeros), the mapping can be almost identically represented in fewer dimensions. This facilitates the implementation of that mapping in a simplified manner. In the case of flow estimation in image sequences, there is often very little movement so cost volumes are
typically sparse in a transform of the spatial domain. This means that the calculation of cost volumes representing the difference between two images can be substantially simplified. For example, if a single pixel in one image is being compared to a pixel patch within a pixel radius of 3 around the corresponding coordinate in a second image, this would entail 49 pixel comparison operations to obtain the cost volume associated with that pixel using traditional methods. It turns out that the vast majority of these pixel operations relate to redundant information when the input and/or output are sparse, as there is little or no difference. In such circumstances, the cost volume can more efficiently be estimated by applying a fixed or learned map to the input to produce an identical or almost identical cost volume. Applying a map, e.g. a linear map through a number of convolution operations, is computationally efficient and fast and provides a lower-dimensional representation of otherwise the same information that would have been provided in a cost volume estimated using traditional methods. Further, the compressive cost volumes that are computed using the RKADe approach come with an additional, substantial run-time saving, because any downstream tensor operations (TOPs) that take place are in the lower dimensions of the compressive cost volume compared to the higher dimensions of the traditional cost volume; and none of the steps comprising RKADe requires grouped convolutions, slow memory access or array operations like de-interleaving. An exemplary pseudocode implementation of RKADe is provided below:
1) Receive input tensor x_t representing a first image, and input tensor x_ref representing a second image.
2) Generate a feature map of x_t: feature_map_x_t = feature_map(x_t)
3) Generate a feature map of x_ref: feature_map_x_ref = feature_map(x_ref)
4) Generate a region map of x_ref from the feature map of x_ref: region_map_x_ref = region_map(feature_map_x_ref)
5) Generate transform maps of x_t and region_map_x_ref to ensure the same shape to facilitate direct comparison:
transform_map_x_t = transform_map(x_t)
transform_map_region = transform_map(region_map_x_ref)
6) Calculate absolute difference: RKADe_output = absolute_difference(transform_map_x_t, transform_map_region).
Figure 13 illustratively shows the above steps of RKADe. As shown in Figure 13, an illustrative RKADe workflow on a toy example comprises four elements: a feature map F, a region map R, a transform map T, and an absolute difference Δ. The feature map F, defined by one or more layers comprising one or more filters defined by a plurality of weights, operates as a linear embedding of the input tensor into a feature
space. It is a local embedding in the sense that it may comprise a 1x1 convolution. Efficient implementations of 1x1 convolutions on a wide variety of commercial CPUs, GPUs and NPUs are known to the skilled person. However, unique in the case of RKADe is that the feature map operation filters may comprise random weights and do not need to be trained (although it is envisaged that they may be trained in some circumstances). The feature map operation weights may be randomly distributed weights, and the operation relies on the favourable properties of high-dimensional random embeddings to preserve local geometry. Accordingly, the feature map F operation filters are instantiated using random weights with a normalisation that makes the map an isometry in expectation. That is, the feature map operation filters comprise weights that apply a transformation that, on average, preserves geometrical distances of the distribution it is applied to. In other words, the feature map operation preserves a norm of the inputs in expectation. Thus, if a convolution defining a feature map is an isometry in expectation, the norm of the input to which the convolution is applied is preserved in the output (in a probabilistic sense). These fixed, random weights of F and the isometry-in-expectation property of F effectively mean that, during inference, estimating differences between two images comprises applying a convolution with the random weights that the convolution was initialised with, rather than weights modified in some way during subsequent training. An example of a suitable random distribution of weights of F is any suitably initialised sub-gaussian distribution. Here, suitably initialised means initialised based on the shape of the input and/or output tensors. The region map R operation comprises a composition of 3x3 convolutions, optionally with intermediate non-linearities such as one or more ReLU maps or activations. The purpose of the
region map R is to entangle, in an output pixel, the information present in a local patch about the same pixel in the input (reference) image. Here, entangling information means combining information. The "radius" of the local patch is determined by the number of convolutions in R (e.g., three 3x3 convolutions give a 7x7 patch). Because R comprises 3x3 convolutions and optionally simple non-linearities, it is possible to use known, efficient 3x3 convolution implementations to run efficiently on a wide variety of commercial CPUs, GPUs and NPUs. As above with the 1x1 convolutions, R is instantiated using a weight normalisation that makes it an isometry in expectation. The transform map T operation serves as a post-embedding or post-processing of the feature embedding and the entangled patch information, permitting the effective comparison of the two. This transform map T operation can be a simple 1x1 convolution, thereby being local, linear, fast, and efficient to implement across a wide variety of commercially available CPUs, GPUs and NPUs. As above, T may be instantiated using a weight normalisation that makes it an isometry in expectation. In some implementations, the weights of the transform map T may correspond to those of the feature map F. Further, the number of input channels for F can naïvely be set to any positive integer. The number of output channels of F may match that of R, and hence the number of input and output channels of R must be the same. The number of input channels of T does not need to match its number of output channels. Note for completeness that the number of input channels of F can be set naïvely because, if the mathematical relationships of RKADe are to hold in practice, it is envisaged that there is a sufficient dimensional relationship between the pixel radius (encoded by the number of layers
of R) and the number of output channels of F. This ensures that the shapes of the objects being compared match when the absolute difference is subsequently calculated. As shown in Figure 13, the feature map F is applied to a first input tensor representation 1300a of a first image, for example a current frame x_t of a sequence of images, and a second input tensor representation 1300b of a second image, for example a previous frame x_ref of the sequence which may contain movement relative to the first image. For each input pixel of the first input tensor representation 1300a, a comparison is made with pixels at coordinates within a predetermined radius around the corresponding coordinate in the second input tensor 1300b. An illustrative toy example of a 1 pixel radius around a center pixel coordinate is indicated with the dotted borders in Figure 13. The feature map convolution operation F is applied to the pixels of the first input tensor 1300a and the associated patches of the predetermined radius in the second input tensor 1300b. The region map R convolution operation is then applied to the output of the feature map convolution operation on the second input tensor 1300b. A transform map T convolution operation is then optionally applied to the intermediate outputs, before an absolute difference is estimated, resulting in the RKADe cost volume tensor. It will be appreciated from Figure 13 that none of the feature map F convolution, the region map R convolution or the transform map T operation requires grouped convolutions. Accordingly, the intermediate outputs may be easily stored in contiguous memory blocks without requiring a large number of memory read and write operations to interleave or de-interleave the data in memory. As a result, cost volume estimations in the RKADe approach are substantially sped up compared to traditional, naive patch-wise comparison approaches.
It is also noted that F, R and/or T may be kept fixed (e.g. F's weights may be fixed and randomly distributed between minimum and maximum values, as described herein), or may be trained. Keeping the maps fixed facilitates straightforward deployment by substituting any naive patch-wise comparisons in AI-compression pipelines (e.g. a large number of MAD-based operations). However, when fixed, the expressivity of the maps may be reduced and thus the overall accuracy of this approach may be reduced. Conversely, training F, R and/or T increases the expressivity of RKADe but introduces greater complexity in training and inference pipelines and can adversely affect training stability of an end-to-end trained AI compression pipeline by virtue of the introduction of a further trainable element. The choice of using fixed or trained F, R and/or T may accordingly depend on a complexity-accuracy-run-time trade off for a given application. In exemplary embodiments, it is envisaged that the weights of R and T are trained or learned, whereas the weights of F are random and fixed. Combining these components together to get the absolute difference Δ that defines RKADe, let x, y ∈ R^{c_in×H×W}, let F and T be convolutions with kernel sizes [c_out, c_in, 1, 1] and [d_out, c_out, 1, 1] respectively, and let the convolutions
comprising R have kernels with size [c_out, c_out, 3, 3]. Mathematically, one may write RKADe as

RKADe(x, y) := |T (Fx − R(Fy))|.

Above, note that F is a linear map operating pixel-wise (i.e. performing the same operation on each c_in-dimensional pixel); that R also acts pixel-wise, but insofar as each pixel is represented by a patch of a given radius that has been induced by the number of convolutions used to construct R; and that T also acts linearly and pixel-wise on each c_out-dimensional pixel. By virtue of its construction using elementary convolutional components, RKADe runs on hardware
without the requirement of grouped convolutions, and thus avoids the slow memory access and array operations (such as de-interleaving) of cost volume approaches such as naive patch-wise MAD. As alluded to above, F, R and/or T may be individually or separately trainable, for example by setting a "requires_grad_" PyTorch or similar flag in their respective code-level implementations to permit backpropagation through them. However, it is envisaged that F is generally kept fixed while R and/or T are trainable. In this case, they may be trained in an end-to-end manner with the rest of the pipeline, whereby the weight values are simply additional parameters, on top of the other parameters of the pipeline, that may be updated during back-propagation. More specifically, it has been found that end-to-end training requires no special auxiliary loss terms to guarantee stability during training. Indeed, advantageously, the F, R and T maps are "plug-and-play" onto the rest of the AI compression pipeline during training and subsequent inference. Optionally, student-teacher training of the weights of the R and T maps (and optionally F) is also effective and achieves good training stability out of the box without difficulty. Multiple approaches to student-teacher training are envisaged. For example, the teacher may be set up to push the training towards a fine-grained level that represents similar features to classical cost volumes, or at a less granular level where the teacher comprises a flow network with MAD cost volumes and the student comprises a flow network with RKADe cost volumes, with the loss based on the difference between these two. Alternatively, the teacher may be set up to push the training towards some other objective and may be configured accordingly.
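A minimal sketch of the RKADe operator defined above is given below, assuming PyTorch. The channel counts, the use of three 3x3 convolutions in R, and the particular uniform initialisation bound are illustrative assumptions rather than the specific configuration of the pipeline; the bound used here is simply one normalisation under which uniformly distributed weights preserve the input norm in expectation.

```python
import math
import torch
import torch.nn as nn


def init_expected_isometry_(conv):
    # Uniform weights in [-a, a]; the bound is an illustrative choice intended to
    # make the convolution roughly norm-preserving in expectation.
    c_out, c_in, k1, k2 = conv.weight.shape
    a = math.sqrt(3.0 / (c_out * k1 * k2))
    nn.init.uniform_(conv.weight, -a, a)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)


class RKADe(nn.Module):
    def __init__(self, c_in=3, c_feat=32, c_out=16, n_region_convs=3):
        super().__init__()
        # F: local 1x1 embedding with fixed random weights.
        self.F = nn.Conv2d(c_in, c_feat, 1, bias=False)
        # R: composition of 3x3 convolutions (three here, giving a 7x7 patch).
        self.R = nn.Sequential(*[nn.Conv2d(c_feat, c_feat, 3, padding=1, bias=False)
                                 for _ in range(n_region_convs)])
        # T: post-embedding 1x1 convolution.
        self.T = nn.Conv2d(c_feat, c_out, 1, bias=False)
        for m in [self.F, *self.R, self.T]:
            init_expected_isometry_(m)
        # F may be kept fixed; R and T may be trained end to end with the pipeline.
        self.F.weight.requires_grad_(False)

    def forward(self, x_t, x_ref):
        # RKADe(x, y) = |T(Fx - R(Fy))|
        return (self.T(self.F(x_t) - self.R(self.F(x_ref)))).abs()
```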
More generally, if there are multiple training stages of the compression pipeline (e.g. pre-training and main training), the weights of F, R and/or T may be frozen at different times during these stages by setting the "requires_grad_" flag appropriately at different training steps. By way of illustrative example, consider a scenario where F is fixed and R and T are learned: R and T may be frozen with initialisation weights during the initial pre-training phase of the compression pipeline before being unfrozen and trained during the main training phase. This approach ensures that the weights of R and T are updated based on a stronger training signal from the rest of the neural networks of the compression pipeline in order to decrease overall convergence time, thereby speeding up training. As described above, the distribution (i.e. values) of the initialisation weights of F, R and/or T may be random, based on some predetermined distribution, or based on prior knowledge obtained from experimentation to provide a warm-started signal from the rest of the model at the point when one or more of F, R and/or T unfreeze and become trainable. In one illustrative example, the initialisation weights may be initialised using any appropriately normalised sub-gaussian distribution producing a map that is an isometry in expectation (for example, a Gaussian distribution, a truncated Weibull distribution, or a uniform distribution). In some embodiments, it is envisaged that, during training, the property of being an isometry in expectation is maintained as the weights of F, R and/or T, as applicable, are adjusted. This property may be enforced using Jacobian regularisation, such as that described in concept 5 below. Alternatively, the isometry preserving property may only be retained upon initialisation,
and training will gradually eliminate that property as the weights converge to some final values. Finally, it is also envisaged that optionally some of the weights of F, R, and/or T may be kept fixed, while others may be trained, for example to avoid significant departure from the isometry preserving property during training. In an illustrative toy example, the weights of F, R and/or T may be initialised by generating randomly distributed values uniformly distributed between a minimum value and a maximum value, whereby the minimum and maximum value are based on a number of input or output channels on which F, R and/or T are applied, based on a kernel size of F, R and/or T, and/or on a pixel radius across which the RKADe cost volume is to be calculated (e.g. a 3 pixel radius resulting in a comparison against a 7x7 pixel patch centered on a given pixel coordinate). In other words, the minimum and maximum values are based on the dimensions of the input tensor. For example, an initialisation weight distribution can be defined as:

weight_distribution = uniform_noise(−a, a),

where −a is the minimum value and a > 0 is the maximum value of the uniform random weight distribution defined by:

a = sqrt( input_channels / (output_channels · kernel_dimension_one · kernel_dimension_two) ).

A filter comprising weights distributed as above preserves the norm of the input to which it is applied, in expectation. As described above, optionally, Jacobian regularisation may be applied during training to ensure the isometry in expectation is preserved even as the weights are updated. Alternatively, the weights may be free to lose this property if the training results in them doing so. Of course, if any of F, R, and/or T are fixed, then the property will be preserved
naturally, as the initialised weights do not change. It is particularly envisaged that F may be fixed in this way, while the weights of R and/or T may be trained. Optionally, the training regime may be implemented with one or more switches that specify exactly when during training to freeze or unfreeze the trainable parameters of F, R and/or T based on some predetermined one or more conditions (such as a number of iterations, a loss threshold, or other conditions). Where one or more of the F, R and/or T maps are fixed and, for example, there are no non-linearities such as ReLUs in any of F, R and/or T, the weights are initialised as described above, e.g. with random weights for R and/or experimentally determined weights for F and T, and there are no special loss or training considerations that need to be taken into account as the maps are simply linear mappings. Whilst concept 3 has been described in the context of using spatial comparisons to estimate flow between two images in an image sequence as part of a flow-residual-based AI compression pipeline, it is envisaged that it may be used anywhere where spatial comparisons are calculated, both in and outside of the neural image and video compression domains, as is described in more detail below in concept 4.
Concept 4: Regional Kernel Absolute Deviation (RKADe) for General Computer Vision and Image Processing Tasks
Spatially comparing two images is widely performed across various technical domains. However, naive patch-wise comparisons (using any measure of similarity) are slow on typical commercial hardware. For example, if naive patch-wise comparisons using MAD operations are used and implemented as convolutions on existing commercial CPUs, GPUs and NPUs, the patch-wise comparisons introduce a bottleneck to run time. This is particularly problematic for technical domains where low latency and fast run times are critical to functionality. Accordingly, run-time advantages are realised for any use case that is presently implemented with patch-wise comparisons (using any measure of similarity), for example any use case currently implemented as a MAD-based patch-wise comparison. In contrast, RKADe is faster than naive patch-wise comparison operations in any computer vision and/or image processing task because it is a compressive approximation of such a patch-wise comparison. A first example use case where run-time improvements may be realised with RKADe is in the generation of bounding boxes around image patches where movement is to be detected. Consider a first image at a first time and a second image temporally separated from the first image. The objects in the second image have moved relative to their positions in the first image. In computer vision tasks such as surveillance, satellite image comparisons, drone navigation, and others, detecting such movement is a common task as it facilitates tracking of objects across different views and across time. One approach to such detection is to generate a bounding box around objects whose pixels differ between the first and second images.
One approach to generating such bounding boxes is to divide the first and second images into grids, and to compare the pixel values in the first image to individual pixels or groups of pixels in the second image. One way to make this comparison is to use MAD implemented as convolutions. If the MAD for a given pixel or patch of pixels exceeds a threshold, that pixel or patch of pixels may be identified as a movement-containing patch (whereas any that do not exceed the threshold may be identified as static patches). The boundaries of one or more bounding boxes may then be generated that encompass some or all of these movement-containing patches and used to identify the moving object across frames. As described above, naive patch-wise comparisons implemented as convolutions are a run-time bottleneck. Accordingly, the RKADe approach from concept 3 may be applied to the bounding box task:

RKADe(x, y) := |T (Fx − R(Fy))|,

where x is the first image, y is the second image, and T, F and R are as described in connection with concept 3. This approach to detecting moving objects facilitates the running of object detection locally and in real time on end devices such as on board a camera-based surveillance system, a drone such as a small quadcopter, and other hardware platforms that typically do not have access to powerful processor resources, or where power draw is at a premium due to hardware constraints such as battery life.
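A toy, illustrative sketch of the bounding box step is given below, assuming PyTorch: a per-patch difference map (computed, for example, with an RKADe operator such as that sketched in concept 3, or any other patch-wise comparison) is thresholded and a single box is drawn around the moving pixels. The function name and threshold are assumptions for illustration.

```python
import torch


def movement_bounding_box(diff_map, threshold=0.5):
    """diff_map: (H, W) per-pixel or per-patch difference scores."""
    moving = diff_map > threshold
    if not moving.any():
        return None  # static frame pair: nothing to track or transmit
    ys, xs = torch.nonzero(moving, as_tuple=True)
    # Box corners (y_min, x_min, y_max, x_max) enclosing all movement-containing patches.
    return ys.min().item(), xs.min().item(), ys.max().item(), xs.max().item()
```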
More generally, the bounding box generation approach may also be used in image and video compression pipelines, including both traditional and AI-based compression pipelines. In this case, it may be used to facilitate partial frame-skipping to further reduce the amount of data that needs to be sent to reconstruct an image or image sequence. For example, if objects in two images have hardly changed except for a small number of pixels or pixel patches, the above-described bounding box generation approach may be used to identify and extract those movement-containing patches. Only these movement-containing patches then need to be compressed and transmitted in order for the full image sequence to be accurately reconstructed. That is, on the decode side, the original image sequence can be reconstructed efficiently by stitching together previously decoded static image patches with the newly received movement-containing patches. As described above, applying RKADe to estimating differences between pixel patches instead of a naive patch-wise comparison facilitates a substantial run-time improvement. Other non-limiting use cases where applying RKADe instead of a naive patch-wise comparison results in run-time improvements include:
Image Registration: In medical imaging or remote sensing, images taken at different times or from different sensors need to be aligned or "registered" to each other. Here, the alignment or registering of images to each other with a naive patch-wise comparison (for example where MAD is used as a similarity metric to align these images accurately by finding the transformation
that minimizes the average absolute intensity differences between them) can be replaced by the present RKADe approach for run-time improvements.
Stereo Vision and Depth Estimation: When calculating depth from stereo images, naive patch-wise comparisons using MAD are typically used to compare corresponding patches in the left and right images. The disparity (difference in horizontal position) that minimizes the MAD is often chosen as the correct match, which is then used to compute depth information. Here, the replacement of the naive, MAD-based patch-wise comparison results in run-time improvements in calculating depth; see also the sketch following this list of use cases.
Template Matching: In object detection and computer vision, template matching involves sliding a template image over a target image to find the region that best matches the template. A naive, MAD-based patch-wise comparison is typically used as a measure to find the location where the template and the target image have the least absolute difference, indicating a potential match. Again, replacement of this approach with RKADe results in faster matching times.
Noise Reduction: In image denoising, MAD can be used to compare the local neighborhood of pixels. Filters like the median filter or adaptive filters use MAD to determine the level of noise in a local patch and to adjust the filtering strength accordingly to reduce noise while preserving details. This initial determination of noise levels in a local patch can be achieved faster by applying the RKADe approach.
Quality Assessment: For quality control in manufacturing, naive patch-wise comparisons are used to compare images of a product against a standard reference image. Differences beyond
a certain threshold can indicate defects or deviations from the desired quality. Again, this may instead be implemented with RKADe to provide run-time speed ups and to facilitate running quality control algorithms on edge devices that do not have significant computing power.
Change Detection: In satellite imagery analysis, naive patch-wise comparison can be used to detect changes over time by comparing pixel intensities of the same location across different dates. This is useful in monitoring urban development, deforestation, or the effects of natural disasters. The use of RKADe instead of a naive patch-wise comparison facilitates the running of such change detection algorithms more quickly, thus allowing real-time change detection on resource-constrained devices.
Photogrammetry: In reconstructing 3D models from 2D images, naive patch-wise comparisons can be used to ensure that the matching of pixels across multiple images is accurate, which is crucial for generating a reliable 3D representation. Again, the use of RKADe to replace the naive patch-wise comparisons results in faster run times, allowing photogrammetry systems to reconstruct 3D models in real time on smaller, resource-constrained devices.
In each of these cases, the granular detail that naive patch-wise comparisons provide about the differences between pixel intensities can be obtained more efficiently, faster, and on resource-constrained devices by using an RKADe-based approach instead.
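By way of a toy, illustrative example only, the stereo disparity search mentioned in the use cases above (the naive baseline that RKADe would replace) might be sketched as follows, assuming PyTorch; the function name, patch size and disparity range are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def disparity_mad(left, right, max_disp=16, patch=3):
    """left, right: (1, 1, H, W) grayscale tensors. Returns an (H, W) disparity map."""
    pad = patch // 2
    costs = []
    for d in range(max_disp):
        shifted = torch.roll(right, shifts=d, dims=-1)       # shift right image by d pixels
        ad = (left - shifted).abs()
        # Aggregate absolute differences over a local patch (box filter).
        costs.append(F.avg_pool2d(ad, patch, stride=1, padding=pad))
    cost_volume = torch.cat(costs, dim=1)                     # (1, max_disp, H, W)
    return cost_volume.argmin(dim=1)[0]                       # disparity minimising the MAD
```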
Concept 5: Jacobian penalty for temporal sequence modelling
Neural network training stability refers to how reliably a neural network's learning process converges during training. Training instability manifests as large fluctuations in learning performance, where, during training, the model's loss and/or validation curves vary significantly, or even fail to improve at all, despite using the same data and training parameters. Some non-limiting factors that influence training stability include the choice of the optimization algorithm (e.g. stochastic gradient descent, momentum, Adagrad, Adam, and so on), the learning rate, the initialization of network weights, the network architecture, the quality and pre-processing of the input data, and so on. Traditionally, to improve training stability, techniques such as batch normalization, gradient clipping, regularisation, and hyper-parameter tuning are applied. Finding an optimum approach using these techniques requires burdensome experimentation, hyper-parameter sweeps and ablation studies because, in many cases, an optimum approach to achieving training stability for one type of model, architecture, and data set may only work on that model, architecture and data set. A known regularisation technique is to introduce a Jacobian regularisation or penalty term to the loss function. A Jacobian regularisation term or penalty in the context of training neural networks is a method used to control or influence the behavior of the model by regularizing its sensitivity to input changes. The Jacobian matrix represents the partial derivatives of the
model's outputs with respect to its inputs, effectively capturing how changes in the input affect changes in the output. In matrix form, these partial derivatives are the network's Jacobian matrix. In order to control the behaviour of the model, a norm of the Jacobian matrix is calculated and added to the loss function. Thus, if a change in the input causes the network output to change significantly, the partial derivatives will have high values, the norm of the Jacobian matrix will be high, and thus the penalty added to the overall loss will be high. As training progresses and the network learns to minimise the overall loss, it learns weights which avoid producing a Jacobian matrix which, when the norm is calculated, produces a high value. In machine learning generally, temporal sequence modelling is a challenging problem. In the context of AI-based image and video compression, this challenge manifests itself in the compression of image or frame sequences of a video. In low-motion sequences where there is significant information redundancy across frames, the networks of an AI-based compression pipeline such as that of Figure 3 and Figure 14 would ideally learn to produce latent representations of inputs that contain only minimal information and are highly compressible, and then re-use information from previously decoded frames when reconstructing the input. Conversely, in high-motion sequences the latent representations contain more information. It turns out that teaching these behaviours is extremely challenging. For example, the networks of a pipeline that performs well on high-motion frame sequences will not perform well on low-motion frame sequences, and vice versa. In general terms, this can be understood as the networks not having learned that, when two images x_{t-1} and x_t of a sequence are the same or similar, they can produce a substantially empty latent representation
to encode into the bit stream when compressing x_t, because all the information from x_{t-1} can be re-used. Indeed, it turns out that flow-residual architectures such as that shown in Figure 3 and Figure 14 struggle to perform well on low-motion frame sequences, and this can be attributed to the challenge of modelling the temporal sequence of frames. Present concept 5 is directed to solving this problem by introducing a special type of Jacobian penalty term into the loss function. In more detail, let n ≥ 1 be an integer. Denote by S^{n−1} the (n − 1)-sphere. Denote by [n] the ordered set {1, 2, ..., n}. Let f := (f^{(1)}, ..., f^{(m)}) : R^n → R^m be an almost-everywhere continuously differentiable function whose domain
X ⊆ R^n is compact. The directional derivative of any component function f^{(i)} in the (unit) direction v ∈ S^{n−1} at a point x ∈ R^n is given by

∂_v[f^{(i)}](x) := lim_{ε↘0} ( f^{(i)}(x + εv) − f^{(i)}(x) ) / ε.

When v is a standard basis vector e^{(j)}, we call ∂_{e^{(j)}}[f^{(i)}](x) the partial derivative of f^{(i)} (in the direction e^{(j)}) (at the point x):

∂_{e^{(j)}}[f^{(i)}](x) = ⟨∇f^{(i)}(x), e^{(j)}⟩ = ∂f^{(i)}(x) / ∂x_j.

The Jacobian matrix of f at a point x is the matrix J[f](x) whose (i, j)-th element is

J[f](x)_{i,j} := ∂_{e^{(j)}}[f^{(i)}](x),   (i, j) ∈ [m] × [n].
Next, define the fixed point set of a mapping f restricted to a (sub-)space X by

Fix f := {x ∈ X : f(x) = x}.
Define a (discrete, Markovian) temporal sequence model as a sequence of tuples (x_t, f_t, y_t)_t, where t ≥ 0 is an integer and where y_t := f_t(x_t, x_{t−1}, y_{t−1}). We accommodate the case t = 0 as y_0 := f_0(x_0). Applying this approach to image and video compression, x_t may be a frame at time t, x_{t−1} may be a frame at time t − 1, y_{t−1} may be a previously reconstructed frame from time t − 1, and f_t may be the function corresponding to the neural networks of the AI-based compression pipeline that we are training. More specifically, the special t = 0 case may correspond to encoding/decoding an I-frame with no dependency on a frame at a different time, whereas all other times correspond to encoding/decoding a P-frame or B-frame at time t that has some dependency on a frame from a different time t − 1. Now, assume that f_1 = f_t for all t ≥ 2, and define I := f_0 and P := f_1. Take for example the temporal sequence (x_0, f_t, y_t)_t = (x_0, I, y_0), (x_0, P, y_1), (x_0, P, y_2), ... Observe that if P were a function of y_{t−1} only, then P would (favourably) exhibit perfect recovery if we had y_{t−1} = x_0 and it held that P(y_{t−1}) = P(x_0) = y_t = x_0. In other words, if P were a function of y_{t−1} only, then we would desire that the ground truth data x_0 be a fixed point of P. Applying this specifically to image and video compression, if P were a function of y_{t−1} only (i.e. there is no new information in the current frame x_t compared to the previously decoded frame y_{t−1}), then we would want the networks of the pipeline to perfectly reproduce the previously decoded frame y_{t−1} as their output when encoding/decoding the input x_t.
Our wish accordingly is to invoke known results on fixed point existence theorems that rely on smoothness characterisations of the relevant mapping. However, in the present context the map P is generally non-smooth and is possibly a composite representation of mappings, some of which have discrete domain, destroying applicability of such smoothness characterisations. Concept 5 is directed to overcoming this problem by introducing the assumption that P is a composition of two maps, D ∘ E, such that D is almost-everywhere continuously differentiable and whose domain contains a compact Borel set of positive measure. One could think of E as an "encoder" that maps a signal to its compressed or latent representation; and D as a "decoder" that maps a compressed or latent representation to a corresponding signal in the output domain. With this in mind, define the auxiliary map

Q : (ŷ_t, y_{t−1}) ↦ (ŷ_{t+1}, y_t) := (E(ŷ_t, x_t), D(ŷ_{t+1}, y_{t−1})).

In the setting of the exemplary temporal sequence, one has (x_t, x_{t−1}, y_{t−1}) = (x_0, x_0, y_{t−1}) ∈ Fix Q if y_{t−1} = y_t. Moreover, f exhibits favourable recovery if y_{t−1} = y_t ≈ x_0. In other words, if we apply the defined auxiliary map to a given input, the following behaviour is observed: if a current frame x_t has no motion or new information relative to a previous frame x_{t−1}, the reconstructed current frame y_t correctly corresponds exactly to the reconstructed previous frame y_{t−1}, satisfying our requirement that the networks should simply re-use previous information where there is no motion between frames, to perfectly reconstruct the previous information.
Thus we desire a method of promoting that Q admit such a fixed point, and in particular to have (x, x, x) ∈ Fix Q for all x ∼ D, where D is some data distribution of interest that is supported on X. Under suitable conditions it is guaranteed that if the Lipschitz constant of a function is sufficiently small, that function admits a fixed point on a given domain (e.g., the Banach fixed point theorem and Browder fixed point theorem are both admitted by settings relevant to the current continuous-space framing of the question under present consideration). We thereby make the following Ansatz: an appropriately regularised and sufficiently over-parametrised function P_θ := D_θ ∘ E_θ with weights θ admits a configuration θ_0, auxiliary mapping Q (defined as above) and sufficiently small parameter ε ≥ 0 (that may depend on parameter space complexity or other complexity factors) such that Lip Q ≤ 1 and, thereby,

E_{x∼D} dist((x, x, x), Fix Q) < ε,

where dist(x, A) is the distance (e.g., Euclidean distance) of the point x to the set A. Observe that ε encodes how well the "discovered" fixed point manifold Fix Q is adapted to the (unknown or incompletely characterised) data distribution D. In other words, P exists and can be learned such that its auxiliary mapping has a small Lipschitz constant and the fixed point set of the auxiliary mapping, induced thereby, significantly coincides with the points from the data distribution of interest. Thus, the training objective becomes the task of learning weights that produce a fixed point set for mappings that act on low-motion input frames, effectively
producing a network that acts like an identity operator on a previously decoded frame when the current frame is substantially the same as or similar to the previously decoded frame. Applying all of the above to temporal sequence modelling, such as that which occurs in (for example) AI-based image or video compression pipelines, our subsequent Ansatz is that this parameter θ_0 can be approximated with a suitable training regime. That is, the desired fixed point manifold can be well approximated by a deep-learning based temporal sequence model, permitting stable long-term temporal dependence modelling with recurrent implementations. In general terms, there exists a set of weights for which the network acts as an identity operator when two input frames x_t and x_{t−1} are identical. Recall that an L-Lipschitz function f that is differentiable at a point x_0 satisfies, for a unit direction vector v:

∂_v[f](x_0) ≤ L.

In particular, for simplicity assume that the domain X is compact and that f is everywhere continuous and almost everywhere continuously differentiable on X. Then, there exists a point x ∈ X such that

L = sup_{v ∈ S^{n−1}} ∂_v[f](x) = ∥J[f](x)∥

(where ∥ · ∥ denotes the operator norm). Accordingly, when f is sufficiently well behaved, controlling the operator norm of the Jacobian of f controls its Lipschitz constant, in turn permitting the opportunity for obtaining a function f with a fixed point manifold, and allowing us to train the network of the AI-based compression pipeline corresponding to this function f by including a Jacobian penalty in the loss function based on the auxiliary map defined above.
That is, a Jacobian penalty is calculated from an auxiliary function that is based on one or more of the components of the pipeline but modified so that (1) the input space matches the output space (that is, the input variables to the auxiliary function have the same form, shape and dimensions as the outputs) and (2) the input latent is passed through and returned by the function without amendment. To illustrate this at a practical level, consider the residual decoder of the compression pipeline of Figure 14. Our goal is to encourage the residual decoder to learn weights that perfectly reconstruct a previously decoded frame when the current frame is substantially identical (i.e. no motion between the frames of the sequence). The residual decoder of the actual pipeline receives as input a latent tensor and an optical flow information tensor (e.g. in the form of a warped current frame), and outputs a reconstructed image. In this case, the input space does not match the output space (i.e. the number of variables, the forms, shapes and dimensions of the inputs and outputs are different because the output is only the reconstructed frame and not the latent tensor). Thus, if we calculate a Jacobian penalty based on this (i.e. based on or approximated from a matrix of partial derivatives of how the output changes with respect to changes in the input), and add this Jacobian penalty to the loss function, the weights will not generally converge to values that produce our goal of a residual decoder that perfectly reconstructs the previously decoded frame when the current frame is substantially identical. Instead, in this illustrative example, we construct an auxiliary function that: (i) is based on the residual decoder, i.e. a function that operates identically to the residual decoder during a forward pass and accordingly similarly receives as inputs a latent tensor and an optical flow
information tensor in the form of a warped input image, but (ii) now not only returns the reconstructed image, but also returns as output the original input latent tensor. Thus we have a function where the input space (latent tensor and warped input image) matches the output space (latent tensor and reconstructed image). When we subsequently calculate the Jacobian penalty from this function, the latent tensor will appear as both a variable considered an input and a variable considered an output when the partial derivatives of the Jacobian matrix are being estimated or approximated. In layman’s terms, the effect of this is the moderation of significant changes to the latent tensor as sequences progress. In more formal terms, the above-described mathematical relationships apply and the weights converge to values during training that, during inference, exhibit the behaviour of perfectly reconstructing a previously decoded frame when the current frame is substantially identical. Accordingly, training an AI compression pipeline by using a Jacobian regularisation term based on an auxiliary function such as this, in which the input space matches the output space, encourages the learning of weights in which the networks are better able to reconstruct sequences of frames in a temporally consistent way. This illustrative example is given in the pseudocode below:
Algorithm 1 AI-based compression network training with residual decoder auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate γ, regularization parameter λ, number of epochs N, network architecture f_θ
Initialize network parameters θ
for epoch = 1 to N do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through auxiliary function to compute Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t)
        Compute loss Loss = D + λR + J_aux
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimizer O: θ ← O(θ, ∇_θ Loss, γ)
    end for
    Optionally evaluate on validation set
end for
That is, a training data set X is provided. A learning rate γ, a regularisation parameter λ, and a number of training steps or epochs N are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty. The backward pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. Optionally, a validation loss can be calculated, and the training loop is repeated until the predetermined number of steps or epochs N have been completed, or
some other criteria have been reached. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. As described above, the Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t) is based on an auxiliary mapping function that includes the latent tensor ŷ_t on both sides of the mapping so that the input space matches the output space, thus resulting in convergence to a set of weights that exhibits the desired temporal stability for frame sequences. Note that, whilst the Jacobian penalty term is shown to be based on the mapping (ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t), it is envisaged that each of these variables may be processed in some way without affecting the general applicability of the above-described mathematical properties of the input space matching the output space. For example, where the architecture of f_θ includes a flow module, and a flow estimation and/or a warped previously decoded image is used in place of x̂_{t−1}, then the mapping may be based on this warped image, e.g. x̂_{t−1,warped}, and so on. Finally, it will be appreciated that, whilst the above example has been described in the context of the residual decoder, in that the auxiliary function from which the Jacobian penalty is calculated is based on the residual decoder, the above-described principles can be extended to be more generally applicable to any of the networks in an AI-based compression pipeline in which encouraging temporal consistency is challenging.
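A hedged, illustrative sketch of such an auxiliary function and its Jacobian penalty is given below, assuming PyTorch and a hypothetical residual_decoder callable. The penalty is approximated here with a single random-direction vector-Jacobian product, which is merely one possible estimator of a Jacobian norm and is not necessarily the estimator used in the pipeline.

```python
import torch


def auxiliary_map(residual_decoder, latent, warped_prev):
    # Same forward pass as the residual decoder, but the latent is also passed
    # through unchanged so that the input space matches the output space.
    x_hat = residual_decoder(latent, warped_prev)
    return latent, x_hat


def auxiliary_jacobian_penalty(residual_decoder, latent, warped_prev):
    latent = latent.detach().requires_grad_(True)
    warped_prev = warped_prev.detach().requires_grad_(True)
    out_latent, x_hat = auxiliary_map(residual_decoder, latent, warped_prev)
    # Flatten the matched input/output spaces and pick a random unit direction.
    outputs = torch.cat([out_latent.flatten(), x_hat.flatten()])
    v = torch.randn_like(outputs)
    v = v / v.norm()
    # Vector-Jacobian product of the auxiliary map with respect to its inputs;
    # its squared norm is used as a differentiable proxy for the Jacobian norm.
    grads = torch.autograd.grad(outputs, [latent, warped_prev],
                                grad_outputs=v, create_graph=True)
    return sum(g.pow(2).sum() for g in grads)

# Illustrative usage inside a training step (names are assumptions):
# loss = distortion + rate + lam * auxiliary_jacobian_penalty(dec, y_hat, x_warped)
```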
For example, the algorithm described above can be extended to introduce a Jacobian penalty term based on any input-to-output mapping where one of the terms is included as both an input variable and an output variable. This generalisation is illustrated in the pseudocode provided below:
Algorithm 2 AI-based compression network training with general auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate α, regularisation parameter λ, number of epochs E, network architecture f_θ
Initialise network parameters θ
for epoch = 1 to E do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through the auxiliary function to compute the Jacobian penalty J_aux(a, b ↦ a, c)
        Compute loss: Loss = D + λ_1 R + λ_2 J_aux
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimiser O: θ ← O(θ, ∇_θ Loss, α)
    end for
    Optionally evaluate on a validation set
end for
Similarly to the specific residual decoder example, a training data set X is provided. A learning rate α, a regularisation parameter λ, and a number of training steps or epochs E are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty. Some or all of the terms may be regularised by some constant λ_1 and/or λ_2. The backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. A validation loss may then optionally be
calculated, and the training loop is repeated until the predetermined number of steps or epochs E has been reached, or some other stopping criterion has been met. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. In this general case, a, b and c can be any variables or sets of variables from an AI-based compression pipeline including, but not limited to, one or more of: a current input image x_t; a reference input image x_{t−1} or x_{t+1}; a reference previously reconstructed image (warped or unwarped) x̂_{t−1}, x̂_t or x̂_{t+1}, or x̂_{t−1,warped}, x̂_{t,warped} or x̂_{t+1,warped}; a flow f_t; and/or any latent representation thereof, and so
on, including the use of any of these variables in any combination in upsampled and downsampled spaces as applicable. For example, for low-motion frame sequences, such as static nature scenes or CCTV feeds, where it is typically expected that flow will not vary significantly between frames and where the networks are failing to converge to a set of weights with the desired behaviour of producing flow information that varies little between frames, we can construct an auxiliary mapping from which to estimate a Jacobian penalty by using, for example: J_aux(ŷ_{f,t}, x̂_{t−1} ↦ ŷ_{f,t}, x̂_t)
That is, the reconstructed flow latent is provided on both sides of the auxiliary mapping from which the Jacobian penalty is calculated. When the loss function then incorporates this Jacobian penalty term, the network weights converge to a set of values where flow is
encouraged to be reconstructed in a way that varies little across a sequence of frames. Effectively, the networks are learning the set of weights that have a fixed point for the desired auxiliary mapping behaviour. A similar approach can be taken for any and all of the example variables referred to above to encourage the networks of the AI-based compression pipeline to learn how to behave in a temporally consistent way for frame sequences. It is also envisaged that multiple different Jacobian penalties based on such mappings may be introduced into the loss term, for example J_aux,1, J_aux,2, ..., J_aux,n, as applicable to encourage the learning of temporally consistent behaviour by any number of components of the AI-based compression pipeline. Further, any weighted norm of any such Jacobian penalty or penalties may be incorporated into the AI-based compression pipeline. Specifically, a weighted norm of the Jacobian multiplies the components of the matrix using weights, computed in some manner relevant to the task at hand, such that the resulting "weight and norm" operation still satisfies the mathematical definition of being a (quasi-)norm. Such weights may be computed, for example, as a function of the amount of motion information that is present in the image, or according to metrics that define the presence of occlusion between two frames. The motion information may be captured indirectly in the flow information estimated by the flow module of the pipeline, or by some other measure such as, for example, a direct pixel difference calculation between the pixels of two or more images. Moving on to practical considerations, given the many different variables and mappings from which the Jacobian penalty described above may be estimated, the above-described approach
can increase network training times significantly, for example by 30% or more. To help address this, it is envisaged that the presently described Jacobian penalties may be estimated using the following approach. It will be understood that the Frobenius norm of a square matrix A ∈ R^{n×n} also controls its operator norm. Further, recall that Hutchinson's trace estimator (where z ∈ R^n with z_i ∼ iid N(0, 1)) is given by

tr(A) = E[z^⊤ A z] ≈ (1/ℓ) Σ_{k=1}^{ℓ} (ẑ^{(k)})^⊤ A ẑ^{(k)},

where ẑ^{(k)}, k = 1, ..., ℓ, are ℓ independent realisations of z. Now suppose that A := [∂f](x)^⊤ [∂f](x) for some function f, and observe that z^⊤ A z = ∥[∂f](x) z∥², so that the above estimator requires only Jacobian-vector products of f. Thus, in the standard set-up for minimising a loss function using stochastic gradient descent, suppose that on iteration i one has a batch x^{(i)} and a function f(·; θ^{(i)}) with parameters θ^{(i)}. Set ℓ = 1 to take a 1-sample approximation for the trace estimate, i.e.,

tr(A) = tr([∂f]^⊤ [∂f]) = ∥[∂f]∥²_F ≈ z^⊤ [∂f(·; θ^{(i)})](x^{(i)})^⊤ [∂f(·; θ^{(i)})](x^{(i)}) z.

Now the Jacobian-vector product may itself be approximated by a finite difference,

[∂f(·; θ^{(i)})](x^{(i)}) z ≈ ( f(x^{(i)} + h z; θ^{(i)}) − f(x^{(i)}; θ^{(i)}) ) / h,

for some small step size h > 0. Further details of the
step size estimation strategy are here omitted for brevity. Applying the above formal explanation to the specific example of calculating Jacobian penalties for an AI-based compression pipeline loss function, recall that the Jacobian penalty may be
calculated by estimating the partial derivatives of the network's outputs with respect to its inputs, which together form a Jacobian matrix. The penalty may be estimated by calculating the norm of this matrix (or the trace of a related matrix, as is known in the art). However, calculating the norm (or trace, as applicable) of the Jacobian matrix exactly is computationally very expensive. Instead, we use an approach based on Hutchinson's trace estimator and finite differences. In the specific case of AI-based compression pipelines, it turns out that we can obtain a good trace estimate by making a 1-sample approximation. Even though the sample size is only 1, the present inventors have found that the estimate is nevertheless accurate enough to estimate a Jacobian penalty that encourages the learning of weights with the desired behaviours described above. Accordingly, the 1-sample approximation using finite differences facilitates the estimation of a Jacobian penalty in a way that is significantly faster than traditional methods and facilitates training with multiple Jacobian penalties without significantly increasing training times. That is, the time attributed to estimating the Jacobian penalty is an insignificant fraction of the overall training time. This in turn enables training with Jacobian penalties applied to multiple components of the pipeline to encourage temporally stable behaviour while keeping overall training times substantially the same, thereby reducing overall cost per training run.
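A minimal sketch, in PyTorch, of how such a one-sample estimate with a finite-difference Jacobian-vector product could be computed is given below. Here f is assumed to be a callable wrapping the auxiliary mapping (taking and returning a single flattened tensor), and the fixed step size h is a simplification; the step size estimation strategy referred to above is not reproduced.

import torch

def jacobian_penalty_1sample(f, x, h=1e-2):
    # One-sample Hutchinson estimate of the squared Frobenius norm of the
    # Jacobian of f at x. The Jacobian-vector product is approximated by a
    # forward finite difference, so only two forward passes are required.
    z = torch.randn_like(x)              # z_i ~ N(0, 1)
    jvp = (f(x + h * z) - f(x)) / h      # approximates [df](x) z
    return (jvp ** 2).sum()              # one-sample estimate of ||J||_F^2

For example, the training loss could then be formed as Loss = D + λ_1 R + λ_2 jacobian_penalty_1sample(f, x), where f wraps the auxiliary mapping described above.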
Another practical consideration that is now discussed is how to ensure that the estimated Jacobian penalty or penalties are not over-penalising the loss when training on frame sequences where their introduction is not conducive to learning a suitable set of weights. For example, consider again the Jacobian penalty described above that is based on an auxiliary function derived from the residual decoder. As described above, the Jacobian penalty in this case encourages the learning of a set of weights that allow the near-perfect reconstruction of a previously decoded image when there is little or no movement between frames. However, when this approach is applied to very high-motion frame sequences, the behaviour we are encouraging with the Jacobian penalty results in a drop in accuracy. To overcome this issue, the present inventors have realised that the Jacobian penalty described above can itself be regularised based on a property of the frames of the sequence being trained on. That is, the Jacobian penalty may be made "motion-aware" by weighting it according to a property of the frames of the sequence being trained on, such as how much movement there is between frames. This movement may be captured indirectly in the Jacobian matrix from which the Jacobian penalty is calculated, and accordingly the penalty may comprise a weighted norm W(J_aux), where the mapping W may comprise e.g. a Frobenius norm, a spectral norm, or any other norm. Whereby, for example, the weighting may scale the Jacobian penalty based on the combined strength of all the partial derivatives it contains, which will be higher in high-motion frame sequences. Alternatively, the movement may be captured directly, and the mapping that encodes the weighting W of the Jacobian may be based on e.g. pixel differences, MSE, or some other measure of the differences between the frames at time t and some other time t − 1.
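As an illustration only, one simple way in which such a motion-aware weight could be computed directly from pixel differences is sketched below. The exponential form and the scale factor alpha are assumptions made for this sketch rather than features prescribed above.

import torch

def motion_weight(x_t, x_prev, alpha=10.0):
    # Scalar weight close to 1 when the two frames are nearly identical
    # (low motion), decaying towards 0 as the mean pixel difference grows,
    # so that the Jacobian penalty is dampened for high-motion content.
    mse = torch.mean((x_t - x_prev) ** 2)
    return torch.exp(-alpha * mse)

The weighted penalty would then enter the loss as, for example, Loss = D + λ_1 R + λ_2 motion_weight(x_t, x_prev) · J_aux.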
At a general level, the idea is that if the motion between two frames x_t and x_{t−1} is large, then we want the Jacobian penalty described above to be dampened to a lower value. Conversely, if the motion between two frames, two corresponding pixel regions within a frame, or any other information between which relative motion may exist, is small, then we do not want to dampen the Jacobian penalty associated with those frames, regions or pixels at all, and instead allow it to have the effects described above so as to learn the desired fixed-point weights for low-motion frames. This regularisation of the Jacobian penalty term may comprise a weighted norm or some other weighting and is illustrated in the pseudocode below:
Algorithm 3 AI-based compression network training with motion-aware auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate α, regularisation parameter λ, number of epochs E, network architecture f_θ
Initialise network parameters θ
for epoch = 1 to E do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through the auxiliary function to compute the Jacobian penalty J_aux(a, b ↦ a, c)
        Compute loss: Loss = D + λ_1 R + λ_2 W(J_aux)
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimiser O: θ ← O(θ, ∇_θ Loss, α)
    end for
    Optionally evaluate on a validation set
end for
Similarly to the specific residual decoder example, a training data set X is provided. A learning rate α, a regularisation parameter λ, and a number of training steps or epochs E are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty, which in this case is weighted by W. As above, the loss terms may be regularised using e.g. λ_1 and/or λ_2. The backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. Optionally,
a validation loss may be calculated, and the training loop is repeated until the predetermined number of steps or epochs E has been reached, or some other stopping criterion has been met. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. As explained above, the Jacobian penalty here is weighted by a function W, where W is defined to act element-wise on the object returned by a Jacobian-vector product between the Jacobian matrix and some random vector (in practice a tensor), such that the weighted norm induced by W satisfies the mathematical definition of a quasi-norm or pseudo-norm. A further difficulty that arises when training using the above-described Jacobian penalty is its interaction with different phases of training and different training schedules. For example, while the present description has been framed in the context of training on sequences of frames, the specific number of frames in a given sequence or "group of pictures" (GOP) used during training can vary. Note that a GOP is typically considered to be an I-frame and N P- or B-frames. Consider, for example, a training schedule that starts training on short GOPs of 5-6 frames, and then after a predetermined number of steps switches to training on longer GOPs of 7 or more frames, e.g. 8, 9, 10, 20, 30, 40, 50 frames or more. The Jacobian penalty described herein is ideally suited for encouraging the networks to be performant on said longer GOPs, but can in some cases act as noise when performing the initial training on the shorter GOPs. This may be, for example, because temporal consistency is less of a problem for shorter GOPs, and so the Jacobian penalty actually weakens the strength of the training signal.
More generally, Jacobian regularisation serves as an inductive bias to improve the network's generalisation performance on sequences with GOP sizes that are significantly greater than those seen in training. Indeed, if the network is only evaluated on sequences with the same GOP size seen in training, for example short GOP sequences, then temporal stability can be less problematic. However, it follows that, for temporal stability arising solely from training on different GOP sizes to be present in the wild, the training data necessary to achieve this would have to contain equal numbers of samples of all GOP sizes distributed equally across all video sequences, and so on - something which is burdensome to obtain in real-world settings. The presently described approach accordingly facilitates obtaining the same effect without needing as complete a training data set. Indeed, it becomes computationally prohibitive to train on very large GOP sizes, so even if a complete training data set comprising large samples of all GOP sizes encountered in the wild were available, it may not be practical to train on all of the data. Accordingly, the present disclosure facilitates the generalisation of performance, in the sense of temporal stability, to large GOP sizes, specifically those that are significantly larger than what is seen during training, irrespective of what GOP sizes are in the training data. To overcome this problem, it is envisaged that the presently described Jacobian penalty term may be introduced only at a predetermined time or times (e.g. in terms of number of training steps, or consequential to changing the training frame sequence length) during a training schedule. Further, the Jacobian penalty term may also be removed at a predetermined time or times for a similar reason or reasons, as may be applicable for a given training schedule. This adaptive approach thus enables the advantages of the above-described Jacobian penalty term to be realised in a flexible way that fits in with any existing training schedules.
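A minimal sketch of such a schedule, in which the Jacobian penalty is only switched on after a predetermined number of training steps (and optionally switched off again later), is given below; the particular step thresholds are placeholders chosen for illustration.

def jacobian_penalty_weight(step, start_step=50000, end_step=None, lam=1.0):
    # Weight applied to the Jacobian penalty at a given training step: zero
    # before start_step (e.g. while training on short GOPs), lam once longer
    # GOPs are introduced, and optionally zero again after end_step.
    if step < start_step:
        return 0.0
    if end_step is not None and step >= end_step:
        return 0.0
    return lam

Within the training loop, the loss would then be formed as, for example, Loss = D + λ_1 R + jacobian_penalty_weight(step) · J_aux.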
Figure 14 illustrates a further example of a flow-residual compression pipeline, such as that of Figure 3, whereby the representation of flow information that the residual decoder receives as input comprises a warped previously decoded image. As this architecture corresponds generally to the flow-residual compression pipeline shown in Figure 3, it uses the same reference numbers for corresponding features. The flow module 1400 is shown to comprise a flow encoder 1401 that produces a latent representation of optical flow information ŷ_{f,t}, which is decoded by a flow decoder 1402 into a reconstructed flow f̂_t, which in turn is used to warp a previously reconstructed image x̂_{t−1} to produce x̂_{t−1,w}, which in turn is fed into the residual decoder 1413 as the representation of optical flow information. Thus the residual decoder neural network 1413 uses a latent representation ŷ_t of the current frame x_t and the warped previously reconstructed image x̂_{t−1,w} to produce the reconstructed current frame x̂_t. Thus, based on this architecture, the Jacobian penalty term described above may be implemented by constructing an auxiliary function that copies the operations of the residual encoder 1411 and/or residual decoder 1413 of the residual part 1410, including using the same inputs as the residual decoder 1413, but where the auxiliary function also returns as an output the latent representation ŷ_t that was one of its inputs, effectively simply passing the latent representation ŷ_t 1412 directly through the function. The Jacobian penalty based on this auxiliary function, with the latent being both an input and an output, has the desired mathematical properties to encourage convergence to a set of weights that produce a network that behaves in a temporally stable manner for sequences of frames. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular
inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable sub-combination. Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that
is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal. The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in
multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book
reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of
client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Claims
2. The method of claim 1, wherein the upsampler comprises a third neural network, and wherein the method comprises updating the parameters of the third neural network based on the evaluated function. 3. The method of claim 2, wherein the downsampler comprises a bilinear or bicubic downsampler. 4. The method of claim 2 or 3, wherein the downsampler comprises a Gaussian blur filter. 5. The method of any of claims 2 to 4, comprising (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. 6. The method of claim 1, wherein the downsampler comprises a fourth neural network, and wherein the method comprises updating the parameters of the fourth neural network based on the evaluated function. 7. The method of claim 5, wherein the upsampler comprises a bilinear or bicubic upsampler. 8. The method of claim 6 or 7, comprising (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of
said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. 9. The method of any of claims 2 to 8, wherein the method comprises entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length. 10. The method of any of claims 1 to 9, wherein the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image is determined based on the output of a fifth neural network acting as a discriminator. 11. The method of any of claims 1 to 10, wherein the said difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image, comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
12. The method of any of claims 1 to 11, wherein the function further comprises a term defining a visual perceptual metric. 13. The method of claim 12, wherein the term defining a visual perceptual metric comprises a MS-SIM metric. 14. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network; evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image; updating the parameters of the third neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. 15. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image, the downsampler comprising a fourth neural network; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image and/or the input image and a previously downsampled input image; updating the parameters of the fourth neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. 16. The method of claim 15, comprising producing the previously downsampled input image by performing bilinear or bicubic downsampling on the input image. 17. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; downsampling the input image with a downsampler; encoding the downsampled input image using a first trained neural network to produce a
latent representation; transmitting the latent representation to a second computer system; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; and upsampling the output image with an upsampler to produce an upsampled output image. 18. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation. 19. The method of claim 18, comprising performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
20. The method of claim 18, comprising performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. 21. The method of claim 19 or 20, wherein the downsampling operation comprises applying one or more convolutional layers with a kernel size based on a downsampling factor, and wherein the convolutional layers are configured to sequentially reduce the spatial dimensions of an input to the series of convolutional layers while increasing the depth or channel dimension of the input. 22. The method of claim 21, wherein the input comprises said input image. 23. The method of claim 21, wherein the input comprises a tensor representation of the input image. 24. The method of claim 19 or 20, wherein the downsampling operation comprises applying one or more convolutional layers configured with a stride equal to the downsampling factor, and wherein the number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor. 25. The method of claim 19 or 20, wherein the upsampling operation comprises applying sequential convolutional layers and upsampling layers, wherein the convolutional layers
are configured to progressively increase the spatial dimensions and decrease the channel dimensions of an input to said layers. 26. The method of claim 25, wherein the input comprises the latent representation. 27. The method of claim 25, wherein the input comprises a tensor representation of the latent representation or the output image. 28. The method of claim 25, wherein the convolutional layers for upsampling are configured with a stride based on an upsampling factor. 29. The method of claim 25, further comprising applying an activation function after each convolutional layer in the upsampling operation. 30. The method of claim 25, wherein the upsampling layers are selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, and are alternated with the convolutional layers. 31. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system;
decoding the optical flow information using a second neural network;and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image; wherein estimating the optical flow information comprises estimating differences between the input image and second image by: applying a first convolution operation on respective pixels of a representation of the input image and/or on respective pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. 32. The method of claim 31, comprising estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation. 33. The method of claim 31, wherein the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image. 34. The method of any of claims 31 to 33, wherein a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image. 35. The method of claim 34, wherein the weights have values distributed according to a sub-Gaussian distribution.
36. The method of claim 34 or 35, wherein the minimum value and/or the maximum value are based on a number of channels of the input image and/or second image, on a kernel size of the first convolution operation, and/or on a pixel radius across which said differences are estimated. 37. The method of any of claims 31 to 36, comprising performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation. 38. The method of any of claims 31 to 37, comprising estimating a difference between an output of the second convolution operation and an output of the first convolution operation. 39. The method of claim 38, wherein said difference comprises an absolute difference. 40. The method of claim 39, wherein said difference defines a cost volume. 41. The method of any of claims 31 to 40, comprising using the optical flow information to warp a representation of the second image. 42. The method of claim 41, comprising estimating a difference between the warped second image and the input image to generate a residual representation of the input image with respect to the warped second image.
43. The method of claim 42, comprising: (i) using a third neural network to encode the residual representation of the input image, (ii) transmitting the encoded residual representation of the input image to the second computer system, (iii) using a fourth neural network to decode the residual representation of the input image, and (iv) using the decoded residual representation of the input image to produce said output image. 44. The method of any of claims 37 to 43, comprising applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. 45. The method of any of claims 31 to 44, wherein a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. 46. The method of any of claims 31 to 45, wherein the first convolution operation is defined by a 1x1 kernel. 47. The method of any of claims 31 to 46, wherein the second convolution operation is defined by a 3x3 kernel. 48. The method of any of claims 44 to 47, wherein the third convolution operation is defined by a 1x1 kernel. 49. The method of any of claims 31 to 48, wherein performing the second convolution operation entangles information associated with respective pixels of the representation of
the input image with information associated with pixels adjacent corresponding pixels in the representation of the second image. 50. The method of any of claims 31 to 49, wherein the first, second and, where present, third convolution operations are performed without grouped convolutions. 51. The method of claim 50, wherein one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein said estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. 52. The method of any of claims 31 to 51, wherein a distribution of pixel values of the input image and of the second image comprises a sparse distribution, and wherein the sparse distribution is incoherent in a spatial domain. 53. A data processing system configured to perform the method of any one of claims 31 to 52. 54. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; wherein estimating the optical flow information comprises estimating differences between
the input image and second image by: applying a first convolution operation on respective pixels of a representation of the input image and/or on respective pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. 55. A method for lossy image or video decoding, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image and based on a compressively encoded cost volume; at a second computer system, receiving optical flow information, the optical flow information being indicative of a difference between a representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image. 56. A data processing apparatus configured to perform the method of claims 54 or 55. 57. A method of estimating a difference between a first image and a second image, the method comprising: performing a first convolution operation on respective pixels of a representation of the
first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by: estimating a compressively encoded cost volume indicative of said differences. 58. The method of claim 57, comprising performing a second convolution operation on an output of the first convolution operation; and estimating a difference between an output of the second convolution operation and the first convolution operation. 59. The method of claim 58, wherein performing the second convolution operation entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent corresponding pixels in the representation of the second image. 60. The method of any of claims 57 to 59, wherein the first convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum value and a maximum value. 61. The method of claim 60, wherein the minimum value and/or the maximum value are based on a number of channels of the input image and/or second image, on a kernel size of the first convolution operation, and/or on a pixel radius across which said differences are estimated. 62. The method of any of claims 57 to 61, wherein said difference comprises an absolute difference. 63. The method of any of claims 57 to 62, wherein said difference defines a cost volume.
64. The method of any of claims 57 to 63, comprising applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. 65. The method of any of claims 57 to 64, wherein a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. 66. The method of any of claims 57 to 65, wherein the first convolution operation is defined by a 1x1 kernel. 67. The method of any of claims 57 to 66, wherein the second convolution operation is defined by a 3x3 kernel. 68. The method of any of claims 57 to 67, wherein the third convolution operation is defined by a 1x1 kernel. 69. The method of any of claims 57 to 68, comprising storing a plurality of said respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein said estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. 70. The method of any of claims 57 to 69, comprising using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches.
71. A data processing apparatus configured to perform the method of claims 57 to 70. 72. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and/or third trained neural networks. 73. The method of claim 72, wherein the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks. 74. The method of claim 73, wherein at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable. 75. The method of any of claims 73 to 74, comprising producing the second variables from the first variables by mapping the first variables to the second variables. 76. The method of claim 75, wherein the mapping is defined by an auxiliary function. 77. The method of claim 76, wherein the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function. 78. The method of claim 77, wherein at least one input of said inputs to the auxiliary function is also an output of the auxiliary function. 79. The method of claim 78, wherein the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space. 80. The method of claim 79, wherein the input space matches the output space. 81. The method of any of claims 76 to 80, wherein the auxiliary function is based on the third neural network.
82. The method of claim 81, wherein the third neural network comprises a residual decoder neural network. 83. The method of any of claims 78 to 82, wherein the at least one input to the auxiliary function that is an output of the auxiliary function comprises said latent representation of the first image. 84. The method of any of claims 72 to 83, comprising weighting the Jacobian penalty term. 85. The method of claim 84, wherein said weighting is based on a difference between the first image and the second image. 86. The method of claim 84 when dependent on claim 73, wherein said weighting is defined by a weighted norm based on a matrix associated with said rate of change. 87. The method of any of claims 73 to 86 when dependent on claim 83, comprising estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change. 88. The method of claim 87, wherein approximating the norm of the matrix comprises making a single sample approximation. 89. The method of any of claims 72 to 88, wherein the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps.
90. The method of claim 89, wherein said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images. 91. A method of performing lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to the method of any of claims 72 to 90. 92. A method of performing lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information,
the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to the method of any of claims 72 to 90. 93. A method of performing lossy image decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to the method of any of claims 72 to 90. 94. A method of performing lossy image or video decoding, the method comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of
one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to the method of any of claims 72 to 90. 95. A data processing apparatus configured to perform the method of any of claims 72 to 94. 96. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 72 to 94. 97. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 72 to 94.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2318468.2A GB202318468D0 (en) | 2023-12-04 | 2023-12-04 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2318468.2 | 2023-12-04 | ||
| GB2400790.8 | 2024-01-22 | ||
| GBGB2400790.8A GB202400790D0 (en) | 2024-01-22 | 2024-01-22 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2404639.3 | 2024-04-02 | ||
| GBGB2404639.3A GB202404639D0 (en) | 2024-04-02 | 2024-04-02 | Method and data processing system for lossy image or video encoding, transmission and decoding |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025119707A1 true WO2025119707A1 (en) | 2025-06-12 |
Family
ID=93841020
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/083641 Pending WO2025119707A1 (en) | 2023-12-04 | 2024-11-26 | Method and data processing system for lossy image or video encoding, transmission and decoding |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025119707A1 (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| WO2022106013A1 (en) * | 2020-11-19 | 2022-05-27 | Huawei Technologies Co., Ltd. | Method for chroma subsampled formats handling in machine-learning-based picture coding |
-
2024
- 2024-11-26 WO PCT/EP2024/083641 patent/WO2025119707A1/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| WO2022106013A1 (en) * | 2020-11-19 | 2022-05-27 | Huawei Technologies Co., Ltd. | Method for chroma subsampled formats handling in machine-learning-based picture coding |
Non-Patent Citations (6)
| Title |
|---|
| Agustsson, E., Minnen, D., Johnston, N., Ballé, J., Hwang, S. J., Toderici, G.: "Scale-space flow for end-to-end optimized video compression", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pages 8503-8512 |
| Ballé, Johannes, et al.: "Variational image compression with a scale hyperprior", arXiv preprint arXiv:1802.01436, 2018 |
| Chen, Li-Heng, et al.: "Estimating the Resize Parameter in End-to-end Learned Image Compression", arXiv.org, Cornell University Library, 26 April 2022 (2022-04-26), XP091209110 * |
| Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., Toderici, G.: "Neural video compression using GANs for detail synthesis and propagation", Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, November 2022, pages 562-578 |
| Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., Toderici, G., Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, November 2022, pages 562-578 |
| Pourreza, R., Cohen, T.: "Extending neural P-frame codecs for B-frame coding", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pages 6680-6689 |
Similar Documents
| Publication | Title |
|---|---|
| US11234006B2 (en) | Training end-to-end video processes |
| TWI834087B (en) | Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product |
| US10701394B1 (en) | Real-time video super-resolution with spatio-temporal networks and motion compensation |
| US10582205B2 (en) | Enhancing visual data using strided convolutions |
| TW202247650A (en) | Implicit image and video compression using machine learning systems |
| US8396313B2 (en) | Image compression and decompression using the PIXON method |
| Cheng et al. | Optimizing image compression via joint learning with denoising |
| US11936866B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| US12113985B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| KR20250078907A (en) | Diffusion-based data compression |
| KR20250002528A (en) | Parallel processing of image regions using neural networks - decoding, post-filtering, and RDOQ |
| GB2548749A (en) | Online training of hierarchical algorithms |
| WO2024170794A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025082896A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding using image comparisons and machine learning |
| WO2025119707A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| Ayyoubzadeh et al. | Lossless compression of mosaic images with convolutional neural network prediction |
| US20250227311A1 (en) | Method and compression framework with post-processing for machine vision |
| WO2025196024A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025168485A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| Ehrlich | The first principles of deep learning and compression |
| WO2025061586A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025088034A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2024246275A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025210218A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025252644A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24820272; Country of ref document: EP; Kind code of ref document: A1 |