WO2025088034A1

WO2025088034A1 - Method and data processing system for lossy image or video encoding, transmission and decoding

Info

Publication number: WO2025088034A1
Application number: PCT/EP2024/080064
Authority: WO
Inventors: Hamza ALAWIYE; Christian ETMANN; Christopher FINLAY; Arsalan ZAFAR
Original assignee: Deep Render Ltd
Current assignee: Deep Render Ltd
Priority date: 2023-10-27
Filing date: 2024-10-24
Publication date: 2025-05-01
Anticipated expiration: 2026-04-27

Abstract

A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a corresponding a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation at a first target image quality of the input image; evaluating a function based on a difference between the output image and a corresponding image having an image quality different to the target image quality; updating the parameters of the first neural network and the second neural network based on the evaluated function to produce a first trained neural network and a second trained neural network.

Description

Method and data processing system for lossy image or video encoding, transmission and decoding BACKGROUND This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding. There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted. To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that is not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AVI are examples of compression processes for image and/or video files. In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations. A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. The most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest are inferred from information derived from temporal correlations. In the realm of lossy video compression in particular, the removal of redundant temporally correlated information in a video sequence is as known inter-frame redundancy. One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality. I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video. P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality. B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decom- pression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neutral networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted. An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference. An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. Figure 3 of which shows an architecture that calculates optical flow with a flow model, UFlow, and encodes the calculated optical flow with a flow encoder, Eflow. SUMMARY According to a first aspect, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a corresponding a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation at a first target image quality of the input image; evaluating a function based on a difference between the output image and a corresponding image having an image quality different to the target image quality; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the target image quality is defined by a target compression rate. Optionally, the difference between the output image and the corresponding image having an image quality different to the target image quality is determined based on the output of a third neural network acting as a discriminator between the output image and the corresponding image. Optionally, the target image quality and the image quality of the corresponding image for a first iteration of said steps are different to the target image quality and the image quality of the corresponding image for a second iteration of said steps. Optionally, the method comprises selecting the target image quality by selecting a regularisation parameter value from a set of regularisation parameter values. Optionally, the function is a loss function comprising a rate term and a distortion term and wherein the regularisation parameter value is associated with the rate term or the distortion term. Optionally, the method comprises: selecting a pair of vectors from a plurality of vector pairs and processing the latent representation using one vector of the selected pair to produce a modified latent representation; processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using the second neural network to produce the output image, wherein each pair of vectors from the plurality of vector pairs is associated with a corresponding one regularisation parameter value of the set of regularisation parameter values. Optionally, said selection of the regularisation parameter value and corresponding pair of vectors comprises a random selection. Optionally, the selection is different for a first iteration of said steps compared to a second iteration of said steps. Optionally, the difference between the output image and the corresponding image is determined based on an output of at least one third neural network of a plurality of third neural networks acting as respective discriminators for different target image qualities. Optionally, the target image quality for a first iteration of said steps is different to a target image quality for a second iteration of said steps, and wherein each target image quality is associated with a corresponding at least one third neural network of the plurality of third neural networks. Optionally, the target image quality for a third iteration of said steps is different to a target image quality of the first and/or second iteration of said steps. Optionally, at least one target image quality corresponds to a copy of the input image. Optionally, at least one of the one or more third neural networks is configured to classify the output image into one or more image quality classes. Optionally, the difference on which the function is based is comprises a classification of the output image into the one or more image quality classes. Optionally, at least one of the one or more third neural networks is configured to score the output image against the corresponding image. Optionally, the difference on which the function is based comprises the score output by the third neural network. Optionally, at least one of the third neural networks comprises a Wasserstein discriminator. Optionally, the method comprises updating the parameters of at least one of the third neural networks based on the evaluated function in a first of said steps. Optionally, the method comprises updating the parameters of the first neural network and the second neural network based on the evaluated function in a second of said steps. Optionally, the target image quality is defined in bits per pixel. Optionally, the method comprises entropy encoding the latent representation into a bitstream having a bit length, and wherein the target image quality is based on the bit length of the bitstream. Optionally, the target image quality is defined by a number of image artefacts in the output image. According to a further aspect, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural networks to produce a latent representations; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation at a first target image quality of the input image; evaluating a function based on a difference between the output image and a previously decoded image, the previously decoded image being an approximation of the input image at a second target image quality; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. According to a further aspect, there is provided a data processing apparatus configured to perform any of the above methods. According to a further aspect, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; selecting a pair of vectors from a plurality of vector pairs and processing the latent representation using one vector of the selected pair to produce a modified latent representation; processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between the input image and the output image, wherein the function comprises a rate term and a distortion term, and wherein one or both of the rate term or distortion term is regularised by a regularisation parameter associated with the selected vector pair; updating the parameters of the first neural network, the second neural network, and the values of the selected vector pair based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network, a second trained neural network, and a learned plurality of vector pairs. Optionally, the selection of the pair of vectors and associated regularisation parameter is a random selection from the plurality of vector pairs and associated regularisation parameters. Optionally, the selection of the pair of vectors and associated regularisation parameter in a first iteration of said steps is different to the selection in a second iteration of said steps. Optionally, the selection of the pair of vectors and associated regularisation parameter in a third iteration of said steps is different to the selection in the first and/or second iteration of said steps. Optionally, each pair of vectors and associated regularisation parameter is associated with a target reconstruction quality of the output image. Optionally, each pair of vectors and associated regularisation parameter is associated with one of a plurality of target bitrates of a video encoding bitrate ladder. Optionally, the method comprises encoding the latent representation using a third neural network to produce a hyper latent representation; selecting a second pair of vectors from a plurality of vector pairs and processing the hyper latent representation using one vector of the selected pair to produce a modified hyper latent representation; processing the modified hyper latent representation using the other vector of the selected vector pair and decoding the processed modified hyper latent representation using a second neural network. According to a further aspect, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; processing the latent representation using one vector of a selected pair of vectors to produce a modified latent representation; transmitting the modified latent representation to a second computer system; and processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at a target image quality associated with the selected vector pair. Optionally, the method comprises transmitting from the first computer system to the second computer system information indicative of the selection of the vector pair. According to a further aspect, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; processing the latent representation using one vector of a selected pair of vectors to produce a modified latent representation; transmitting the modified latent representation to a second computer system. According to a further aspect, there is provided a method for lossy image or video decoding, the method comprising the steps of: receiving a modified latent representation at a second computer system; processing the modified latent representation using a vector of a selected vector pair and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at a target image quality associated with the selected vector pair. According to a further aspect, there is provided A data processing system configured to perform any of the above methods. According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; applying a first wavelet transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; applying a second wavelet transform to the quantised transformed latent representation to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the first wavelet transform and the second wavelet transform comprise Haar transforms. Optionally, the second wavelet transform comprises an inverse of the first wavelet transform. Optionally, quantising the transformed latent representation comprises applying a rounding function to the latent representation. Optionally, updating the parameters of the first neural network and the second neural network based on the evaluated function comprises back-propagating a gradient of the function through the rounding function using straight through estimation. Optionally, using straight through estimation comprises setting an incoming gradient at the rounding function equal to a constant. Optionally, the method comprises: encoding the latent representation using a third neural network to produce a hyper latent representation; applying a third wavelet transform to the hyper latent representation to produce a transformed hyper latent representation; quantising the transformed hyper latent representation; applying a fourth wavelet transform to the quantised transformed hyper latent representation to produce the hyper latent representation decoding the hyper latent representation using a fourth neural network; and using the decoded hyper latent representation to decode the latent representation. Optionally, the fourth wavelet transform comprises an inverse of the third wavelet transform. Optionally, encoding the latent representation using the third neural network comprises encoding the transformed latent representation. Optionally, the method comprises: encoding the hyper latent representation using a fifth neural network to produce a hyper hyper latent representation; applying a fifth wavelet transform to the hyper hyper latent representation to produce a transformed hyper hyper latent representation; quantising the transformed hyper hyper latent representation; applying a sixth wavelet transform to the quantised transformed hyper hyper latent representation to produce the hyper hyper latent representation decoding the hyper hyper latent representation using a sixth neural network; and using the decoded hyper hyper latent representation to decode the hyperlatent representation. Optionally, the sixth wavelet transform comprises an inverse of the fifth wavelet transform. Optionally, the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first wavelet transform to the latent representation. Optionally, the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation. According to a further aspect, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first wavelet transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second wavelet transform to the quantised transformed latent representation to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. Optionally, the first wavelet transform and second wavelet transform comprise Haar transforms. Optionally, the second wavelet transform comprises an inverse of the first wavelet transform. According to a further aspect, there is provided a data processing apparatus and/or system configured to perform any of the above methods. According to a further aspect, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; applying a first orthogonal transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; applying a second orthogonal transform to the quantised transformed latent representation to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the second orthogonal transform comprises an inverse of the first orthogonal transform. Optionally, the first and second orthogonal transforms are respectively defined by an orthogonal matrix, and wherein the method further comprises updating the values of each said orthogonal matrix based on the evaluated function; and repeating said steps to produce said learned values of each said orthogonal matrix. Optionally, the first and second orthogonal transforms are respectively defined by an orthogonal matrix comprising random values. Optionally, one or more values of each said orthogonal matrix is defined by a dependency on the input image. Optionally, one or more values of each said orthogonal matrix is independent of the input image. Optionally, the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first orthogonal transform to the latent representation. Optionally, the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation. According to a further aspect, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first orthogonal transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second orthogonal transform to the quantised transformed latent representation to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. According to a further aspect, there is provided a data processing apparatus and/or system configured to perform any of the above methods. According to a further aspect, there is provided A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; applying a first invertible transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; applying a second invertible transform to the quantised transformed latent representation to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the second invertible transform comprises an inverse of the first orthogonal transform. Optionally, the first and second invertible transforms are respectively defined by an invertible matrix, and wherein the method further comprises updating the values of each said invertible matrix based on the evaluated function; and repeating said steps to produce said learned values of each said invertible matrix Optionally, one or more values of each said invertible matrix is defined by a dependency on the input image. Optionally, one or more values of each said invertible matrix is independent of the input image. Optionally, the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first invertible transform to the latent representation. Optionally, the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation. Accordng to a further aspect, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first invertible transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second invertible transform to the quantised transformed latent representation to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. According to a further aspect, there is provided a data processing apparatus configured to perform any of the above methods. BRIEF DESCRIPTION OF THE DRAWINGS Aspects of the invention will now be described by way of examples, with reference to the following figures in which: Figure 1 illustrates an example of an image or video compression, transmission and decom- pression pipeline. Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network. Figure 3 illustrates an example of a video compression, transmission and decompression pipeline. Figure 4 illustrates an example of a video compression, transmission and decompression system. Figure 4 illustrates an example of a video compression, transmission and decompression system. Figure 5 illustrates a rate distortion curve according to the present disclosure. Figure 6 illustrates a set of discriminators overlaid on a rate distortion curve according to the present disclosure. Figure 7 illustrates a set of discriminators overlaid on a rate distortion curve according to the present disclosure. Figure 8 illustrates a set of discriminators overlaid on a rate distortion curve according to the present disclosure. Figure 9 illustrates a single discriminator overlaid on a rate distortion curve according to the present disclosure. Figure 10 illustrates a single discriminator overlaid on a rate distortion curve according to the present disclosure. Figure 11 illustrates a single discriminator overlaid on a rate distortion curve according to the present disclosure. Figure 12 illustrates a single discriminator overlaid on a rate distortion curve according to the present disclosure. Figure 13 illustrates an example of a video compression, transmission and decompression pipeline. Figure 14 illustrates an example of a video compression, transmission and decompression pipeline. Figure 15 illustrates a quantisation residual distribution plot. Figure 16 illustrates a quantisation residual distribution plot. Figure 17 illustrates an example of an image or video compression, transmission and decom- pression pipeline. Figure 18 illustrates an example of an image or video compression, transmission and decom- pression pipeline. Figure 19a illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 19b illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 20 illustrates loss curves of training experiments performed according to the present disclosure. Figure 21 illustrates loss curves of training experiments performed according to the present disclosure. DETAILED DESCRIPTION OF THE DRAWINGS Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as ^^. The data representing the image may be stored in a tensor of dimensions ^^ × ^^ × ^^, where ^^ represents the height of the image, ^^ represents the width of the image and ^^ represents the number of channels of the image. Each ^^ × ^^ data point of the image represents a pixel value of the image at the corresponding location. Each channel ^^ of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMKY and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video. The output image may differ from the input image and may be represented by ˆ. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network. Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation. AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network. Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer. Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network. To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient descent of the loss function may be used to update the parameters of the neural network using the gradients ^^ ^^/ ^^ ^^ of the loss function. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, where simultaneous training with back-propagation through each element or the whole network architecture may be referred to as end-to-end, learned image or video compression. Unlike in traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features. It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function. The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data. Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution. The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge. The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting. It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyper parameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered be “trained” is envisaged to be case specific and depend on, for example, on a number of epochs, a plateauing of any further learning, or some other metric and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately. In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by ^^ ^^ ^^ ^^ = ^^ + ^^ ∗ ^^, where ^^ is the distortion function, ^^ is a weighting factor, and ^^ is the rate loss. ^^ may be referred to as a lagrange multiplier. The langrange multiplier provides as weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network. In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/ xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/). An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function ^^ _^^ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation ^^, resulting in a quantized latent. The quantisation process transforms the continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function. In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network. In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function ^^ _^^ acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder. Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the gaussian or Laplacian distributions, in other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference. Something which is closely linked to the entropy encoding of the latent distribution and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or one other quantisation models may be used. It is also envisaged that different quantisation models may be used for during evaluation of different term of the loss function. For example, noise quantisation may used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term. In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving optimal rate and distortion trade off. Returning to the compression pipeline more generally, the systems described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline. As described above in the context of quantisation, the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder ^^ ^ℎ and a trained neural networ ℎ ^_^ k 125 acting as a hyper-decoder ^^ _^^ . An example of such a

is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-decoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by ^^^ℎ to produce a quantized hyper-latent. The quantization process 145 characterised by ^^^ℎ may be the same as the quantisation process 140 characterised by ^^ discussed above. In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-decoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150. Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper latent may be converted to a residual value before the entropy encoding process 150,155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper latent. The residual values may also be normalised. To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neutral network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake. When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neutral networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video. The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time. The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map. An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may be also be considered as a single, end-to-end pipeline. As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image ^^₀ is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in ^^₀ indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image ˆ₀ which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image ˆ₀ to be available for later use by components of both the encoding and decoding sides of the pipeline. In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image ^^ _^^ that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images ˆ _^^−1 from an earlier frame t-1. As described above, the previously reconstructed ˆ _^^−1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides. At the encode side, previously reconstructed images may be used for generating a flow maps containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed ^^ _^^ and the previously reconstructed image from an earlier frame ˆ _^^−1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs ˆ _^^−1 and ^^ _^^ , which is indicative of inter-frame movement of pixels or pixel groups between ˆ _^^−1 and ^^ _^^ . The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map ^^ . The reconstructed flow map ^^ is applied to the previously reconstructed image ˆ _^^−1 to generate a warped image ˆ _{^^−1, ^^}. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image ˆ _{^^−1, ^^} is a prediction of how the previously reconstructed image ˆ _^^−1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of ^^ _^^ and ˆ _^^−1. As with the I-frame, the reconstructed flow map ^^ and corresponding warped image ˆ _{^^−1, ^^} may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides. In the example of Figure 3, both the image being compressed ^^ _^^ and the ˆ _{^^−1, ^^} are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input mage ^^ _^^ and the warped image ˆ _{^^−1, ^^}. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map ^^ from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between ^^ _^^ and the ˆ _{^^−1, ^^} and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function. Finally, on the decode side, the residual map ^^ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image ˆ _^^ which is a reconstruction of image ^^ _^^ and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image ˆ _^^ can then be used to process the next frame. That is, it can be used to compress, transmit and decompress ^^ _^^+1, and so on until an entire video stream or chunk of a video stream has been processed. Thus, for a block of video frames comprising an I-frame and ^^ subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their latent quantised, entropy encoded latent representations and so on, of those networks as applicable. Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689). The above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame ˆ _^^−1) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame. Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, Geforce 3090, Gefore 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU (or NPU or TPU as applicable) of the first and second devices respectively. That is, compute for performing both encoding and decoding are performed by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of first and second devices 401, 402 may comprise, for example, a Qualcomm Snapdragon CPU. The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene 404. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401, for example via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications. The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 402 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402. It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU (or NPU or TPU) compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature. For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations. A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above. Concept 1: Training with image quality discriminators As has been described above, one known approach to training the neural networks of a learned compression pipeline is to use a GAN setup whereby the compression pipeline acts as the generator and a further neural network is provided to act as a discriminator whose task is to discriminate (i.e. distinguish) between reconstructed images produced by the compression pipeline and ground truth images. The compression pipeline during training tries to learn to "trick" the discriminator while the discriminator tries to distinguish as correctly as it can. An example of this approach is described in WO2023/118317A1, which is incorporated herein in its entirety by reference. GAN-type training approaches such as this work well where neural networks of the compression pipeline are permitted to compress the input image(s) into bitstreams transmitted at bit rates typically associated with the upper end of a typical bitrate ladder. It will be appreciated that bitrate ladders are widely used in the image and video streaming industry to efficiently manage the streaming of compressed image and video data to end users at constant bit rates given a set of hardware and/or commercial constraints. A bitrate ladder is typically made up of a predetermined number of bitrates at which a video is to be streamed to an end user whereby each rung may be determined, for example using a convex hull method, or some other method, and/or be set to take commercial considerations into account such as end user device, resolution, internet speed, and so on. An example bitrate ladder may have, for example 5, 10, 15, 25 different bitrates between, for example 200 kbps at the lower image and video data quality end of the ladder, and 2600 kbps at the higher image and video data quality end of the ladder. Accordingly, if a learned compression pipeline is trained to compress the input image(s) into a bitstream corresponding with a target bitrate corresponding to a high quality end of a bitrate ladder, e.g. above 1000 kbps for example up to 2600kbps, the training is stable and converges well. This is at least in part because the neural networks are generally able to reconstruct the image well enough to trick the discriminator they are competing against as part of a GAN-training setup. However, the inventors have found that GAN-type training is problematic as it typically fails to learn how to compress and reconstruct images well when the target bitrate is towards the lower end of a bitrate ladder, e.g. below 1000kbps for example around 200kbps. This is because the reconstructed image quality is so low and filled with artefacts that the discriminator has no trouble correctly distinguishing reconstructed images from ground truth images almost all of the time. This issue can be understood through the analogy of a human finding it trivial to distinguish between a heavily compressed, artefact filled, low resolution image streamed at 200 kbps, and its high resolution, HD ground truth counterpart streamed at 2600 kbps. Effectively the human would always be able to correctly identify the reconstructed image from the ground truth image. The inventors have found that modifying known GAN-type approaches, such as that of WO2023/118317A, which is incorporated herein in its entirety by reference, by making the discriminator discriminate between reconstructed images of different image qualities (e.g. defined by different bits per pixel, different bit rates, and so on), rather than between reconstructed images and ground truth images, allows the generator (i.e. the compression pipeline) to more easily "trick" the discriminator during training as its task is not to reconstruct an image as close to the ground truth image as possible but to reconstruct an image as close to the slightly higher quality image. This task is harder for the discriminator, and thus, relatively easier for the generator and means the generator and discriminator are more equally balanced in the GAN-setup. This modified approach can be understood by the analogy of a human finding it far more difficult to distinguish between an image compressed and streamed at 200 kbps and an image compressed and streamed at 210 kbps than he might find distinguishing between a 200 kbps image and a 2600 kbps image. In this scenario, the human may frequently be tricked and incorrectly identify which image is which. The same applies to the discriminator network which finds it far more difficult to distinguish between different levels of compression than it does between compressed images and ground truth images. The generator (i.e. the compression pipeline) accordingly learns more easily and training becomes stable and converges successfully after a predetermined number of epochs, even for very low target bitrates such as those associated with the lower end of a bitrate ladder. A particular advantage of this approach is that it allows the compression pipeline to be trained specifically for predetermined image qualities (e.g. defined in bitrates or bits per pixel, and so on). For example, if a video on demand provider uses a 5 rung ladder of 2500 kbps, 1600 kbps, 1000 kbps, 600 kbps, 400 kbps, 200 kbps, the present method allows a compression pipeline to be trained for each rung using the same method, without having training stability and convergence issues for the 600, 400 and 200 kbps rungs. This is explained in more detail below by reference to the Figures. Figure 5 illustrates a rate distortion curve 500 of a learned compression pipeline such as that shown in Figures 1-4. Figures 6 to 8 illustrate a set of discriminators overlayed onto the rate distortion curve of Figure 5. Figures 9 to 12 illustrate a single discriminator overlayed on to the rate distortion curve of Figure 5. The black dots indicate where an output reconstructed image lies on the rate distortion curve. Generally, an output image with a low rate (i.e. low bits per pixel, low bitstream size, low kbps) will have a large number of artefacts and accordingly have a high distortion whereas an output image with a high rate (i.e. high bits per pxiel, high bistream size, high kbps and so on) will have a low number of artefacts and accordingly have a low distortion. The goal of training in the present disclosure is to teach a compression network to reconstruct images that have been compressed to the lower rates to be reconstructed to look like images that have been compressed to the higher rates i.e. have fewer distortions despite having low bits per pixel, low bitstream size, low kbps and so on). Considering first Figure 5, a low rate means better compression (i.e. smaller file size that allows for lower bitrates during data transmission) but comes at the cost of increased distortion which manifests itself in the form of artefacts, reduced image fidelity and so on. In contrast, a high rate means less compression (i.e. bigger file size that requires higher bitrates during data transmission) but it comes with the advantage of reduced distortion. Figure 5 illustrates different points (black dots) on the rate distortion curve at ^^₁, ^^₂, ^^₃, ^^₄, ^^₅, ^^ _{^^ ^^} . At the extreme ends of the curve are ^^₁ and ^^ _{^^ ^^} . ^^₁ corresponds to a heavily compressed input that, when reconstructed during decompression, will contain many artefacts and image inaccuracies whereas ^^ _{^^ ^^} corresponds to a perfect reconstruction of the input that has a large file size but is indistinguishable over the input image. ^^₂, ^^₃, ^^₄, ^^₅ have compression rates between these two extremes. Consider an existing GAN-type training approach with a low target bitrate corresponding to where ^^₁ is on the rate distortion curve. Here, the discriminator’s task of distinguishing between an image at ^^₁ and a ground truth image at ^^ _{^^ ^^} is easy. The input at ^^₁ is filled with artefacts, image inaccuracies and so on. In contrast, the ground truth image at ^^ _{^^ ^^} will have no artefacts. Given how easy the discriminator’s task is, the generator (i.e. the compression pipeline) is unable to "trick" the discriminator and accordingly is unable to learn to reconstruct images that are close in quality to the ^^ _{^^ ^^} images. Now consider the modified approach of the present disclosure. Instead of discriminating between ^^₁ and ^^ _{^^ ^^} , the discriminator’s task is changed to discriminate between images that are much closer on the rate distortion curve to each other, for example ^^₁ and ^^₂, or ^^₁ and ^^₃ and so on. Images at these lower compression rates are all likely to contain artefacts and inaccuracies and it is much more difficult for the discriminator to distinguish between them. Effectively, the generator starting at ^^₁ is now trying to learn to reconstruct images that look as close to images reconstructed at ^^₂ as possible by trying to trick the discriminator that it has produced an image that looks as good as an ^^₂ even though it is using a lower rate ^^₁. The intuition here being that if the generator can be trained to produce higher quality looking images at smaller compression rates ^^ then the generator network ends up being better at compressing and reconstructing images. It is envisaged this approach can be generalised to any number of different rate pairs on the rate distortion curve and that each rate pair can be chained together all the way up the rate distortion curve to the ground truth ^^ _{^^ ^^} , for example by using a separate discriminator for each pair as shown in Figure 6, or using a single discriminator for all reconstruction quality pairs in the set ^^ _{^^, ^^} as shown in Figure 7. In the case where there are multiple discriminators, a first discriminator may act to discriminate between ^^₁ and ^^₂, a second discriminator may act to discriminate between ^^₂ and ^^₃, a third discriminator may act to discriminate between ^^₃ and ^^₄, a fourth discriminator may act to discriminate between ^^₄ and ^^₅ and a fifth discriminator may act to discriminate between ^^₅ and ^^ _{^^ ^^} , and so on. In contrast, when there is a single discriminator, it may be trained to distinguish or give a score to a given reconstructed image indicative of what ^^ of a reconstruction quality pair in the set ^^ _{^^, ^^} the reconstruction is likely to be associated with. Effectively, when there is a single discriminator, the task changes from a classification task of identifying the probability of whether or not a reconstructed image belongs to the class ^^ _^^ or ^^ _^^ and instead becomes a task of giving the reconstructed image a score indicative of where along the rate distortion curve the reconstruction belongs. These single or separate discriminator neural networks may each be trained on a turn by turn basis against the generator (i.e. the compression pipeline), or in a standalone manner. In the case of multiple discriminators, a number of different approaches are envisaged. For example, in one approach, for a first number of training iterations the generator is set to be competing first with the first discriminator associated with the a first quality segment, e.g. the lowest or highest reconstruction quality segment of the rate distortion curve until it is "competent" at being able to beat the first discriminator, before it moves on to competing with the next discriminator and so on up the chain of quality discriminators. For example, in general terms, this approach gives the generator a helping hand to work its way along the early, lower quality reconstruction segment of the rate distortion curve because the discriminators corresponding to the lower rate jumps 501a, 501b, 501c (e.g. between ^^₁ and ^^₂ and between ^^₂ and ^^₃) are easier to trick. By the time the generator has to trick the more difficult discriminators at the higher rate jumps 501d, 501e (e.g. between ^^₄ and ^^₅, and between ^^₅ and ^^ _{^^ ^^} , it is already a "competent" student and is already sometimes able to trick the more difficult discriminator "teachers". In another approach, the generator in each training step is competing against a randomly selected discriminator associated with a corresponding ^^ _{^^, ^^}. Whilst this approach does mean that the generator may struggle in some training steps when the randomly selected discriminator and associated ^^ _{^^, ^^} is a higher quality discriminator and image reconstruction quality, it does give the generator a more complete picture of its task at the higher quality ^^ _{^^, ^^} segments of the rate distortion curve e.g. towards ^^₄, ^^₅, and ^^ _{^^ ^^} . As training goes on, the inventors have found that giving the generator this indication of what its final task entails even at the early stages of training improves the ability of the generator to trick the more difficult discriminators as training goes on. Finally, in the case where a single discriminator is provided it is envisaged that the random approach will also be used. That is, randomly select ^^ _{^^, ^^} and allow the discriminator to score the reconstructed image. More specific details will now be provided. For the purpose of describing the examples of this method, ^^ _^^ may denote a compression pipeline such as that described above, where ^^ indicates the set of trainable parameters of the encoder and decoder, and optionally any other elements such as a hyperencoder, hyperdecoder, flow encoder and decoder, residual encoder and decoder, quantiser and so on. Let’s also denote by ℎ _^^ a discriminator with parameters ^^.

As mentioned above, a known GAN-approach is to apply the discriminator only on the real image ^^ and reconstructed ("fake") image ˆ. That is: ^^_real = ℎ _^^ ( ^^) and ^^_fake = ℎ _^^ ( ˆ).

Modifying this approach in accordance with the example of using multiple discrinators as described above, we now provide a set of discriminators, each associated with a discriminating between different quality image reconstructions. Let ^^ denote our fixed set of discriminators with adjustable parameters, where K is the set of image reconstruction quality pairs ^^ = {( ^^₁, ^^₂), ( ^^₂, ^^₃), ..., ( ^^ _^^, ^^ _^^)} and ^^ = {ℎ¹ ^_^ ^,2 2,3 ^{^^, ^^} 1 , ℎ ^^₂ , ... , ℎ _{^^ ^^} } where ^^₁, ^^₂, ... , ^^ _^^ are the respective parameters. These discriminators do not necessarily have the

E.g. they all could have different architectures, or some of the architectures might coincide and some might not. Let’s assume ˆ _^^1 is an image output by the generator for a target bitrate at ^^₁ on the rate distortion curve of Figure 5, and ˆ _^^2 is an image output by the generator for a target bitrate at ^^₂ on the rate distortion curve, and that we can generalise this using the notation ˆ _{^^ ^^} and ˆ _{^^ ^^} where ^^ and ^^ corresponds to a given target bitrate or bits per pixel (or some other image quality metric) on the rate distortion curve. For each discriminator ℎ _{^^ ^^} , we can calculate the probabilities of ˆ _{^^ ^^} and ˆ _{^^ ^^}: ^^_r ^{^^} ^^, ^^ e_al =ℎ ^^ _^^ ( ˆ _{^^ ^^}) (1)

We use the terms "real" and "fake" to facilitate the comparison and distinction over a conventional GAN approach but note that the "real" image in this case corresponds to the higher rate (i.e. better quality reconstructed) output image at ^^ _^^ on the rate distortion curve and the "fake" image in this case corresponds to the low bit rate (i.e. lower quality reconstructed) image at ^^ _^^ on the rate distortion curve. As explained above, effectively the discriminator’s task here is to distinguish between the lower quality, low rate image and the higher quality image high rate image, while the generator’s task is to produce reconstructed images that are as close to ^^ _^^ on the rate distortion curve as it can to trick the discriminator as often as possible. In one example, it is envisaged that there may be one discriminator ℎ _^ ^{^^} ^^{, ^^} ^_^ with corresponding parameters ^^ _^^ in the set ^^ for each ^^ _^^, ^^ _^^ pair. Alternatively, as described above, the whole set ^^ may be replaced by a single discriminator such as a Wasserstein discriminator (i.e. a Wasserstein GAN) and the task moves away from a classification task to become a scoring task, as will be explained in more detail later. Considering the case of a set of discriminators ^^ first, during each round of training, the generator takes turns competing against one discriminator of the set ^^. This may be the same discriminator for a predetermined number of rounds or until a predetermined success metric (e.g. based on the success rate against that discriminator) is met before moving onto a next discriminator of the set ^^, or it may be a different discriminator of ^^ randomly selected for each training step. Algorithm 1 illustratively shows example pseudocode that outlines one training step of the generator and the set of discriminators. It assumes the existence of three functions ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^, ^^ ^^ ^^ ^^ and ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^. ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ uses backpropagation to compute gradients of all parameters with re- spect to the loss. ^^ ^^ ^^ ^^ performs an optimization step with the selected optimizer. The function ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ensures no gradients are tracked for the function executed. The function ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ refers to how deep learning frameworks such as PyTorch and Tensorflow V2 construct a computational graph that is used for the back-propagation operation. This means that producing ˆ with or without gradients impacts whether or not ^^ _^^ will be part of the computational graph, and therefore whether or not gradients can flow through the generator component. Therefore whether ˆ is produced from ^^ _^^ , with or without gradients matters, for the back-propagation and optimizer update step.

Algorithm 1 Example pseudocode that outlines one training step of the generator ^^ _^^ (i.e. compression pipeline) and the set of discriminators ^^ = {ℎ¹ ^_^1 , ℎ² ^_^2 , ... , ℎ ^{^} ^^{^} ^_^^ }, each for discriminating between reconstructed images ˆ _{^^ ^^} and ˆ _{^^ ^^} of different ^^ _^^, ^^ _^^) of a set ^^ = {( ^^₁, ^^₂), ( ^^₂, ^^₃), ..., ( ^^ _^^, ^^ _^^)}.

Inputs: Input image: ^^ Generator (variable rate compression network): ^^ _{^^ ^^, ^^ , ^^} Generator optimizer: ^^ ^^ ^^ _{^^ ^^} Set of discriminator networks: ^^ = {ℎ¹ ^_^1 , ℎ² ^_^2 , ... , ℎ ^{^} ^^{^} ^_^^ } Set of discriminator optimizers: { ^^ ^^ ^^_ℎ1 ^_^1 , ^^ ^^ ^^_ℎ2 ^_^2 , ... , ^^ ^^ ^^_ℎ ^^ ^_{^ ^^} } Classification loss of discriminator (for "real" i.e. higher quality reconstructions ^^ _^^): L_discr,real Classification loss of discriminator (for "predicted" i.e. lower quality reconstructions ^^ _^^): L_discr,pred Classification loss of generator (for "predicted" i.e. lower quality reconstructions ^^ _^^): L_gen Optional additional loss for generator: L_add Reconstruction: ^^ _{^^, ^^} ← ^^ ^^ ^^ ^^ ^^ ^^( ^^ _{^^, ^^}) ˆ _{^^ ^^} ← ^^ _{^^ ^^ , ^^} ( ^^) ˆ _{^^ ^^} ← ^^ _{^^ ^^ , ^^} ( ^^) Discriminator training: ˆ _{^^ ^^ ,nograd} ← ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^( ˆ _{^^ ^^} ) ^^_d ^{^^} i_scr,real ← ℎ ^{^} ^^{^} ^_^^ ( ˆ _{^^ ^^} )

Generator training: ^^_{gen, pred} ← ℎ ^{^} ^^{^} ^_^^ ( ˆ _{^^ ^^} ) L_adv ← L_gen ( ^^_{gen, pred}) L ← L_adv + L_add ( ^^, ˆ _{^^ ^^} ) ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^(L) ^^ ^^ ^^ ^^( ^^ ^^ ^^ _{^^ ^^} ) As shown in algorithm 1, a set of discriminators ^^ is provided, each for discriminating between reconstructed images ˆ _{^^ ^^} and ˆ _{^^ ^^} of different reconstruction quality pairs ( ^^ _^^, ^^ _^^) of a set ^^ . As an initial training step, a training input image ^^ is provided which is the ground truth image. The generator of algorithm 1 comprises a learned compression pipeline which is configurable as a variable rate compression network having one or more adjustable parameters having values associated with a given compression rate ^^. That is, the generator ^^ _{^^ ^^→ ^^, ^^} is configured to produce output images targeted to have reconstruction qualities anywhere from a lowest target reconstruction quality ˆ _^^1 up to a highest target reconstruction quality ˆ _{^^ ^^ ^^} from an input image ^^, and any reconstruction qualities in between. Here ^^ ^^ indicates a reconstruction quality equal to or indistinguishable from the input ground truth image and 1 indicates a lowest quality image which may go down to an arbitrarily low reconstruction quality such as an output image comprising structureless noise. The target reconstruction quality of the variable rate compression network may be set manually or automatically such as during training jointly with the generator or discriminator as will be explained later herein in the section discussion variable rate compression networks. It is noted that the compression rates may be defined for example in bits per pixel, bitstream size in bits, transmission rate in kbps, and so on may be considered a proxy for image reconstruction quality. More specifically, if the generator is able to trick the discriminator that it has produced an image that looks like a higher quality reconstruction (i.e. bigger bits per pixel, bigger bitstream size, higher transmission rate) but actually only used the compression rate of a lower quality reconstruction then the generator is for all intents and purposes compressing the input image to a smaller file size while still producing reconstructions that are as good as a higher quality image compressed to a larger file size. Returning to algorithm 1, after initialising the generator and receiving the training input image, a random ^^ _{^^, ^^} is sampled from ^^ . That is, a pair of target reconstruction qualities ^^, ^^ is randomly chosen from ^^ and passed to the generator to use as its target reconstruction qualities to produce output reconstructed images ˆ _{^^ ^^} and ˆ _{^^ ^^} . The reconstructed images ˆ _{^^ ^^} and ˆ _{^^ ^^} are then passed to the discriminator associated with the randomly selected ^^ for a first training step. That is, ℎ _^ ^{^^} ^_^^ is applied to ˆ _{^^ ^^} and ˆ _{^^ ^^} to obtain the probabilities for each that they are the "real" (i.e. higher quality reconstruction ^^ _^^) or "fake" (i.e. the lower quality reconstruction ^^ _^^). That is, ^^_d ^{^^} i_scr,real and ^^ ^^ d_iscr,fake are obtained for the reconstructed images. From these probabilities, the discriminator loss L_d ^{^^} i_scr is estimated, backpropagation is performed, and the weights of the discriminator updated with a discriminator optimiser. Note that for the discriminator training step, we do not want to track the gradients of the lower reconstruction quality ^^ _^^ for backpropagation so the ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ function is applied to ˆ _{^^ ^^} before the associated loss is estimated. Note for completeness that the random selection of reconstruction quality pairs from ^^ are envisaged to be neighbouring pairs of reconstruction qualities along the rate distortion curve, e.g. ( ^^₁, ^^₂), ( ^^₂, ^^₃), and so on. However, they may also be non-neighbouring and/or overlapping pairs covering a longer segment along the rate distortion curve, e.g. ^^₁, ^^₃), ( ^^₂, ^^₅), ... and so on. Next, for the corresponding generator training step, we apply the discriminator ℎ _^ ^{^^} ^_^^ to ˆ _{^^ ^^} to obtain the probability that it is lower reconstruction quality ^^ _^^, from which adversarial

loss L_adv is estimated. Optionally, an additional loss may be combined with the adversarial loss such as a distortion loss or a rate loss or some other loss term to obtain the overall loss L. Backpropagation is then performed and the optimiser function is applied to update the weights of the generator network. The above process is then repeated but each time the random sampling of ^^ means that each training step will be associated with a different pair of reconstruction qualities and associated one of the discriminators. Training may then be halted when a predetermined training criteria is met, for example a predetermined number of epochs have been completed, or the training and/or validation loss stops decreasing, and so on. Note also that only the discriminator gets to see both the reconstructions from the higher and lower quality settings. The adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting whereby the generator is trying to learn to make that lower quality reconstruction be good enough to "trick" the discriminator that it is a higher quality reconstruction. Figure 6 illustratively shows the above approach applied 600 to the rate distortion curve of Figure 5. More specifically, a set of discriminators ^^ is provided with one discriminator associated with each segment between neighbouring reconstruction qualities. That is, there are five discriminators in the set ^^, namely ℎ_1,2, ℎ_2,3, ℎ_3,4, ℎ_4,5, ℎ_{5, ^^ ^^} , each associated with a respective image reconstruction quality

501a, 501b, 501c, 501d, 501e of the rate distortion curve between ^^ _^^ and ^^ _^^. These segments may be based on immediately neighbouring ^^ _^^ and ^^ _^^ pairs (e.g. ^^₁, ^^₂) or based on wider segments of the curve (e.g. ^^₁, ^^₃) and so on. Figure 7 illustratively shows a first training iteration 700 using the above approach whereby the randomly sampled ^^ corresponds to the quality pair ( ^^₁, ^^₂). The associated discrimina- tor is then ℎ_1,2( ˆ _^^1 , ˆ _^^2). Following algorithm 1, the discriminator outputs the classification probabilities used to calculate the discriminator loss, in this case ^^ ^^ ^^ ^^ _{^^ ^^ ^^ ^^:( ^^1, ^^2)} which is used to update the weights of the discriminator, and the classification probability for the generator is used to calculate the generator loss ^^ ^^ ^^ ^^ _{^^ ^^ ^^:( ^^1)} , and used to update the weights of the generator. Note that, as described above, the discriminator gets to see both the reconstructions from the higher and lower quality settings so its loss is based on both ^^₁ and ^^₂ (i.e. did it guess "real" and "fake" correctly) whereas the adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting so is based only on ^^₁ (i.e. was it able to reproduce an image that was able to trick the discriminator). Figure 8 illustratively shows a second training iteration 800 using the above approach whereby the randomly sampled ^^ corresponds to the quality pair ( ^^₃, ^^₄). The associated discriminator is then ℎ_3,4( ˆ _^^3 , ˆ _^^4). Following algorithm 1, the discriminator outputs the probabilities used to calculate the loss for the discriminator, in this case ^^ ^^ ^^ ^^ _{^^ ^^ ^^ ^^:( ^^3, ^^4)} which is used to update the weights of the discriminator, and probability used to calculate the loss for the generator ^^ ^^ ^^ ^^ _{^^ ^^ ^^:( ^^3)} , used to update the weights of the generator. Note that, as described above, the discriminator gets to see both the reconstructions from the higher and lower quality settings so its loss is based on both ^^₃ and ^^₄ (i.e. did it guess "real" and "fake" correctly) whereas the adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting so is based only on ^^₃ (i.e. was it able to reproduce an image that was able to trick the discriminator). Whilst not shown, further training iterations are envisaged to be performed, each time sampling randomly from ^^. As described above, the inventors have found that this discriminator quality ladder approach allows the generator to learn how better to produce higher quality reconstructions even at lower target rates. That is, it learns to reconstruct higher quality images (i.e. with fewer distortions) at lower target rates. For example, an ^^₁ image (i.e. low bits per pixel, bitstream size, kbps and so on) will look as good as ^^₂ image, and so on. In the extreme case, this laddered approach allows an ^^₁ that is compressed to arbitrarily low bits per pixel, bistream size, kbps and so on, to look as good as the ground truth image. Algorithm 2 illustrates example pseudocode that outlines one training step of the gener- ator and a single discriminator (i.e. the set of discriminators is replaced with a single discriminator), for example a Wasserstein discriminator. The single discriminator approach has a number of advantages over themultiple discriminator approach. For example, it is more easily maintained in a code base, it also changes the classification task of the multiple discriminators (i.e. each segment has an associated 0 to 1 probability that a reconstructed image is real or fake within that segment) into a scoring task which allows reconstruted images to be compared across multiple segments, when the score intervals are effectively continuous and can be learned by the single discriminator e.g. a Wasserstein discriminator.

Algorithm 2 Example pseudocode that outlines one training step of the generator ^^ _^^ (i.e. compression pipeline) and a single discriminator ℎ _^^, for scoring reconstructed images ˆ _{^^ ^^} and ˆ _{^^ ^^} of different reconstruction quality pairs ( ^^ _^^, ^^ _^^) of a set ^^ = {( ^^₁, ^^₂), ( ^^₂, ^^₃), ..., ( ^^ _^^, ^^ _^^)}. Inputs: Input image: ^^

Generator (variable rate compression network): ^^ _{^^ ^^→ ^^ , ^^} Generator optimizer: ^^ ^^ ^^ _{^^ ^^} Discriminator network: ℎ _^^ Discriminator

^^ ^^ ^^_{ℎ ^^} Classification loss of discriminator (for "real" i.e. higher quality reconstructions ^^ _^^): L_discr,real Classification loss of discriminator (for "predicted" i.e. lower quality

^^ _^^): L_discr,pred Classification loss of generator (for "predicted" i.e. lower quality reconstructions ^^ _^^): L_gen Optional additional loss for generator: L_add Reconstruction: ^^ _{^^, ^^} ← ^^ ^^ ^^ ^^ ^^ ^^( ^^ _{^^, ^^}) ˆ _{^^ ^^} ← ^^ _{^^ ^^ , ^^} ( ^^) ˆ _{^^ ^^} ← ^^ _{^^ ^^ , ^^} ( ^^) Discriminator training: ˆ _{^^ ^^ ,nograd} ← ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^( ˆ _{^^ ^^} ) ^^_d ^{^^} i_scr,real ← ℎ _^^ ( ˆ _{^^ ^^} ) ^^_d ^{^^} i_scr,fake ← ℎ _^^ ( ˆ _{^^ ^^ ,nograd})

Generator training: ^^_{gen, pred} ← ℎ _^^ ( ˆ _{^^ ^^} ) p_red)

L ← L_adv + L_add ( ^^, ˆ _{^^ ^^} ) 62 ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^ ^^(L) ^^ ^^ ^^ ^^( ^^ ^^ ^^ _{^^ ^^} ) The steps of algorithm 2 correspond with those of algorithm 1 except now there is a single discriminator, for example a Wasserstein discriminator, instead of a set of discriminators D. For brevity, the steps of algorithm 2 that are identical to algorithm 1 will not be repeated here but it is noted that the task of the discriminator is now not simply to classify the output of the generator into a "real" or "fake" classes ^^ _^^ and ^^ _^^ of the randomly sampled ^^ as it was in algorithm 1, but instead its task is now to score the output of the generator along the segment corresponding to to provide an approximate indication of where on the rate distortion curve the output of the generator lies. Whereby, each segment ^^ _^^, ^^ may have an associated score range of an overall score range covering the entire length of the rate distortion curve. In the context of the adversarial training, the task of the generator accordingly becomes to reconstruct images that have a score corresponding to rate distortion segments associated with a relatively higher quality reconstruction on the rate distortion curve when using a relatively lower target quality setting. Whereas the discriminators task is to give it a score that correctly corresponds with the target quality setting used to reconstruct the image. More specifically, ^^ in algorithm 2 corresponds to the scores output by the single discriminator rather than to classification probabilities. Figure 9 illustratively shows the approach of algorithm 2 applied 900 to the rate distortion curve of Figure 5. As with Figures 6 to 8, the rate distortion curve is divided into segments 501a, 501b, 501c, 501d, 501e. However now a single discriminator ℎ( ˆ _{^^ ^^} , ˆ _{^^ ^^}) is provided which outputs a score indicative of where along the rate

output of the generator. Figure 10 illustratively shows a first training iteration 1000 using the above single discrimi- nator approach whereby the randomly sampled ^^ corresponds to the quality pair ( ^^₂, ^^₃). Following algorithm 2, the discriminator outputs a score used to calculate the loss for the discriminator, in this case ^^ ^^ ^^ ^^ _{^^ ^^ ^^ ^^:( ^^2, ^^3)} which is used to update the weights of the discriminator, and the score used to calculate the loss for the generator ^^ ^^ ^^ ^^ _{^^ ^^ ^^:( ^^1)} , used to update the weights of the generator. Note that, as described

gets to see both the reconstructions from the higher and lower quality settings so its loss is based on both ^^₂ and ^^₃ (i.e. did it score the higher and lower quality settings correctly) whereas the adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting so is based only on the score for ^^₁ (i.e. was it able to reproduce an image that was scored as a higher quality image despite being created at a lower quality setting). Figure 11 illustratively shows a first training iteration 1100 using the above single discrimi- nator approach whereby the randomly sampled ^^ corresponds to the quality pair ( ^^₄, ^^₅). Following algorithm 2, the discriminator outputs a score used to calculate the loss for the discriminator, in this case ^^ ^^ ^^ ^^ _{^^ ^^ ^^ ^^:( ^^4, ^^5)} which is used to update the weights of the discriminator, and the score used to calculate the loss for the generator ^^ ^^ ^^ ^^ _{^^ ^^ ^^:( ^^4)} , used to update the weights of the generator. Note that, as described above, the discriminator gets to see both the reconstructions from the higher and lower quality settings so its loss is based on both ^^₄ and ^^₅ (i.e. did it score the higher and lower quality settings correctly) whereas the adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting so is based only on the score for ^^₄ (i.e. was it able to reproduce an image that was scored as a higher quality image despite being created at a lower quality setting). Figure 12 illustratively shows a first training iteration 1200 using the above single discrimi- nator approach whereby the randomly sampled ^^ corresponds to the quality pair ( ^^₂, ^^₅). Following algorithm 2, the discriminator outputs a score used to calculate the loss for the discriminator, in this case ^^ ^^ ^^ ^^ _{^^ ^^ ^^ ^^:( ^^2, ^^5)} which is used to update the weights of the discriminator, and the score used to calculate the loss for the generator ^^ ^^ ^^ ^^ _{^^ ^^ ^^:( ^^2)} , used to update the weights of the generator. Note that, as described above,

see both the reconstructions from the higher and lower quality settings so its loss is based on both ^^₂ and ^^₅ (i.e. did it score the higher and lower quality settings correctly) whereas the adversarial loss for the generator’s training step is only estimated from the reconstruction at the lower quality setting so is based only on the score for ^^₂ (i.e. was it able to reproduce an image that was scored as a higher quality image despite being created at a lower quality setting). In this example, the quality gap is larger than just neighbouring qualities so, early on in training for this randomly sampled ^^ , it may be difficult for the generator to trick the discriminator. However, for later training iterations when the generator is becoming more and more "competent" at tricking the discriminator, these large quality gaps along the curve can push the generator to produce reconstructions that are closer and closer towards the ground truth quality compared to when only neighbouring qualities are used. It is also noted that, for both algorithms 1 and 2 and more generally, the discriminator and generator may be trained entirely separately from each other. For example, a pre-trained discriminator may be provided with frozen weights and the generator trained against the frozen discriminator with the updating of the weights being confined to the generator for each iteration. Conversely, The discriminator may be provided with an already-generated set of training images of different reconstruction qualities that can be sampled from in the "Reconstruction" steps of algorithms 1 and 2 following a random selection of quality pairs from ^^ . As explained above, the approaches shown in algorithm 1 and 2 may optionally use a variable rate compression network. It is envisaged that various implementation approaches of such a network, and a number of alternatives, are possible. Some of these will be described in further detail below. Concept 2: Variable Rate Compression Network As described above, one component that facilitates the implementation of the above described laddered quality (i.e. compression rate) discriminator approach is the variable rate compression network. This is a network such as one or more of the compression pipelines illustrated in Figures 1 to 4 that are configurable with one or more control parameters that can be adjusted to alter the target quality of the reconstructed image output by the network (whereby quality may be defined by e.g. compression rate in bpp, and so on). This contrasts with a non-variable rate network which is trained only to produce the best possible quality images i.e. that looks the closest to the ground truth image. The variable rate compression network allows a target reconstruction quality to be set each time the generator is run during training without any a priori knowledge. Over time, as training progresses, the generator will produce reconstructed images that are consistently grouped together around a set distinct positions on the rate distortion curve, these positions become the eventual target reconstruction qualities such as ^^₁, ^^₂, ^^₃, ^^₄, ^^₅ in Figures 5 to 12. This ability to actively select a target reconstruction

facilitates the use of the laddered quality discriminator approach, but also allows a single network to be trained for use with different bit rate ladders such as those used by typical large scale content streaming platforms. An example implementation of a variable rate compression network will now be described with reference to Figure 13. Figure 13 illustratively shows a compression pipeline 1300 corresponding to that of Figure 2, or the I-frame module 200 of Figure 3, except that now a control unit ^^ 1301 and corresponding inverse control unit ^^⁻¹1302 are provided. On the encode side, the control unit 1301 is positioned before the quantisation module. On the decode side, the inverse control unit 1302 is positioned after the entropy decoder module. The control unit 1301 receives as input the latent representation and scales one or more channels of the latent representation through processing it with with a learned matrix (or vector when considering the latent on a per channel basis). Here processing may refer to multiplication, for example channel-wise multiplication. This has the effect of transforming the latent representation (i.e. modifying its values) before they are quantised and accordingly provides a degree of control over what input the quantisation module receives in turn effects how well or not well a given image will be compressed i.e. it facilitates rate control. More generally, the control unit 1301 may apply a control matrix ^^ ∈ ^^ ^{^^× ^^} where ^^ is the number of channels and ^^ is the number of control vectors ^^ _^^ = { ^^ _^^,0, ^^ _^^,1, ..., ^^ _{^^, ^^−1}} with ^^ denoting the index of the control vectors in the control

^^ _{^^, ^^} ∈ ^^ represents the ^^ th control value in the control vector ^^ _^^ and ^^ ranges

1, and where each channel may be associated with its value or values. It follows that applying the control matrix ^^ changes the latent representation by ^^ _{^^, ^^} = ^^ _^^ × ^^ _{^^, ^^} . More specifically, the operation applied by the control unit 1301 may be

^^ ^^ _^^ ( ^^, ^^) = ^^ ⊙ ^^ _^^ where ^^ _^^ (·) is the control unit’s operation and ⊙ is the channel-wise multiplication. The modification of the latent representation by control matrix ^^ can be understood at a more general level as corresponding to simulating better or worse compression rates by emphasising how many bits the network should assign to one or more of the input channels of the latent representation. At one extreme, if the control matrix ^^ corresponds to the identify matrix for all channels, then the modified latent representation corresponds to the original latent representation and the target reconstruction quality may be close to the ground truth image ^^ _{^^ ^^} . In this case the control matrix ^^ has no effect. At the other extreme, the control matrix ^^ may completely transform all the channels of the latent representation into single-bin, uniform values which may be perfectly losslessly compressed but from which it is practically impossible to reconstruct any meaningful image. Between these two extremes are a set of target reconstruction qualities (i.e. compression rates) which are defined by the control vectors ^^ _^^ of the control matrix ^^ and which emphasise or de-emphasise higher rate or conversely higher distortion during training and which encourage the network to learn to assign more or fewer bits to the various channels of the latent representation. On the decode side, a corresponding inverse control matrix ^^⁻¹ is applied by the inverse control unit ^^⁻¹ 1302 to the reconstructed latent representation output by the entropy decoder before it is passed into the decoder neural network to reconstruct the output image. Similarly to the control matrix ^^, the inverse control matrix ^^⁻¹ comprises a number of inverse control vectors ^^ _^ ⁻ ^ ¹ = { ^^ _^ ⁻ ^_,0 ¹ , ^^ _^ ⁻ ^_,1 ¹ , ..., ^^ _^ ⁻ ^_{, ^} ¹ ^₋₁}, and the operation of the inverse control unit ^^⁻¹1302 can be ^ˆ^ ⊙ ^^ _^ ⁻ ^ ¹. Here ^^⁻ ^_^ ¹ (·) and ⊙ corresponds

to the same channel-wise multiplication

the encode side. As indicated above, the elements of the control matrix ^^ are learned jointly with the other parameters of the compression pipeline (e.g. the weights of the encoders and decoders, and so on). Taking the control matrix ^^ and the inverse control matrix ^^−1 together, it is noted that there will always be pairs of corresponding control vectors { ^^ _^^, ^^⁻ ^_^ ¹ } bound by the corresponding index ^^ whereby one part of the pair is associated with the the encode side and one part is associated with the decode side. Given that the control matrix ^^ and inverse control matrix ^^−1 in general terms operate to simulate higher or lower compression rates by modifying the latent representation, they effectively act as regularisation terms. We can accordingly associate different Lagrange multipliers to be applied to the rate or distortion term of the loss function with each control vector pair during training of the compression pipeline. More specifically, for one training iteration, we select one Lagrange multiplier ^^ _^^ (which will emphasise rate or distortion more for that training iteration), together with its associated pair of control vectors { ^^ _^^, ^^ _^ ⁻ ^ ¹ }, evaluate the loss function with the rate or distortion term regularised by that Lagrange multiplier and control vectors, and finally update the elements of the network including the values of the elements of the control unit and inverse control unit, based on the evaluation of that loss function. For the next iteration, we may select a different Lagrange multiplier and its associated different control vector pairs and repeat the process but this time when we update the elements, it will be for the different control vectors. We then repeat this process throughout training, selecting different Lagrange multipliers and associated control vector pairs each iteration. At the start of training, the randomly initialised control vectors and network weights will have high losses, however after a few iterations loss decreases and the values of the elements of the control vectors start to converge resulting in a set of control vectors { ^^ _^^, ^^⁻ ^_^ ¹ } each associated with a different regularisation of the rate or distortion terms of the loss function, and thus each emphasising a reconstruction of the target image emphasising a different rate and distortion. Given that rate effectively determines how many artefacts a reconstructed image will have (e.g. a measure of quality), the control vectors allow the network to target specific rates and thus target reconstruction image qualities. Accordingly, during inference, a pair of the resulting, learned control vectors of the set can be selected and applied respectively to the latent representation on the encode side and to the output of the entropy decoder on the decode side to emphasise either more rate or more distortion. If more distortion (i.e. a lower rate and corresponding to worse image quality) is desired, the control vector pair associated with the higher distortion regularisation amount may be selected. Applied to the example of Figure 5, this may be associated with e.g. ^^₁. Conversely, if more rate (i.e. better image quality) is desired, the control vector pair associated with a lower distortion regularisation amount may be selected. Applied to the example of Figure 5, we may then reconstruct an image at e.g. ^^₂ or higher, and so on. In this way, the compression pipeline can be controlled to vary the rate (and thus target reconstruction quality) of the reconstructions it produces simply be selecting a different pair of the control vectors to apply to the latent representation and output of the entropy decoder respectively. As indicated above, not only is this approach conveniently suited to the laddered quality discriminator approach described above, but it also works as a standalone technique to train a compression network to output reconstructions at different target reconstruction qualities, such as those corresponding to typical bitrate ladders used by commercial content streaming providers without needing to train separate networks for each bitrate of the ladder. Further training details of the variable rate compression network will now be provided in more detail. Consider the typical rate distortion loss function based on the steps of the pipeline of Figure 1: ^^ = ^^( ^^( ^^ _^^ ( ^^))) + ^^ ( ^^ _^^ ( ^ˆ^)) where ^^ is the input image, ^^ _^^ is the encoder network that produces the latent representation from the input image, ^^ represents the quantisation operation applied to the latent repre- sentation, ^ˆ^ is the reconstructed latent representation, and ^^ _^^ is the decoder network that reconstructs the output image ˆ from the reconstructed latent representation. In order to introduce a dependence of the loss on the control unit and the inverse control unit ^^ _^^ and ^^⁻ ^_^ ¹ , the loss function is modified by the introduction of the regularisation term ^^ _^^ a pair of randomly initialised control vector pairs { ^^ _^^, ^^ _^ ⁻ ^ ¹ } that are linked to the regularisation term by the index ^^, that is: ^^ = ^^( ^^( ^^ _^^ ( ^^ _^^ ( ^^)))) + ^^ _^^ · ^^ ( ^^ _^ ⁻ ^ ¹ ( ^^ _^^ ( ^ˆ^)))

For the sake of example, assume we would like a set of five control vector pairs { ^^₁, ^^⁻ 1¹ }, { ^^₂, ^^⁻ 2¹ }, { ^^₃, ^^⁻ 3¹ }, { ^^₄, ^^⁻ 4¹ },{ ^^₅, ^^⁻ 5¹ }, each pair associated with an

We set up a corresponding set of Lagrange multipliers ^^₁ = 0.5, ^^₂ = 0.05, ^^₃ = 0.005, ^^₄ = 0.0005, ^^₅ = 0.00005. The specific values here are illustrative only where a higher value will emphasise distortion more compared to rate whereas a lower value will emphases rate more over distortion. In the first training iteration, we initialise the control vector pairs (e.g. with random values) and weights of the networks of the compression pipeline and then randomly select an ^^ value, let’s say ^^ = 3, which sets ^^₃ = 0.005, inserts the control vector pair { ^^₃, ^^⁻ 3¹ } into the pipeline at control unit ^^₃ and inverse control unit ^^₃ ⁻¹ , and thus results in the loss function for this iteration as: ^^ = ^^( ^^₃( ^^( ^^ _^^ ( ^^)))) + 0.005 × · ^^ ( ^^₃ ⁻¹ ( ^^ _^^ ( ^ˆ^))) Backpropagation is performed, and the weights of the network and values of the control vector pair { ^^₃, ^^⁻ 3¹ } are updated. For the next iteration, a different ^^ is randomly selected and the process is repeated for a predetermined number of training iterations or until some end of training condition is met (e.g. the training and/or validation loss stops decreasing, and so on). Whilst the initial training steps will not produce good reconstructions at any target image quality, after a few iterations, the output reconstructions associated with each randomly selected ^^ will start to converge to respective positions along the rate distortion curve, such as those shown illustratively in Figure 5. After training is completed, we have the set of set of five control vector pairs { ^^₁, ^^⁻ 1¹ }, { ^^₂, ^^⁻ 2¹ }, { ^^₃, ^^⁻ 3¹ }, { ^^₄, ^^⁻ 4¹ },{ ^^₅, ^^⁻ 5¹ } that, when applied during inference, result in the

target reconstruction qualities on the rate distortion curve. It will be appreciated that one or more bits indicating which target reconstruction quality is being used may be included in the bitstream as metadata. This may be in the main bitstream or, preferably, in the hypernetwork bitstream. In the latter case, the hypernetwork bitstream is decoded first, allowing the the entropy decoded modified latent representation to be processed by the inverse control unit 1302 using the correct control vectors before the processed modified latent representation is fed into the decoder to reconstruct the image at the target image quality. More generally it is envisaged that the different control vectors of the control units once learned will from part of the pipeline’s weights and architecture installed on the encode and decode side so it is not necessary to send all the values of the control vectors. Instead, a simple indication of which vector pairs from the set of vector pairs has been selected may be sent, and a lookup table used to retrieve the correct control vectors during decode. In practice, the effect the control vector pairs have in inference is to modify the latent representation before quantisation to make it more easily or less easily quantised into course or fine quantisation bins. That is, the learned values of the control vector pairs associated with the low rates typically end up with vector values that modify the distribution of the latent representation to be easily quantised into coarse bins whereas the values of the control vector pairs associated with higher rates typically result in a latent representation distribution that requires finer quantisation bins and which can accordingly not be compressed as easily. As described above, the exact values are learned parameters and may accordingly vary depending on the number of vector pairs in the set of vector pairs, as well as the values of the regularisation parameters in the set of regularisation parameters. As indicated above, this variable rate compression network approach using control units can be synergistically combined into the above-described laddered, quality discriminator concept to facilitate training of a compression pipeline. For example, the loss functions described above may include the above-described adversarial loss of the discriminator such that the variable rate compression network and its control units become the generator that competes against the set of discriminators or the single discriminator in algorithms 1 and 2. In this case, the random selection of target quality pairs from the set ^^ also defines which two ^^’s are selected for the control units. For example, if ^^ is selected in one training iteration to be { ^^₂, ^^₃} (i.e. corresponding to the segment on the rate distortion curve between ^^₂ and ^^₃, then the associated ^^ selection for training the control units will be ^^ = 2 and ^^ = 3, which results in the selection of the pairs { ^^₂, ^^⁻ 2¹ } and { ^^₃, ^^⁻ 3¹ }, and the associated Lagrange multipliers ^^₂ = 0.05, ^^₃ = 0.005

iteration. Then in the next training iteration, a different ^^ may be randomly selected and so on. The above values of ^^ _^^ are illustrative only and it will be appreciated that these may take on any value e.g. ^^ _^^ ∈ ^^. It is envisaged that the number of values in set of ^^ _^^s may correspond to the number of target quality rates the network is being trained to produce, thus allowing for adaptability to any level of granularity of segments on the rate distortion curve. For completeness, the final (i.e. highest) target image quality ^^ _{^^ ^^} may correspond to the ground truth image itself, whereby, with reference to Figure 5, if ^^ is randomly selected to be { ^^₅, ^^ _{^^ ^^} }, only one ^^ will be selected (for ^^ = 5) as the ground truth image ^^ itself can substitute the output of the generator for the discriminator to use as a reference comparison. Finally, it will be appreciated that variable rate network approach allows the entire compression pipeline and control units to be trained jointly in an end to end manner which is not only convenient but also results in an overall improved performance as there are no handcrafted features and it learns the optimal network weights and control unit values. An alternative, albeit less elegant approach, is to compare the output of the generator at the lower target quality of the reconstruction pair ^^ with a reference image corresponding to the ground truth image but which has been modified in some way to be slightly lower quality than the ground truth image (e.g. the ground truth image is provided as ^^ _{^^ ^^} and it is classically compressed using for example HEVC or AV1 to various different rates ^^₁₋₅ to provide the versions of the ground truth image for each of the segments of the rate distortion curve against which the discriminator(s) will providing a score or using to determine a classification loss) This approach effectively alters the "Reconstruction" step in algorithms 1 and 2 to reconstructing only the lower target quality image and feeding a corresponding higher target quality image taken from a training data set to the discriminator to compare the reconstruction against. This approach is less preferred as it requires substantially more training data and also in many cases ends up teaching the compression network to reconstruct classical compression image artefacts - something which is not desired. The former approach (i.e. the variable rate compression network approach with control units) is accordingly preferred as it is allows for a similar or better result with far less training data and avoids the baking of classical compression artefacts into the reconstructions. Notwithstanding the above, it is generally envisaged that laddered quality discriminator approach may alternatively use any other approach that allows the discriminator to receive as input two images of varying reconstruction qualities which may include the above described variable rate compression network approach, the increased training data set size approach, or any other approach, including for example the use of multiple, separate networks already trained to produce images at one or more predetermined qualities, which may be used by the discriminator to compare the generator’s output against. Finally, the above control unit approach may also be applied to the hypernetwork such as that shown in Figure 2. More specifically, a separate set of vector pairs may be introduced in control units for the hypernetwork and their values updated during training in the same way as set out above for the control units 1301 and 1302 of Figure 13. Figure 14 illustratively shows a compression pipeline 1400 corresponding to that of Figure 13, except that now a control unit ^^_ℎ 1401 and corresponding inverse control unit ^^^−1,ℎ 1402 are also provided for the hypernetwork. On the encode side, the control unit 1401 is positioned before the hypernetwork quantisation module. On the decode side, the inverse control unit 1302 is positioned after the hypernetwork entropy decoder module. Similarly to how the control unit 1301 in Figure 13 receives as input the latent representation of the input image, the control unit 1401 of the hyper network 1400 receives as input the hyper latent representation, that is, a latent representation of the latent representation, and scales one or more channels of the hyper latent representation by processing it with a learned matrix (or vector when considering the latent on a per channel basis). On the decode side, the inverse control unit 1402 of the hypernetwork performs an inverse operation by processing the output of the hypernetwork entropy decoder module. Processing may refer to multiplication, for example channel-wise multiplication. This processing has the effect of transforming the hyper latent representation (i.e. modifying its values) before they are quantised and accordingly provides a degree of control over what input the hyper network quantisation module receives in turn effects how well or not well a latent representation will be compressed in the hypernetwork. Whilst not shown, this approach can be applied further in a nested manner e.g. to a hyper-hyper network, and so on. As with the control units 1301, 1302 of the main network, the hypernetwork control units 1401, 1402 are defined by pairs of control vectors, each pair associated with a respective regularisation parameter ^^ _^^ during training, one of which is selected for each training step, resulting in the values of the selected pair of control vectors being updated for that training step. It is envsiaged that the control units 1401, 1402 of the hypernetwork may be trained jointly with the control units 1301, 1302 of the main network. By way of illustrative example, and considering the same example as that used in Figure 13, if a given ^^ is selected to be 3, this will set the regularisation parameter to be ^^₃, the control vector pair of the main network to be { ^^₃, ^^⁻ 3¹ }, and the control vector pair of the hypernetwork to be { ^^₃ ^ℎ , ^^^−1,ℎ 3 }. After evaluating

loss function and performing backpropagation, the values of both sets of the control vector pairs are updated and a new ^^ is selected for the next training step. In this way, the control vectors of the control units of both the main network and the hypernetwork can be jointly trained. During inference, the desired rate can be selected by choosing whichever pairs of control vectors correspond to the chosen rate and introducing these in the pipeline at 1301, 1302, 1401, 1402 and so on. Concept 3: Wavelet-space quantisation Moving now to quantisation concepts, as has been described above, STE quantisation is a known approach that is used during training to simulate the effects of the quantisation operations of the compression pipeline in a way that can be backpropagated through. That is, if a rounding function is applied to the latent representation to quantise it in the forward pass, a corresponding backward pass using STE simply assumes that the gradient through the rounding function is a predetermined value (e.g. 1). Rounding in a forward pass and using the STE approach in the backward pass is referred to herein split quantisation training. Whilst the STE approach facilitates the use of backpropagation, it causes a number of problems. Firstly, consider the values of the latent representation. Assuming these are normally distributed around a center (e.g. mean) then we would expect the distribution of the quantisation residuals associated with those latents quantised into a set of quantisation bins (i.e. the difference between the un-quantised values of the latent and the center of the bins they are quantised to) to also be distributed approximately around a zero-center, monotonically decreasing away from that center. The intuition here is that, if the latent values are normally distributed around a center and a suitable quantisation bin location parameter has been learned that corresponds approximately to that center, then the difference between the un-quantised values and their quantised counterparts will be very small and mostly distributed around a zero. An illustrative example of an ideal quantisation residual distribution 1500 is shown in Figure 15 where a peak 1501 around the zero-center 1502 is observed and the distribution monotonically decreases in the directions away from that center. The y-axis here may be count and the x-axis may be quantisation residual of an arbitrary latent representation. This distribution indicates that there are relatively few outliers. However, the inventors have realised that when STE quantisation is implemented in practice, for example as part of a split quantisation scheme, the quantisation residual distribution suffers from a phenomena where the tails of the distribution increase rather than continue to monotonically decrease. This phenomena is illustrated in the quantisation residual distribution 1600 shown in Figure 16 where the peak 1601 is shown in the center 1602 of the distribution and the tails 1603, 1604 of the distribution 1600 can be seen to increase towards the edges of the distribution. As above, The y-axis here may be count and the x-axis may be quantisation residual of an arbitrary latent representation. This phenomena can be understood as corresponding to an unexpectedly high number of latent values being quantised to centers of bins that are very far from their un-quantised ground truth value. The consequence of this is that the quantised latent representations have not been quantised as well as the theoretically ideal distribution of Figure 15 suggests they ought to be. In turn, this slows or prevents the loss from decreasing as quickly or as far as it might otherwise do as training progresses. In order to mitigate the quantisation residual increasing tail problem, the present disclosure applies a wavelet transform to latent representations before they are quantised, and an inverse wavelet transform to the reconstructed latent representations before they are decoded back into an image. In general terms, this pushes the part of the compression pipeline after the encoder into wavelet space where the inventors have found that the increasing tail phenomena described above does not occur. The result of this is that the quantisation residual distribution more closely follows the ideal distribution of Figure 15 with far fewer outliers. The loss during training using this approach accordingly decreases more quickly and converges to lower values than without the wavelet transform. Figure 17 illustrates a compression pipeline 1700 corresponding to that shown in Figure 2 except the above wavelet transform approach has been implemented. Specifically, a wavelet transform module 1701 which performs a wavelet transform ^^ is introduced after the encoder before quantisation is performed and before the hyper latent representation is produced, and a corresponding inverse wavelet transform module 1702 which performs an inverse wavelet transform ^^⁻¹ is introduced before the decoder. That is, the transform is applied to the latent before feeding it into the hypernetwork. Entropy model parameters can then be directly predicted in the transformed space. Applying orthogonal transforms on 2 × 2 blocks reshapes the image latent tensors from shape (1, C, H, W) to shape (1, 4C, H/2, W/2) and we can combine the new dimensions into the channel dimension. Hence, we are effectively adding an additional downsample into the network architecture, which has several implications. First, the question arises of whether or not the rest of the network is left as is, effectively forcing the pipeline to operate in a smaller (i.e. more downsampled latent space) downstream of the transform. Or alternatively, to remove remove a downsample/upsample pair from the layers of the hypernetwork in order to maintain the spatial footprint of other latents in the network. Both approaches are equally viable and for the remainder of the present disclosure it is envisaged that the latter approach is used. This is because it is simpler to implement and results in fewer modifications to the rest of the network architecture. However, it comes with some additional choices to be made on network architecture regarding how many channels are desired in the convolutional layers that receive downsampled input, and so on. However, the inventors have found that, even if the number of channels is increased by a factor of 4 wherever the Haar transform has downsampled the usual input to a layer (so that the number of elements in the output of the layer stays the same), the impact on run time is smaller than expected by an amount depending on the hardware device being used. This in turn facilitates a high degree of flexibility in how and where the Haar transforms may be implemented in the architecture. An alternative arrangement is shown in Figure 18 which illustrates the same compression pipeline as in Figure 17 except that in this pipeline 1800 the wavelet transform module is applied only to the the input of the quantisation module i.e. in Figure 18, the the hyper network receives as input the original latent representation that has not had a wavelet transform applied to it. That is, the wavelet transform is applied immediately before quantisation and rate calculation, and the entropy model parameters are predicted in the un-transformed space before being mapped appropriately by an orthogonal transform. Note also that we are not limited to applying the transform to the ^^ latent: we can also apply it to the hyper latent ^^ and, if present, a hyper hyper latent ^^, and so on. Applying a transform to the latent with the fully factorised entropy model has the particular appeal of increasing model capacity: as we use 4× as many channels, in this case we actually increase the number of parameters in the model. This further arrangement is shown in Figures 19a and 19b which illustrate the same compression pipeline as in Figures 17 and 18 respectively but now the pipeline 900 additionally incorporates a wavelet transform module 1901 and an inverse wavelet transform module 1902 in the hyper-network respectively positioned after the hyper-encoder and before the hyper-decoder. The same increasing tail effects that occurs in the quantisation residual distributions of latent representation of the main network also occur in the quantisation residual distributions of the hyper latent representation in the hyper network. Accordingly, this effect can equivalently be mitigated in the hyper-network by following the same approach. Further details of the implementation of the wavelet transform performed by the wavelet transform module (and its corresponding inverse) are now described. Introducing the wavelet transforms into the quantisation scheme can be described as follows: ^ˆ^ = ^^ ⌊ ^^( ^^ − ^^)⌉ + ^^, where ^^, ^^ are ^^ × ^^ matrices, where ^^ = ^^ · ^^ · ^^ is the product of the non-batch dimensions of the latent ^^, and where ^^ corresponds to the location parameter (i.e. center) of a respective quantisation bin. More specifically where ^^ = ^^⁻¹, these operations are inverses to one another and we may consider rewrite the above equation as: ^_{ˆ^ = diag(Δ) ^^} ^{⊤ ⌊} ^_{^ diag(Δ)} ⁻¹ _{( ^^ − ^^)} ^⌉ +_^^, where ^^ is an ^^ × ^^ orthogonal matrix ( ^^⁻¹ = ^^^⊤). In two dimensions, orthogonal matrices can be classified into rotations and reflections, but the geometric picture becomes more complex for higher dimensions (compositions of multiple reflections may be required). In all cases, they are isometries. That is, they leave the Euclidean norm of the vectors they act on unchanged. Considering that the full generality of isometries on the space that ^^ lives in is a vast space, we instead restrict our attention to smaller subsets of the orthogonal matrices. Two envisaged options are: (i) Performing the same orthogonal transformation over the channel dimension for each spatial position. This corresponds to applying a ^^ × ^^ orthogonal matrix ^^_ch to a reshaped version of ^^ in R ^{^^× ^^· ^^} , or equivalently taking ^^ = ^^_ch ⊗ ^^, where ⊗ is the Kronecker product and ^^ is the identity matrix of size ^^ · ^^ . (ii) Performing the same orthogonal transformation over 2 × 2 spatial blocks. This corresponds to applying a 4×4 orthogonal matrix ^^_2×2 to a reshaped version of ^^ in R^{4× ^^/4}, or equivalently taking ^^ = ^^( ^^_2×2 ⊗ ^^), where ^^ is a permutation matrix that orders the dimensions of ^^ into a series of 2 × 2 spatial blocks and ^^ is the identity matrix of size ^^/4. This is the framework which the remained of the description focuses on, albeit the same general ideas apply to the first framework. In both cases, the full matrix ^^ is also an orthogonal matrix as the Kronecker product of two orthogonal matrices is orthogonal. The orthogonal matrices can not be continuously parameterised as they have two connected components (corresponding to det ^^ = ±1. If we restrict our attention to those with positive determinant (the special orthogonal matrices), we can parametrise them in several ways. One such way is the Cayley Transform: let ^^ be a skew symmetric matrix ( ^^⊤ = − ^^). Then if we define ^^ as ^^ = ( ^^ − ^^) ( ^^ + ^^)⁻¹, it is a special orthogonal matrix. In fact, any special orthogonal matrix ^^ which does not have −1 as an eigenvalue (so that ^^ + ^^ is invertible) can be expressed in this form by writing: ^^ = ( ^^ − ^^) ( ^^ + ^^)⁻¹. One natural orthogonal transform on 2×2 blocks is the Haar transform, which is an example of a wavelet transform. The 2D Haar transform is most simply viewed as a set of four linear operations on each 2 × 2 block of an image or other two-dimensional signal such as a latent representation of an image. Given a block: © ^^₁₁ ^^₁₂ ^ª ^® ^® _® ^,

We can define the approximation of ^^ as 1 ^^ = ( ^^₁₁ + ^^₁₂ + ^^₂₁ + ^^₂₂) Applying this to every 2 × 2

results in a (rescaled) 2× downsampled version of the original the vertical difference of ^^ as (^v) 1 ^^ = ( ^^₁₁ − ^^₁₂ + ^^₂₁ − ^^₂₂). This highlights differences

difference of ^^ as (^h) 1 ^^ = ( ^^₁₁ + ^^₁₂ − ^^₂₁ − ^^₂₂). This highlights differences

difference of ^^ as (^d) 1 ^^ = ( ^^₁₁ − ^^₁₂ − ^^₂₁ + ^^₂₂). If we view ^^ as a vector ( ^^₁₁, ^^₁₂, ^^₂₁, ^^₂₂)^⊤, we can concatenate the four elements above ^{to write} ^© ^^ ^{ª ©} 1 1 1 1 ^{ª ©} ^^₁₁ ^ª or

= If a 2D signal such as an image or latent representation is rearranged into a concatenation of blocks in vector form, i.e. ^^ = ( ^^ ⁽¹⁾ | · · · | ^^ ^{( ^^/4)}), the Haar transform of the full

^^′ = ^^ ^^. Equivalently, we can retain the original 2D spatial structure of ^^ and perform a grouped convolution operation, with convolution kernels corresponding to the components of the Haar transform detailed above: 1 ^© 1 1 ^ª 1 ^© 1 −1 ^ª 1 ^© 1 1 ^{ª ©} 1 −1 ^ª

Indeed it is are wavelet transform module 1701 in Figure 17, and their inverse which is applied by the inverse wavelet transform module 1702. Whilst the above example uses Haar transforms, it is noted that the orthogonality of the resulting linear transformation is the primary property of importance here and it is accordingly envisaged that other transforms that have this property may alternatively be used. For completeness, it is noted that, whilst Haar transforms are known from traditional image compression, including being used in standards such as JPEG2000 where they replaced the Discrete Cosine Transform (DCT) in the original JPEG standard, they are used in a very different way and for a very different purpose. For example, they are used to decompose an image directly into components to facilitate the removal of high frequency components to compress the image. In contrast, the Haar transform operation applied in the present disclosure acts to decompose a latent representation (and/or a hyper latent representation) into different components, all of which are retained, in a wavelet space where the increasing tail problem unexpectedly does not appear to occur. The consequence of the use of the Haar transform in the present disclosure is accordingly to allow the loss to drop more quickly and lower during training, not to facilitate the removal of e.g. high frequency components. Thus, the purpose, implementation, and effects of Haar transforms in AI-based compression are significantly different and wholly unrelated to how and where Haar transforms are used in traditional compression. Figure 20 illustratively shows a plot 2000 of loss curves from training experiments including:(i) a baseline, control training run 1001 without any wavelet transforms applied, (ii) a training run 2002 with a wavelet transform (and later its inverse) applied only to the latent produced by the encoder of the main network, (iii) a training run 2003 with a wavelet transform (and later its inverse) applied only the hyper latent of the hyper network, (iv) a training run 1004 with a wavelet transform (and later its inverse) applied only to the hyper hyper latent of the hyper hyper network, and finally (v) a training run 2005 applied to all of the latent, the hyper latent, and the hyper hyper latent. Each of the training runs were performed for a predetermined number of steps and a drop in learning rate was implemented after a predetermined number of steps towards the end of the training. In each case, the wavelet transform was the Haar transform and the inverse wavelet transform was the corresponding inverse Haar transform. As is shown in Figure 20, applying the wavelet transform separately to any one of the latent, the hyper latent and the hyper hyper latent in each case results in both a quicker drop in loss at the start of training compared to the baseline control training run as well as an lower final loss at the end of training. Additionally, and particularly advantageously, the improvements are cumulative. Thus, when the Haar transform is applied across all of the latent, hyper latent and hyper hyper latent, the overall reduction in final loss (indicated by Δ ^^ ^^ ^^ ^^) is significant, for example around 10% compared to the baseline, control training run. With the biggest improvement arising from the application of the transform to the latent representation of the main network. As explained above, this improvement may be attributed to the mitigation of the quantisation residual increasing tail phenomena whereby the latent (and/or as hyper latent and/or hyper hyper latent, as applicable) without the transform suffer from an unexpected increase in large quantisation residuals (i.e. worse quantisation) compared to when the transform is present. Additionally, a synergistic advantage of this approach to mitigating the quantisation residual increasing tail phenomena is that wavelet transforms, and particularly Haar transforms, are extremely efficient to implement at a hardware level and accordingly mitigate the problem of the quantisation residual increasing tails at effectively no additional runtime cost. This is particularly advantageous for implementing the compression pipeline on mobile devices such as smartphones, tablets, and laptops. Concept 4: Orthogonal space quantisation More generally, it will also be appreciated that the above approach may be generalised to a larger class of invertible matrices that permit scaling. As will be explained in more detail below. Specifically, given the performance improvement provided by introducing the wavelet transform such as a Haar transform into the compression pipeline as described above, a question arises of whether or not it is possible to learn a general orthogonal transform that improves performance even more than wavelet transforms such as the Haar transform? To answer this question, consider the experimental results shown in Figure 21. Figure 21 shows a plot 2100 of loss curves from training experiments including: (i) a baseline, control training run 1101 without any wavelet transforms applied, (ii) a training run 2102 where the transform is a randomly generated orthogonal matrix transform, (iii) a run 2103 where the wavelet transform is the Haar transform„ and (iv) a training run 2104 with where the wavelet transform is replaced with a fixed (i.e. the same for all images) but learned orthogonal transform (parameterised by e.g. the Cayley transform method described above, and e.g. initialised from the Haar transform as the starting point during training). As with Figure 20, each of the training runs were performed for a predetermined number of steps and a drop in learning rate was implemented after a predetermined number of steps towards the end of the training. As can be seen in Figure 21, the run 2104 with the learned transform slightly outperforms the run where the Haar transform is used (i.e. it converges to a slightly lower loss than the other runs), whereas the randomly generated matrix only slightly outperforms the baseline. Accordingly we can infer that: (i) the Haar transform is a good choice of orthogonal transform, (ii) the learned transform initialised from e.g. the Haar transform does provide some small, marginal improvements, and (iii) there is surprisingly a benefit to doing some block/matrix transform (i.e. any orthogonal block/matrix transform), given the improvement that the randomly generated matrix has over the baseline. Additionally, the above experiments can be extended by replacing the fixed transform (i.e. the same transform for all images) to a predicted transform (i.e. the transform is different and depends on the input image). In this case, it was found that the predicted transforms on a per image basis provided only marginal improvements over the baseline and performed worse than the randomly generated transform, the Haar transform, and the fixed learned transform. Explained in another way, there are different approaches and options that can be taken when implementing the present approach. The first is whether to apply Haar just for quantisation and rate calculation (illustrated in Figure 18) or also for what we feed into the hyper network (illustrated in Figure 17). The predicted transform approach is envisaged to be used with the Figure 18 approach, but the Figure 17 approach performs better. Nevertheless, we can still make a like for like comparison of the approaches of Figure 17 with the different Figure 18 (i.e. transform for quantisation only) approach. That is, both of these approaches have to make a choice about how to deal with the scale parameter as described in the section below. That is: 1. modify the architecture to model the scale parameter in the transformed space 2. model the scale parameter in the un-transformed space and add a simplifying diagonal assumption 3. model the full covariance matrix in the untransformed space. The inventors have found that the first approach outperforms the other approaches. That is, option (1) is the preferred approach. This is an independent question to fixed transforms vs predicted transforms. A priori the predicted transform may work in some cases with approach (1) but in practice it the increase in performance is not as great. This is explained in further detail below. As is known, learned compression pipelines may use hyper-networks to send additional information such as, for example, location parameter ^^ and scale parameter Δ information, and, where the latent’s distribution is defined by e.g. a Laplacian distribution, Laplace scale parameter information. This information is used to entropy encode and entropy decode the bitstreams thereby facilitating the prediction of the ^^, Δ and for a given input image. When applying the transforms as described above, the predicted entropy parameters may be predicted in the untransformed space as usual and the transformations may be performed on the latent immediately before quantisation and rate calculation. For ^^ and Δ, it is clear how one should incorporate them with the orthogonal transformation ^^. However, for the Laplace scale parameter ^^, it is much less clear. For mathematical convenience, we will work with variance, which is is equal to 2 ^^² for a univariate Laplace distribution. If we consider a set of flattened 2 × 2 blocks ^^ ^{( ^^)}; ( ^^ ^{( ^^)} 1₁ , ^^ ^{( ^^)} 1₂ , ^^ ^{( ^^)} 2₁ , ^^ ^{( ^^)} 2₂ )^⊤ ^^ = 1, ... , ^^

sampled from a zero-mean distribution, the empirical within-block covariance matrix is given by ^^ = ^^ ^^⊤, where ^^ = ( ^^ ⁽¹⁾ | · · · | ^^ ^{( ^^)}) is the data matrix for the blocks.

The transformed data matrix is given by ^^′ = ^^ ^^ and the covariance matrix for the transformed data is ^^′ = ^^ ^^ ^^⊤ ^^⊤ = ^^ ^^ ^^⊤, with elements given by ^^′ ∑^ ∑^ ^^_{^ ^^} = ^^ _{^^ ^^} ^^ _{^^ ^^} ^^ _{^^ ^^} . ^^ ^^

In our factorised entropy model, we essentially assume that the covariance matrix in the space that we do our rate calculation in is diagonal. Thus, as we perform the rate calculation in the ^^-transformed space, we must have; ^{^ ^ ^ ^} ∑ ∑ ^^{^} ^^ _^^ ^^ _{^^ ^^} ^^ _{^^ ^^} ^^ _{^^ ^^} ^^ = ^^ ,

We now have two choices: If we wish to make use of regular scale predictions for the entropy encode and decode, we are making the additional assumption that ^^ is diagonal, yielding ^^{^ ^ ^} ∑ ^^{^} _^^ ^^2 ^_{^ ^^} ^^ _{^^ ^^} ^^ = ^^ ,

Otherwise, the network has to model the full covariance matrix ^^ in the original space. This requires 10 parameters per 2 × 2 block (as covariance matrices are symmetric) rather than the usual 4, which is a fairly modest increase. Note ^^ must also be positive definite, so it is easier to predict the Cholesky factor ^^ (a lower triangular matrix with positive diagonal entries) such that ^^ = ^^ ^^⊤ in order to enforce this. It is worth noting that while this analysis holds up well for the fixed, learned transforms where the data matrix ^^ has shape 4 × ^^/4, it is somewhat degenerate for the per-image, predicted case, where we have to predict the covariance matrix for each block from a single observation. In this case, the resulting covariance matrix has rank 1 and will never be diagonal. Accordingly, when the predicted transform (i.e. a different transform for each image) approach is applied, the gains are only marginal over the baseline. Effectively, the advantages that are provided by the fixed, learned transform and the Haar transform, break down for a predicted, per-image transform. Finally, when considering whether or not to apply the predicted transform approach in the transformed space (i.e. Figure 19b - where all further transforms are applied after the wavelet transform 1701) or in the original space (i.e. Figure 19a - where the hyper-network transform is applied to the output of the hyper encoder which receives the un-transformed latent in original space), the former is preferred as it provides slightly better performance gains over the latter. That is, it is found to be preferable to use architectures where we directly predict the scale parameters in the transformed space rather than explicitly transforming a scale prediction made in the original space. Concept 5: Invertible space quantisation At an even more general level, it is also envisaged that benefits may derived from a larger class of invertible matrices that permit scaling. In the case of predicted transforms, these would absorb our usual scale parameter Δ predictions. That is, an arbitrary real matrix ^^ can be broken down into its singular value decomposition (SVD): ^^ = ^^Σ ^^⊤ where ^^, ^^ are orthogonal and Σ is diagonal and non-negative. Thus, if we have a parametri- sation of the special orthogonal matrices, we can use this to construct a parametrisation of the invertible matrices with positive determinant.

In practice, it was found that there were some performance gains when using any, general invertible transforms over the orthogonal transforms described above (in both the fixed, learned; and per-image, predicted, configurations). However, these are small and have several drawbacks relating to run-time, training stability and code complexity. Nevertheless, given that this generalised approach still provides performance gains, it is envisaged to be used in cases where the above drawbacks are of less importance than the performance gains. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in 25 multiple examples separately or in any suitable sub-combination. Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal. The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computers suitable for the execution of a computer program include, by way of example, can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phones, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDAs, a smart phone, or other stationary or portable devices, that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable sub-combination. Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Claims

CLAIMS 1. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a corresponding a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation at a targtet image quality of the input image; evaluating a function based on a difference between the output image and a corre- sponding image having an image quality different to the target image quality; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.

2. The method of claim 1 wherein the target image quality is defined by a target compression rate.

3. The method of any of claims 1 or 2, wherein the difference between the output image and the corresponding image having an image quality different to the target image quality is determined based on the output of a third neural network acting as a discriminator between the output image and the corresponding image.

4. The method of any of claims 1 to 3, wherein the target image quality and the image quality of the corresponding image for a first iteration of said steps are different to the target image quality and the image quality of the corresponding image for a second iteration of said steps.

5. The method of any of claims 1 to 4, comprising selecting the target image quality by selecting a regularisation parameter value from a set of regularisation parameter values.

6. The method of claim 5, wherein the function is a loss function comprising a rate term and a distortion term and wherein the regularisation parameter value is associated with the rate term or the distortion term.

7. The method of any of claims 5 to 6, comprising: selecting a pair of vectors from a plurality of vector pairs and processing the latent representation using one vector of the selected pair to produce a modified latent representation; processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using the second neural network to produce the output image, wherein each pair of vectors from the plurality of vector pairs is associated with a corresponding one regularisation parameter value of the set of regularisation parameter values.

8. The method of claim 7, wherein said selection of the regularisation parameter value and corresponding pair of vectors comprises a random selection.

9. The method of claim 8, wherein the selection is different for a first iteration of said steps compared to a second iteration of said steps.

10. The method of any of claims 1 to 9, wherein the difference between the output image and the corresponding image is determined based on an output of at least one third neural network of a plurality of third neural networks acting as respective discriminators for different target image qualities.

11. The method of any of claims 3 to 10, wherein the target image quality for a first iteration of said steps is different to a target image quality for a second iteration of said steps, and wherein each target image quality is associated with a corresponding at least one third neural network of the plurality of third neural networks.

12. The method of claim 11, wherein the target image quality for a third iteration of said steps is different to a target image quality of the first and/or second iteration of said steps.

13. The method of claim 12, wherein at least one target image quality corresponds to a copy of the input image.

14. The method of any of claims 3 to 13, wherein at least one of the one or more third neural networks is configured to classify the output image into one or more image quality classes.

15. The method of claim 14, wherein the difference on which the function is based is comprises a classification of the output image into the one or more image quality classes.

16. The method of any of claims 3 to 15, wherein at least one of the one or more third neural networks is configured to score the output image against the corresponding image.

17. The method of claim 16, wherein the difference on which the function is based comprises the score output by the third neural network.

18. The method of claim 16 or 17, wherein at least one of the third neural networks comprises a Wasserstein discriminator.

19. The method of any of claims 3 to 18, comprising updating the parameters of at least one of the third neural networks based on the evaluated function in a first of said steps.

20. The method of claim 19, comprising updating the parameters of the first neural network and the second neural network based on the evaluated function in a second of said steps.

21. The method of any of claims 1 to 20, wherein the target image quality is defined in bits per pixel.

22. The method of any of claims 1 to 21, comprising entropy encoding the latent representation into a bitstream having a bit length, and wherein the target image quality is based on the bit length of the bitstream.

23. The method of any of claim 1 to 22, wherein the target image quality is defined by a number of image artefacts in the output image.

24. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural networks to produce a latent representa- tions; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation at a first target image quality of the input image; evaluating a function based on a difference between the output image and a previously decoded image, the previously decoded image being an approximation of the input image at a second target image quality; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.

25. A data processing apparatus configured to perform the method of any of claims 1 to 24.

26. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; selecting a pair of vectors from a plurality of vector pairs and processing the latent representation using one vector of the selected pair to produce a modified latent representation; processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between the input image and the output image, wherein the function comprises a rate term and a distortion term, and wherein one or both of the rate term or distortion term is regularised by a regularisation parameter associated with the selected vector pair; updating the parameters of the first neural network, the second neural network, and the values of the selected vector pair based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network, a second trained neural network, and a learned plurality of vector pairs.

27. The method of claim 26, wherein the selection of the pair of vectors and associated regularisation parameter is a random selection from the plurality of vector pairs and associated regularisation parameters.

28. The method of claim 26 or 27, wherein the selection of the pair of vectors and associated regularisation parameter in a first iteration of said steps is different to the selection in a second iteration of said steps.

29. The method of claim 28, wherein the selection of the pair of vectors and associated regularisation parameter in a third iteration of said steps is different to the selection in the first and/or second iteration of said steps.

30. The method of any of claims 26 to 29, wherein each pair of vectors and associated regularisation parameter is associated with a target reconstruction quality of the output image.

31. The method of any of claims 26 to 30, wherein each pair of vectors and associated regularisation parameter is associated with one of a plurality of target bitrates of a video encoding bitrate ladder.

32. The method of any of claims 26 to 31, comprising: encoding the latent representation using a third neural network to produce a hyper latent representation; selecting a second pair of vectors from a plurality of vector pairs and processing the hyper latent representation using one vector of the selected pair to produce a modified hyper latent representation; processing the modified hyper latent representation using the other vector of the selected vector pair and decoding the processed modified hyper latent representation using a fourth neural network.

33. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; processing the latent representation using one vector of a selected pair of vectors to produce a modified latent representation; transmitting the modified latent representation to a second computer system; and processing the modified latent representation using the other vector of the selected vector pair and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at a target image quality associated with the selected vector pair.

34. The method of claim 33, comprising transmitting from the first computer system to the second computer system information indicative of the selection of the vector pair.

35. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; processing the latent representation using one vector of a selected pair of vectors to produce a modified latent representation; transmitting the modified latent representation to a second computer system.

36. A method for lossy image or video decoding, the method comprising the steps of: receiving a modified latent representation at a second computer system; processing the modified latent representation using a vector of a selected vector pair and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at a target image quality associated with the selected vector pair.

37. A data processing system configured to perform the method of any of claims 26 to 36.

38. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; applying a first wavelet transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; applying a second wavelet transform to the quantised transformed latent representation to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.

39. The method of claim 38, wherein the first wavelet transform and the second wavelet transform comprise Haar transforms.

40. The method of claim 38 or 39, wherein the second wavelet transform comprises an inverse of the first wavelet transform.

41. The method of any of claims 38 to 40, wherein quantising the transformed latent representation comprises applying a rounding function to the latent representation.

42. The method of claim 41, wherein updating the parameters of the first neural network and the second neural network based on the evaluated function comprises back-propagating a gradient of the function through the rounding function using straight through estimation.

43. The method of claim 42, wherein using straight through estimation comprises setting an incoming gradient at the rounding function equal to a constant.

44. The method of any of claims 38 to 43, comprising: encoding the latent representation using a third neural network to produce a hyper latent representation; applying a third wavelet transform to the hyper latent representation to produce a transformed hyper latent representation; quantising the transformed hyper latent representation; applying a fourth wavelet transform to the quantised transformed hyper latent representation to produce the hyper latent representation decoding the hyper latent representation using a fourth neural network; and using the decoded hyper latent representation to decode the latent representation.

45. The method of claim 44, wherein the fourth wavelet transform comprises an inverse of the third wavelet transform.

46. The method of claim 44 or 45, wherein encoding the latent representation using the third neural network comprises encoding the transformed latent representation.

47. The method of any of claims 44 to 46, comprising: encoding the hyper latent representation using a fifth neural network to produce a hyper hyper latent representation; applying a fifth wavelet transform to the hyper hyper latent representation to produce a transformed hyper hyper latent representation; quantising the transformed hyper hyper latent representation; applying a sixth wavelet transform to the quantised transformed hyper hyper latent representation to produce the hyper hyper latent representation decoding the hyper hyper latent representation using a sixth neural network; and using the decoded hyper hyper latent representation to decode the hyperlatent representation.

48. The method of claim 47, wherein the sixth wavelet transform comprises an inverse of the fifth wavelet transform.

49. The method of claims 38 to 48, wherein the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first wavelet transform to the latent representation.

50. The method of any of claims 38 to 49, wherein the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation.

51. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first wavelet transform to the latent representation to produce a transformed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second wavelet transform to the quantised transformed latent representation to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.

52. The method of claim 51, wherein the first wavelet transform and second wavelet transform comprise Haar transforms.

53. The method of claim 51 or 52, wherein the second wavelet transform comprises an inverse of the first wavelet transform.

54. A data processing apparatus configured to perform the method of claims 38 to 53.

55. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representa- tion; applying a first orthogonal transform to the latent representation to produce a trans- formed latent representation; quantising the transformed latent representation; applying a second orthogonal transform to the quantised transformed latent represen- tation to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.

56. The method of claim 55, wherein the second orthogonal transform comprises an inverse of the first orthogonal transform.

57. The method of claim 55 or 56, wherein the first and second orthogonal transforms are respectively defined by an orthogonal matrix, and wherein the method further comprises updating the values of each said orthogonal matrix based on the evaluated function; and repeating said steps to produce said learned values of each said orthogonal matrix.

58. The method of claim 55 or 56, wherein the first and second orthogonal transforms are respectively defined by an orthogonal matrix comprising random values.

59. The method of claim 57, wherein one or more values of each said orthogonal matrix is defined by a dependency on the input image.

60. The method of claim 57, wherein one or more values of each said orthogonal matrix is independent of the input image.

61. The method of claims 55 to 60, wherein the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first orthogonal transform to the latent representation.

62. The method of any of claims 55 to 61, wherein the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation.

63. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first orthogonal transform to the latent representation to produce a trans- formed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second orthogonal transform to the quantised transformed latent represen- tation to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.

64. A data processing apparatus configured to perform the method of claims 55 to 63.

65. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representa- tion; applying a first invertible transform to the latent representation to produce a trans- formed latent representation; quantising the transformed latent representation; applying a second invertible transform to the quantised transformed latent representa- tion to produce the latent representation; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image. evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.

66. The method of claim 65, wherein the second invertible transform comprises an inverse of the first orthogonal transform.

67. The method of claim 65 or 66, wherein the first and second invertible transforms are respectively defined by an invertible matrix, and wherein the method further comprises updating the values of each said invertible matrix based on the evaluated function; and repeating said steps to produce said learned values of each said invertible matrix.

68. The method of claim 67, wherein one or more values of each said invertible matrix is defined by a dependency on the input image.

69. The method of claim 67, wherein one or more values of each said invertible matrix is independent of the input image.

70. The method of claims 65 to 69, wherein the step of quantising modifies one or more elements of the transformed latent representation by a quantisation residual amount, and wherein said quantisation residual amount is reduced by said applying of the first invertible transform to the latent representation.

71. The method of any of claims 65 to 70, wherein the step of quantising is applied to the whole transformed latent representation without removing any elements of the transformed latent representation, and wherein said decoding is applied to the whole latent representation without removing any elements of the latent representation.

72. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; applying a first invertible transform to the latent representation to produce a trans- formed latent representation; quantising the transformed latent representation; transmitting the quantised transformed latent representation to a second computer system; applying a second invertible transform to the quantised transformed latent representa- tion to produce the latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.

73. A data processing apparatus configured to perform the method of claims 65 to 72.