
WO2024246275A1 - Method and data processing system for lossy image or video encoding, transmission and decoding


Info

Publication number
WO2024246275A1
WO2024246275A1 (PCT/EP2024/065017)
Authority
WO
WIPO (PCT)
Prior art keywords
latent
output
frame
hyper
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/EP2024/065017
Other languages
French (fr)
Inventor
Christopher FINLAY
Bilal ABBASI
Arsalan ZAFAR
Christian BESENBRUCH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deep Render Ltd
Original Assignee
Deep Render Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Priority claimed from GBGB2308321.5A external-priority patent/GB202308321D0/en
Priority claimed from GBGB2310882.2A external-priority patent/GB202310882D0/en
Application filed by Deep Render Ltd filed Critical Deep Render Ltd
Publication of WO2024246275A1 publication Critical patent/WO2024246275A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/89Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving methods or arrangements for detection of transmission errors at the decoder
    • H04N19/895Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving methods or arrangements for detection of transmission errors at the decoder in combination with error concealment
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation

Definitions

  • This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
  • image and video content is compressed for transmission across the network.
  • the compression of image and video content can be lossless or lossy compression.
  • lossless compression the image or video is compressed such that all of the original information in the content can be recovered on decompression.
  • lossless compression there is a limit to the reduction in data quantity that can be achieved.
  • lossy compression some information is lost from the image or video during the compression process.
  • Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that is not particularly noticeable to the human visual system.
  • JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
  • known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations.
  • In the realm of lossy video compression in particular, the removal of redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
  • One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality.
  • I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video.
  • P-frames or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame.
  • The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
  • B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames.
  • Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process.
  • the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content.
  • AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
  • a method for lossy image or video encoding, transmission and decoding comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
  • Prior to the step of transmitting the latent representation, the latent representation may be divided into a plurality of sub-latents; and the plurality of sub-latents are transmitted to the second computer system; wherein the step of identifying one or more values of the latent representation that have been affected by an error comprises identifying one or more of the plurality of sub-latents containing the one or more values that have been affected by an error; and the step of replacing one or more of the identified values of the latent representation with a replacement value comprises replacing each of the values of the identified sub-latents with the replacement value.
  • the replacement value may be selected from a set of values that correspond to the possible values of a probability distribution associated with the latent representation.
  • the replacement value may be the modal value of the probability distribution associated with the latent representation.
  • the replacement value may be a zero value.
  • the replacement value may not be a possible value of a probability distribution associated with the latent representation.
  • the replacement value may be based on at least one previously decoded frame of the video.
  • the replacement value may correspond to a value of a previously received latent representation corresponding to the at least one previously decoded frame of the video; wherein the location of the value of the previously received latent representation corresponds to the location of the identified value of the latent representation.
  • a flow may be applied to the previously received latent representation to obtain the replacement value.
  • the replacement value may be obtained from the output of a trained neural network that receives at least one of the previously decoded frames of the video and/or the corresponding previously received latent representation as an input.
  • the output frame obtained from the latent representation comprising at least one replacement value may be used to compress, transmit and decompress a further frame of the video.
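  • By way of illustration only, the sketch below shows how the replacement strategies described in the preceding paragraphs might be applied to a decoded latent. The function and parameter names, the boolean error mask and the use of the entropy-model mean as the modal value are assumptions made for the example and are not prescribed by the method.

```python
import numpy as np

def replace_corrupted_values(latent, error_mask, mode="modal",
                             mu=None, prev_latent=None):
    """Illustrative replacement of latent values flagged as errored.

    latent:      decoded latent representation, e.g. an (H, W) array
    error_mask:  boolean array, True where a value was affected by an error
    mu:          per-element mean of the entropy model (used here as the
                 modal value of a symmetric, e.g. Gaussian, distribution)
    prev_latent: latent of a previously decoded frame, same shape
    """
    repaired = latent.copy()
    if mode == "zero":                          # replacement value is zero
        repaired[error_mask] = 0.0
    elif mode == "modal" and mu is not None:    # mode of the entropy model
        repaired[error_mask] = np.round(mu[error_mask])
    elif mode == "previous" and prev_latent is not None:
        # co-located value from a previously received latent representation
        repaired[error_mask] = prev_latent[error_mask]
    return repaired
```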
  • a method for lossy image or video encoding, transmission and decoding comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; encoding the latent representation using a second trained neural network to produce a hyper-latent representation; entropy encoding the latent representation and the hyper-latent representation to obtain a bitstream; transmitting the bitstream to a second computer system; entropy decoding the bitstream to obtain the hyper-latent representation; identifying one or more values of the hyper-latent representation that have been affected by an error; decoding the hyper-latent representation using a third trained neural network to produce an output; entropy decoding the bitstream using the output of the third trained neural network to obtain the latent representation; replacing one or more of the values of the latent representation that correspond to the identified values of the hyper-latent representation with a replacement value; and decoding the latent representation using a fourth trained neural network to produce an output image, wherein the output image is an approximation of the input image.
  • a method of training one or more neural networks comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; replacing one or more of the values of the latent representation with a replacement value; and decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
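  • A minimal training sketch corresponding to the training method above is given below, assuming a toy convolutional encoder and decoder, a mean squared error distortion term only, and zero as the replacement value; the architecture, optimiser settings and replacement probability are illustrative assumptions rather than part of the method.

```python
import torch
import torch.nn as nn

# Toy encoder/decoder pair (works for even-sized inputs); illustrative only.
encoder = nn.Sequential(nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
                        nn.Conv2d(64, 192, 5, stride=2, padding=2))
decoder = nn.Sequential(nn.ConvTranspose2d(192, 64, 5, stride=2, padding=2,
                                           output_padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2,
                                           output_padding=1))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-4)

def training_step(x, drop_prob=0.05):
    y = encoder(x)                                 # latent representation
    mask = torch.rand_like(y) < drop_prob          # simulated transmission errors
    y = torch.where(mask, torch.zeros_like(y), y)  # replacement value (zero here)
    x_hat = decoder(y)                             # output image
    loss = torch.mean((x - x_hat) ** 2)            # distortion term only, for brevity
    opt.zero_grad()
    loss.backward()                                # back-propagate and update both networks
    opt.step()
    return loss.item()
```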
  • a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation transmitted by a first computer system at a second computer system, the latent representation corresponding to an input image; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a first trained neural network to produce an output image, wherein the output image is an approximation of the input image.
  • a method for lossy video encoding, transmission and decoding comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first trained neural network to produce a set of residual parameters; encoding the flow using a second trained neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
  • the set of residual parameters may comprise a latent residual and the set of flow parameters may comprise a latent flow and the method may further comprise the steps of: decoding the latent residual using a third trained neural network to obtain an output residual; and decoding the latent flow using a fourth trained neural network to obtain an output flow; wherein the output residual and the output flow are used to obtain the output frame.
  • the error mitigation process may comprise applying the output flow to a previously decoded output frame corresponding to the previously decoded frame of the video to obtain the output frame.
  • the error mitigation process may comprise obtaining the output frame as the output of a function that receives the output flow and one or more previously decoded output frames as an input.
  • the error mitigation process may comprise applying a previously decoded output flow corresponding to the previously decoded frame to obtain the output frame.
  • the error mitigation process may comprise applying an estimated output flow to obtain the output frame, wherein the estimated output flow is obtained based on a plurality of previously decoded output flows.
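  • The sketch below illustrates one possible decision logic combining the flow-based mitigation options described above, assuming dense per-pixel flow maps and a simple nearest-neighbour warp; the fallback order, the averaging of previous flows and the sign convention of the warp are illustrative assumptions, not the pipeline's actual behaviour.

```python
import numpy as np

def warp_nearest(frame, flow):
    """Nearest-neighbour warp of frame (H, W, C) by flow (H, W, 2) in pixels.
    The backward-sampling sign convention is one common choice."""
    H, W = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    return frame[src_y, src_x]

def mitigate_frame(prev_frame, output_flow=None, prev_flows=None):
    """Illustrative fallback when residual and/or flow data is errored.

    prev_frame:  previously decoded output frame, (H, W, C)
    output_flow: decoded flow for the current frame, or None if errored
    prev_flows:  list of previously decoded flows, used to estimate a flow
    """
    if output_flow is not None:
        flow = output_flow                  # residual errored: warp with the current flow
    elif prev_flows:
        flow = np.mean(prev_flows, axis=0)  # flow errored: estimate from previous flows
    else:
        return prev_frame                   # nothing usable: repeat the previous frame
    return warp_nearest(prev_frame, flow)
```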
  • the method may further comprise the steps of: encoding the latent residual using a fifth trained neural network to produce a hyper-latent residual, wherein the hyper- latent residual is included in the set of residual parameters; encoding the latent flow using a sixth trained neural network to produce a hyper-latent flow, wherein the hyper-latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-latent residual using a seventh trained neural network to obtain an output and using the output to obtain the latent residual; and decoding the hyper-latent flow using an eighth trained neural network to obtain an output and using the output to obtain the latent flow.
  • the error mitigation process may comprise applying the hyper- latent flow and the latent flow to obtain the output frame.
  • the error mitigation process may comprise applying the hyper-latent residual and the latent residual to obtain the output frame.
  • the method may further comprise the steps of: encoding the hyper-latent residual using a ninth trained neural network to produce a hyper-hyper-latent residual, wherein the hyper-hyper-latent residual is included in the set of residual parameters; encoding the hyper-latent flow using a tenth trained neural network to produce a hyper-hyper-latent flow, wherein the hyper-hyper- latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-hyper-latent residual using an eleventh trained neural network to obtain an output and using the output to obtain the hyper-latent residual; and decoding the hyper-hyper-latent flow using a twelfth trained neural network to obtain an output and using the output to obtain the hyper-latent flow.
  • the error mitigation process may comprise applying the hyper-hyper-latent flow, the hyper-latent flow and the latent flow to obtain the output frame.
  • the error mitigation process may comprise applying the hyper-hyper-latent residual, the hyper-latent residual and the latent residual to obtain the output frame.
  • the error mitigation process may comprise applying a process based on the other of the latent residual, the hyper-latent residual, the hyper-hyper-latent residual, the latent flow, the hyper-latent flow and the hyper-hyper-latent flow not identified as containing an error to obtain the output frame.
  • the error mitigation process may be applied to a sub-section of at least one of the set of residual parameters and the set of flow parameters.
  • the error mitigation process may be applied to a sub-section of at least one of the latent residual or the latent flow.
  • each of the third, fourth, seventh, eighth, eleventh and twelfth trained neural networks may be selected from a corresponding set of trained neural networks before decoding; wherein each selection is based on the error mitigation process.
  • a method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first neural network to produce a set of residual parameters; encoding the flow using a second neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input frames to produce a first trained neural network and a second trained neural network.
  • a method for lossy video receipt and decoding comprising the steps of: receiving a set of residual parameters and a set of flow parameters transmitted by a first computer system at a second computer system, the set of residual parameters and the set of flow parameters corresponding to an input frame; identifying an error in at least one of the set of residual parameters and the set of flow parameters; and using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
  • a method for lossy image or video encoding, transmission and decoding comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent and producing an auxiliary latent based on the extracted values; transmitting the latent representation and the auxiliary latent to a second computer system; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
  • the method may further comprise the step of compressing the auxiliary latent prior to transmitting the auxiliary latent to the second computer system.
  • the compression may be lossless or lossy compression.
  • the compression may comprise encoding the auxiliary latent using a third trained neural network to obtain an auxiliary hyper-latent; and at the second computer system, decoding the auxiliary hyper-latent using a fourth trained neural network to obtain an output auxiliary latent; wherein the output auxiliary latent is used to verify the latent representation.
  • the auxiliary latent may comprise a plurality of values, wherein each of the plurality of values is based on a plurality of values of the latent representation.
  • At least two of the plurality of values of the auxiliary latent may be based on the same value of the latent representation.
  • Each of the values of the latent representation may be used to determine only one of the plurality of values of the auxiliary latent.
  • Each of the plurality of values of the auxiliary latent may be based on a linear or non-linear combination of the plurality of the values of the latent representation.
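  • As an illustration of how an auxiliary latent might be formed from combinations of latent values and used for verification, the sketch below uses a sum over non-overlapping blocks as the (linear) combination; the block size, the tolerance and the choice of combination are assumptions made for the example.

```python
import numpy as np

def make_auxiliary_latent(latent, block=4):
    """Each auxiliary value summarises a block of latent values (here a sum
    over non-overlapping block x block patches, i.e. a linear combination)."""
    H, W = latent.shape
    h, w = H // block, W // block
    return latent[:h * block, :w * block].reshape(h, block, w, block).sum(axis=(1, 3))

def verify_latent(received_latent, received_aux, block=4, tol=1e-6):
    """Recompute the auxiliary latent from the received latent and compare.
    Returns a boolean map of auxiliary positions that failed verification."""
    recomputed = make_auxiliary_latent(received_latent, block)
    return np.abs(recomputed - received_aux) > tol
```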
  • the method may further comprise the steps of: at the first computer system, entropy encoding the latent representation before transmission; encoding the latent representation using a fifth trained neural network to produce a hyper-latent representation; transmitting the hyper-latent representation to the second computer system; decoding the hyper-latent representation using a sixth trained neural network to obtain an output; and entropy decoding the latent representation, wherein the output of the sixth trained neural network is used during the entropy decoding.
  • the output of the sixth trained neural network may comprise one or more of the mean and the standard deviation of the probability distribution of the values of the latent representation.
  • the method may further comprise the steps of: at the first computer system, entropy encoding the auxiliary latent before transmission; and at the second computer system, entropy decoding the auxiliary latent, wherein the output of the sixth trained neural network is used during the entropy decoding.
  • the method may further comprise the step of, when an error is detected in at least one of the values of the latent representation in the verification step, replacing the error values with a replacement value based on at least one of the other values of the latent representation and at least one value of the auxiliary latent.
  • the replacement value may be additionally based on the output of the sixth trained neural network.
  • the replacement value may be selected from a probability distribution based on at least one of the mean and the standard deviation of the probability distribution of the values of the latent representation.
  • the steps of the production and transmission of the auxiliary latent and verification using the auxiliary latent may be performed or not performed based on a predetermined setting.
  • a plurality of auxiliary latents may be produced, transmitted and used in the verification step.
  • a method for lossy image or video encoding and transmission comprising the steps of: encoding the input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent and producing an auxiliary latent based on the extracted values; transmitting the latent representation and the auxiliary latent.
  • a method for lossy image or video receipt and decoding comprising the steps of: receiving a latent representation and an auxiliary latent transmitted by a first computer system at a second computer system, the latent representation and the auxiliary latent corresponding to an input image; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
  • a data processing system configured to perform any of the methods above.
  • a data processing apparatus configured to perform any of the methods above.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the methods above.
  • a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the methods above.
  • a method for lossy video encoding, transmission and decoding comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame.
  • the method may further comprise the step of: determining a flow between the input frame and a previous frame; wherein the information based on the previously decoded frame comprises the flow; and wherein the motion compensation in the location corresponding to the sub-section is set to zero prior to using the flow to obtain the output frame.
  • the method may further comprise the step of: obtaining a mask corresponding to the location of the sub-section.
  • the mask may be an additional input to the first trained neural network when encoding the input frame to obtain the latent representation.
  • the mask may be additionally transmitted to the second computer system; and the mask may be an additional input to the second trained neural network when decoding the latent representation to produce the output frame.
  • the sub-section of the input frame may be encoded using the first trained neural network and the latent representation may be decoded using the second trained neural network to obtain an output sub-section; and the method may further comprise the steps of: encoding the section of the frame not associated with the sub-section using a third trained neural network to obtain a second latent representation; additionally transmitting the second latent representation to the second computer system; and decoding the second latent representation using a fourth trained neural network to obtain an output section; wherein the information based on the previously decoded frame is applied to the output section; and the output sub-section and the output section are combined to obtain the output frame.
  • the method may further comprise repeating the steps of the method for a further input frame, wherein the location of the sub-section of the further input frame is different to the location of the sub-section of the input frame.
  • the method may be repeated for a plurality of further input frames such that each frame location corresponds to at least one of the plurality of sub-sections of the input frames.
  • the method may further comprise the steps of: encoding the flow using a fifth trained neural network to produce a latent flow; and at the second computer system, decoding the latent flow using a sixth trained neural network to retrieve the flow.
  • the height of the sub-section may be equal to the height of the input frame.
  • the width of the sub-section may be equal to the width of the input frame.
  • the sub-section may extend diagonally across the input frame.
  • the sub-section may comprise pixels that are not adjacent.
  • the input frame may be divided into a plurality of blocks and the sub-section comprises at least one pixel associated with each block.
  • the arrangement of pixels associated with the sub-section within each of the plurality of blocks may be the same.
  • the arrangement of pixels associated with the sub-section within at least one of the plurality of blocks may extend diagonally across the block.
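  • A sketch of how such a sub-section could be defined is given below, assuming the sub-section is described by a boolean mask selecting one wrapped diagonal within every block, and that the diagonal is shifted on successive frames so that every pixel position is eventually covered; the block size and offset scheme are illustrative assumptions.

```python
import numpy as np

def diagonal_refresh_mask(height, width, block=8, offset=0):
    """Boolean mask selecting a sub-section of the frame: within every
    block x block tile, the pixels on one (wrapped) diagonal are selected.
    Shifting `offset` on successive frames moves the sub-section so that,
    over `block` frames, every pixel position is covered at least once."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return ((ys + xs + offset) % block) == 0

# Example: the sub-section for frame t uses offset = t % block, so the
# refreshed diagonal sweeps across each block as the video plays.
mask_t = diagonal_refresh_mask(64, 64, block=8, offset=3)
```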
  • the method may further comprise the steps of: at the second computer system, identifying one or more values of the latent representation that have been affected by an error; identifying if the one or more values of the output image corresponding to the one or more values of the latent representation that have been affected by an error are located within the sub-section; replacing one or more of the one or more identified values of the output image with a replacement value, wherein the replacement value is based on one or more values from the sub-section if the identified values are not located within the sub-section.
  • the replacement value may be based on an interpolation of the one or more values from the sub-section.
  • a method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input frames to produce a first trained neural network and a second trained neural network.
  • a method for lossy video encoding and transmission comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation.
  • a method for lossy video receipt and decoding comprising the steps of: receiving a latent representation and a definition of a sub-section of an input frame transmitted by a first computer system; decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on a previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame;
  • a data processing system configured to perform the methods above.
  • a data processing apparatus configured to perform the methods above.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the methods above.
  • a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the methods above.
  • Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
  • Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
  • Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
  • Figure 4 illustrates an example of a video compression, transmission and decompression system.

DETAILED DESCRIPTION OF THE DRAWINGS
  • Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information.
  • Image and video information is an example of information that may be compressed.
  • the file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate.
  • compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file.
  • Image and video files containing image and video data are common targets for compression.
  • the input image may be represented as x.
  • the data representing the image may be stored in a tensor of dimensions H x W x C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image.
  • Each H x W data point of the image represents a pixel value of the image at the corresponding location.
  • Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device.
  • an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively.
  • the image information is stored in the RGB colour space, which may also be referred to as a model or a format.
  • colour spaces or formats include the CMYK and the YCbCr colour models.
  • the channels of an image file are not limited to storing colour information and other information may be represented in the channels.
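  • For illustration, the snippet below shows an image held in the H x W x C layout described above, with three 8-bit colour channels; the dimensions are arbitrary example values.

```python
import numpy as np

# An RGB image stored as an H x W x C tensor: three channels of 8-bit values.
H, W, C = 720, 1280, 3
image = np.zeros((H, W, C), dtype=np.uint8)
image[:, :, 2] = 255           # set the blue channel: a uniform blue image
pixel = image[100, 200]        # pixel value at row 100, column 200 -> [0, 0, 255]
```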
  • a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video.
  • Each image making up a video may be referred to as a frame of the video.
  • the output image may differ from the input image and may be represented by x̂.
  • the difference between the input image and the output image may be referred to as distortion or a difference in image quality.
  • the distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way.
  • An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art.
  • the distortion function may comprise a trained neural network.
  • the rate and distortion of a lossy compression process are related.
  • An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion.
  • Changes to the distortion may affect the rate in a corresponding manner.
  • a relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
  • Al based compression processes may involve the use of neural networks.
  • a neural network is an operation that can be performed on an input to produce an output.
  • a neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
  • Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer.
  • the one or more operations may include a convolution, a weight, a bias and an activation function.
  • Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
  • Each of the one or more operations is defined by one or more parameters that are associated with each operation.
  • the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer.
  • each of the values in the weight matrix is a parameter of the neural network.
  • the convolution may be defined by a convolution matrix, also known as a kernel.
  • one or more of the values in the convolution matrix may be a parameter of the neural network.
  • the activation function may also be defined by values which may be parameters of the neural network.
  • the parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network.
  • the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place.
  • These features that are predetermined may be referred to as the hyperparameters of the network.
  • These features are sometimes referred to as the architecture of the network.
  • a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known.
  • the initial parameters of the neural network are randomized and the first training input is provided to the network.
  • the output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced.
  • This process is then repeated for a plurality of training inputs to train the network.
  • the difference between the output of the network and the expected output may be defined by a loss function.
  • the result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function.
  • Back-propagation may be used to update the parameters of the neural network using the gradients dL/dy of the loss function.
  • a plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network.
  • this type of system, where simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression.
  • an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
  • training means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
  • the training process may comprise multiple epochs.
  • An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm.
  • During an epoch, the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
  • the training data may be divided into smaller subsets known as batches.
  • the size of a batch referred to as the batch size, may influence the training process.
  • a smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources.
  • a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
  • the learnable parameters are updated by a specified amount each time, determined by the learning rate.
  • the learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process.
  • a smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution.
  • a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
  • the training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting.
  • Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data.
  • Regularization techniques such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
  • training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters.
  • the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise.
  • the point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend on, for example, a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein.
  • end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately.
  • the loss function may be defined by the rate distortion equation.
  • λ may be referred to as a Lagrange multiplier.
  • the Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
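  • For reference, a typical form of such a rate-distortion loss, consistent with the description above, is shown below; the use of mean squared error for the distortion term is only one example of a distortion function.

```latex
\mathcal{L} \;=\; R(\hat{y}) \;+\; \lambda \, D(x, \hat{x}),
\qquad \text{e.g.}\quad D(x, \hat{x}) \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(x_i - \hat{x}_i\bigr)^{2}
```

Here R is the rate term for the quantised latent ŷ, D is the distortion between the input image x and the output image x̂, and λ is the Lagrange multiplier weighting the two terms.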
  • a training set of input images may be used.
  • An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/~xypan/research/snr/Kodak.html).
  • An example training set of input images is the IMAX image set.
  • An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download).
  • An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
  • An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1.
  • an input image 5 is provided.
  • the input image 5 is provided to a trained neural network 110 characterized by a function f acting as an encoder.
  • the encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5.
  • the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent.
  • the quantisation process transforms the continuous latent representation into a discrete quantized latent.
  • An example of a quantization process is a rounding function.
  • the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130.
  • the entropy encoding process may be for example, range or arithmetic encoding.
  • the bitstream 130 may be transmitted across a communication network.
  • the bitstream is entropy decoded in an entropy decoding process 160.
  • the quantized latent is provided to another trained neural network 120 characterized by a function g acting as a decoder, which decodes the quantized latent.
  • the trained neural network 120 produces an output based on the quantized latent.
  • the output may be the output image of the AI based compression process 100.
  • the encoder-decoder system may be referred to as an autoencoder.
  • Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution.
  • this learning may comprise learning optimal location and scale parameters of Gaussian or Laplacian distributions. In other cases, it allows the learning of more flexible latent representation distributions, which can further help to minimise the rate-distortion loss function in ways that are not intuitive or possible with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
  • although a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back-propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches.
  • For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system.
  • a straight-through estimator (STE) quantisation model or other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
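  • The sketch below illustrates the quantisation proxies discussed above, assuming a PyTorch-style tensor y; it shows true rounding, additive uniform noise quantisation, and a straight-through estimator, and is intended only as an example of the technique rather than the exact models used in the pipeline.

```python
import torch

def quantise(y, mode):
    """Illustrative quantisation proxies used during training (true rounding
    has zero or undefined gradients, so it cannot be back-propagated through)."""
    if mode == "round":                    # real quantisation, used at inference
        return torch.round(y)
    if mode == "noise":                    # additive uniform noise, differentiable everywhere
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    if mode == "ste":                      # straight-through estimator: rounds in the
        return y + (torch.round(y) - y).detach()  # forward pass, identity gradient backward
    raise ValueError(mode)
```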
  • learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes.
  • because the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving an optimal rate and distortion trade-off.
  • the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server.
  • the decoder 120 may be located on a separate device which may be referred to as a recipient device.
  • the system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
  • the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process.
  • the hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder and a trained neural network 125 acting as a hyper-decoder.
  • An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above.
  • the neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110.
  • the hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation.
  • the hyper-latent is then quantized in a quantization process 145 characterised by Q h to produce a quantized hyper-latent.
  • the quantization process 145 characterised by Q h may be the same as the quantisation process 140 characterised by Q discussed above.
  • the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135.
  • the bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent.
  • the quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder.
  • the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115.
  • the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100.
  • the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
  • only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150.
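  • The sketch below illustrates, under simplifying assumptions, how a hyper-decoder output might supply the mean and scale of a per-element probability model for the latent and how an estimated rate could be computed from it. The layer shapes, the assumption that the hyper-latent shares the latent's spatial resolution, and the discretised Gaussian model are illustrative choices; a real pipeline would pair this with an actual range or arithmetic coder.

```python
import torch
import torch.nn as nn

# Illustrative hyper-decoder: maps a 64-channel hyper-latent to a mean and a
# (log) scale for each element of a 192-channel latent of the same spatial size.
hyper_decoder = nn.Sequential(nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(192, 2 * 192, 3, padding=1))

def estimated_rate_bits(y_hat, z_hat):
    params = hyper_decoder(z_hat)               # same computation on encode and decode side
    mu, log_scale = params.chunk(2, dim=1)      # per-element mean and (log) scale
    scale = torch.exp(log_scale).clamp(min=1e-6)
    gauss = torch.distributions.Normal(mu, scale)
    # probability mass of the quantisation bin [y_hat - 0.5, y_hat + 0.5]
    p = gauss.cdf(y_hat + 0.5) - gauss.cdf(y_hat - 0.5)
    return (-torch.log2(p.clamp(min=1e-9))).sum()
```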
  • Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100.
  • at least one of the latent and the hyper latent may be converted to a residual value before the entropy encoding process 150,155 is performed.
  • the residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper latent.
  • the residual values may also be normalised.
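  • A minimal sketch of such a residual transformation is shown below, assuming the mean and standard deviation are available from the entropy model; the function names are illustrative.

```python
import numpy as np

# Illustrative transformation of a latent into a normalised residual before
# entropy coding, and the inverse applied after entropy decoding.
def to_residual(latent, mu, sigma):
    return (latent - mu) / sigma

def from_residual(residual, mu, sigma):
    return residual * sigma + mu
```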
  • a training set of input images may be used as described above.
  • the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step.
  • the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
  • the training process may further include a generative adversarial network (GAN).
  • an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake.
  • the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input.
  • a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
  • the output image 6 may be provided to the discriminator.
  • the output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
  • the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
  • Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
  • the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6.
  • Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination.
  • Hallucination is the process of adding information in the output image 6 that was not present in the input image 5.
  • hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120.
  • the hallucination performed may be based on information in the quantized latent received by decoder 120.
  • a video is made up of a series of images arranged in sequential order.
  • the AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
  • the frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process.
  • frames which are decoded using no information from other frames may be referred to as I-frames.
  • Frames which are decoded using information from past frames may be referred to as P-frames.
  • Frames which are decoded using information from past frames and future frames may be referred to as B-frames.
  • Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
  • the images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame.
  • the transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation.
  • the flow may be represented by a vector.
  • One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
  • An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3.
  • the process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
  • an input image x_0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204.
  • the bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂_0 which in this case is an I-frame.
  • the decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂_0 to be available for later use by components of both the encoding and decoding sides of the pipeline.
  • the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image x_t that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂_{t-1} from an earlier frame t-1.
  • the previously reconstructed x̂_{t-1} is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
  • previously reconstructed images may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames.
  • both the image being compressed x_t and the previously reconstructed image from an earlier frame x̂_{t-1} are passed into a flow module part 206 of the pipeline.
  • the flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂_{t-1} and x_t, which is indicative of inter-frame movement of pixels or pixel groups between x̂_{t-1} and x_t.
  • the latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208.
  • the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f̂.
  • the reconstructed flow map f̂ is applied to the previously reconstructed image x̂_{t-1} to generate a warped image x̂_{t-1,w}.
  • any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression.
  • the warped image x̂_{t-1,w} is a prediction of how the previously reconstructed image x̂_{t-1} might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of x_t and x̂_{t-1}.
  • the reconstructed flow map f̂ and corresponding warped image x̂_{t-1,w} may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
  • both the image being compressed x_t and the warped image x̂_{t-1,w} are passed into a residual module part 210 of the pipeline.
  • the residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image x_t and the warped image x̂_{t-1,w}.
  • the latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted.
  • the bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r̂ from the decoded latent representation.
  • a residual map may first be pre-calculated between x_t and x̂_{t-1,w} and the pre-calculated residual map may be passed into an autoencoder for compression only.
  • This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function.
  • the residual map r̂ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂_t which is a reconstruction of image x_t and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂_t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress x_{t+1}, and so on until an entire video stream or chunk of a video stream has been processed.
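  • The following is a minimal sketch of the warp-and-add reconstruction described above, assuming a PyTorch-style setup, a flow map stored in pixel units with (dx, dy) channel ordering, and simple bilinear warping; the actual pipeline may use a different warping operator and combination rule.

```python
# Illustrative sketch (assumed shapes and names): warp the previous reconstruction
# with the decoded flow map, then add the decoded residual to obtain the P-frame.
import torch
import torch.nn.functional as F

def warp(prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Bilinearly warp prev (N,C,H,W) with flow (N,2,H,W) given in pixel units."""
    n, _, h, w = prev.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # normalise the sampling grid to [-1, 1] as required by grid_sample
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(prev, grid, mode="bilinear", align_corners=True)

def reconstruct_p_frame(prev_recon, flow_hat, residual_hat):
    warped = warp(prev_recon, flow_hat)   # warped prediction x̂_{t-1,w}
    return warped + residual_hat          # x̂_t = warped prediction + residual
```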
  • the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image.
  • any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2.
  • the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable.
  • the above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂_{t-1}) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
  • Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400.
  • the system 400 comprises a first device 401 and a second device 402.
  • the first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices.
  • the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively. That is, the compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs.
  • the CPU of first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU.
  • the first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene 404.
  • the video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames.
  • the blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3.
  • the output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
  • the second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3.
  • the output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
  • the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which requires the cumulative latency of the encode and decode sides to be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
  • execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods.
  • Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time.
  • efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
  • prior to the transmission of the latent representation or quantized latent across the communication network, the latent representation is entropy encoded to produce the bitstream. If an error occurs during transmission of the bitstream between two computer systems, such as the loss of a packet, at least a part of the section of the bitstream that comprised the lost packet may no longer be able to be entropy decoded. This means that one or more values of the latent representation will be affected by the error in transmission. For example, the one or more values of the latent representation may not be recovered from the entropy decoding process.
  • Such a latent representation may still be decoded by a neural network acting as a decoder to obtain an output image or frame as discussed above. However, the distortion between the retrieved output image or frame and the corresponding input image or frame may be increased.
  • one or more of the identified values in the latent representation may be replaced by a replacement value.
  • the latent representation may be decoded by the neural network acting as a decoder as described above. This may decrease the distortion of the retrieved output image compared to the input image.
  • every value of the latent representation corresponding to this section of the bitstream may be identified as containing the error.
  • One or more of these values may be replaced by a replacement value before decoding by the neural network acting as a decoder to obtain the output image or frame.
  • the latent representation may be split into a plurality of sub-sections and each sub-section may be separately entropy encoded to produce the bitstream 130. This results in a bitstream with sections corresponding to each of the plurality of sub-sections of the latent representation.
  • the tensor defining the latent representation can be split in any fashion to generate the corresponding sections of the bitstream. For example, splitting along the column or channel dimensions is equally acceptable.
  • the latent may be flattened into a one-dimensional tensor and split into arbitrary chunks. If an error occurs during transmission of the bitstream, such as the loss of a packet, only the corresponding sub-section of the latent representation is affected.
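  • The following sketch illustrates, under assumed names and a simple chunking scheme, how a latent can be flattened and split into independently coded chunks so that a lost packet only affects one chunk, which can then be filled with a replacement value.

```python
# Sketch (assumed chunking scheme): split a latent tensor into independently
# entropy-coded chunks so a lost packet only corrupts the matching sub-section.
import numpy as np

def split_latent(latent: np.ndarray, num_chunks: int):
    """Flatten the latent and split it into roughly equal one-dimensional chunks."""
    return np.array_split(latent.reshape(-1), num_chunks)

def merge_latent(chunks, shape, error_mask=None, replacement=0.0):
    """Reassemble chunks; chunks flagged in error_mask are filled with a replacement value."""
    restored = []
    for i, chunk in enumerate(chunks):
        if error_mask is not None and error_mask[i]:
            restored.append(np.full_like(chunk, replacement))
        else:
            restored.append(chunk)
    return np.concatenate(restored).reshape(shape)

# Example: a latent of shape (C, H, W) split into 8 chunks; chunk 3 lost in transit.
latent = np.random.randn(4, 16, 16).astype(np.float32)
chunks = split_latent(latent, 8)
recovered = merge_latent(chunks, latent.shape, error_mask=[i == 3 for i in range(8)])
```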
  • the replacement value may be selected from a number of possible different replacement values depending on the nature of the error and the training of the networks acting as encoders and decoders in the AI compression pipeline.
  • the input image or frame and the quantization process used during encoding may define a probability distribution of a respective latent representation, as discussed above.
  • the probability distribution defines all of the possible values that the values of the latent representation may take.
  • the replacement value used may be selected from the possible values of the probability distribution.
  • the replacement value may be a random sample of the probability distribution defining the latent representation.
  • the same replacement value may be used for each value identified as containing an error in the latent representation.
  • different replacement values may be used.
  • the value selected from the probability distribution may be the modal or most likely value of the probability distribution.
  • the replacement value may be a zero value.
  • the replacement value may be selected such that it does not correspond to any possible value of the probability distribution associated with the latent representation.
  • the replacement value may be considered to be an error indicator. If the AI compression pipeline is trained with this method, the neural network acting as a decoder may interpret the replacement value as an error indicator, causing a certain process to be performed by the neural network acting as a decoder to obtain the output image or frame.
  • the replacement value used may be based on one or more of the previously decoded frames. For example, the location of the error values within the latent representation may be identified, and the values at the same location for one or more of the previously received latent representation corresponding to the previous frames may be used to determine the replacement value. For example, the values of the latent representation of the previously decoded frame may be used as the replacement value at corresponding locations.
  • a flow value based on the current frame and a previous frame may be applied to the previous frame or previous latent representation before the replacement value is obtained.
  • an AI based process may be used to obtain the replacement value.
  • a neural network may receive one or more previously received frames or corresponding latent representation and optionally a mask indicating the location of the values identified as having an error in the current latent representation. The output of this neural network may be then used as the replacement values in the processes described above.
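  • The following sketch illustrates two of the replacement strategies described above (the modal value of the entropy model, and co-located values from a previous frame's latent); the function names and the assumption of a Gaussian-style entropy model whose mode equals its mean are illustrative only.

```python
# Sketch of two replacement strategies (assumed names and entropy model): fill flagged
# latent values with the mode of the entropy model, or copy co-located values from
# the previously decoded frame's latent.
import numpy as np

def replace_with_mode(latent, error_mask, mu):
    """For a quantised Gaussian/Laplacian entropy model the mode is (the rounded) mean mu."""
    out = latent.copy()
    out[error_mask] = np.round(mu[error_mask])
    return out

def replace_with_previous(latent, error_mask, prev_latent):
    """Copy values at the same locations from the previous frame's latent."""
    out = latent.copy()
    out[error_mask] = prev_latent[error_mask]
    return out
```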
  • the output image or frame obtained from a latent representation containing replacement values may be used to obtain future frames of the video in a decompression process.
  • the AI compression pipeline may comprise a hyper-network where the latent representation is encoded by a neural network acting as an encoder to obtain a hyper-latent representation.
  • This hyper-latent representation may be decoded by a neural network acting as a decoder to obtain an output that may be used in the decoding process in various ways.
  • a packet loss during transmission of the bitstream may result in at least a section of the received hyper-latent representation comprising an error value.
  • where the hyper-latent representation provides information to be used to entropy decode the latent representation, the section of the hyper-latent representation identified as containing an error will correspond to an area of the corresponding entropy decoded latent representation.
  • the corresponding values of the latent representation may be replaced with replacement values.
  • the method described above may be used without any specific training of the AI compression pipeline. This means the AI compression pipeline may be trained without errors or replacement values being introduced. The process of identifying errors and inserting replacement values may then be introduced when the trained AI compression pipeline is in use.
  • the process described above may be introduced prior to training of the AI compression pipeline taking place.
  • errors may be artificially introduced into latent representations and corrected by replacement values during the training process of the AI compression pipeline described in detail above.
  • AI-based compression pipelines may incorporate flow processes when compressing, transmitting and decompressing a video comprising multiple frames.
  • the compression process may be divided into two parts.
  • a flow may be determined between the current frame to be compressed and a previous frame. This flow represents the relative motion between the present frame and the previous frame.
  • a residual between the current frame and the previous frame may be determined. The residual represents the change between the current frame and the previous frame that is not represented by the flow between the frames.
  • the residual may be determined by subtracting the present frame from a warped frame obtained by applying the determined flow to the previous frame.
  • the determined flow and residual may each be encoded by a corresponding neural network acting as an encoder to obtain a latent flow and a latent residual. These latents may subsequently be quantized, entropy encoded and transmitted as a bitstream in a similar manner to that described above for a single image or frame.
  • Each of the flow and residual may then be decoded by a corresponding neural network acting as a decoder at the receiver end. The flow may be applied to a previously decoded frame and then the residual applied to obtain an output frame that is an approximation of the input frame.
  • a hyper network may be added to each of the flow and residual compression pipelines discussed above, resulting in the generation of a hyper-latent flow and hyper-latent residual.
  • a hyper-hyper network may be added to each of the hyper network pipelines, resulting in the generation of a hyper-hyper latent flow and hyper-hyper latent residual. In each of these examples, the addition of each pipeline therefore results in a corresponding section of the bitstream for each latent.
  • the bitstream corresponding to a single frame will comprise four sections: a latent flow section, a hyper-latent flow section, a latent residual section and a hyper-latent residual section. If a hyper-hyper network is present in each of the flow and residual pipelines, the bitstream will comprise six sections.
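  • Purely as an illustration of how these sections might be organised so that a lost packet can be attributed to a specific latent at the decoder, a hypothetical per-P-frame container could look as follows; the field names are assumptions, not the format used by the pipeline.

```python
# Hypothetical per-P-frame bitstream container (assumed names): one section per latent,
# so a missing section can be identified and mitigated individually at the decoder.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PFrameBitstream:
    latent_flow: Optional[bytes]
    hyper_latent_flow: Optional[bytes]
    latent_residual: Optional[bytes]
    hyper_latent_residual: Optional[bytes]

    def missing_sections(self):
        # Report which sections were lost in transit (None = not received).
        return [name for name, value in vars(self).items() if value is None]
```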
  • the collection of the latents related to the flow may be referred to as a set of flow parameters.
  • the collection of latents related to the residual may be referred to as a set of residual parameters.
  • if an error such as a packet loss occurs in the transmission of the bitstream, this may result in the loss of one or more of the sections of the bitstream described above.
  • a packet loss may cause an error in the flow latent for a particular frame, but the other sections, for example at least the residual latent, may be received without an error.
  • if the hyper and hyper-hyper components of the set of flow and residual parameters are present, one or more of these may also be received without an error.
  • an error mitigation process may additionally be performed at the decoding stage based on which of the set of flow or residual parameters has been received. This error mitigation process may change based on what parameters have been received.
  • the error mitigation process may comprise applying the output flow to a previously decoded output frame corresponding to the previously decoded frame of the video to obtain the output frame.
  • the output frame may have increased distortion compared to the case where the latent residual is available at the decoding stage, but the distortion of the output frame may still be improved compared to performing no error mitigation.
  • an estimated latent residual may be used during the decoding process. The estimated residual may be based on one or more previously received latent residuals. A neural network which receives a plurality of previously received latent residuals as an input may be used to obtain the estimated residual latent as an output.
  • the error mitigation process may comprise applying a previously decoded output flow corresponding to the previously decoded frame to obtain the output frame.
  • an estimated flow may be used during the decoding process. The estimated flow may be based on a plurality of previously decoded output flows. For example, the change between the previous two output flows may be applied to the previous output flow to obtain an estimated flow.
  • a neural network which receives the plurality of previously decoded output flows as an input may be used to obtain the estimated flow as an output.
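  • A minimal sketch of the flow extrapolation described above, assuming a constant-velocity model and flows stored as arrays, is:

```python
# Sketch (assumed constant-velocity model): if the flow for frame t is lost, estimate
# it by applying the change between the two previously decoded flows to the latest one.
import numpy as np

def estimate_flow(flow_t_minus_1: np.ndarray, flow_t_minus_2: np.ndarray) -> np.ndarray:
    return flow_t_minus_1 + (flow_t_minus_1 - flow_t_minus_2)
```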
  • Different error mitigation processes may be applied in a corresponding way if an error is identified in any one of the hyper-latent flow, the hyper-latent residual, the hyper-hyper latent flow and the hyper-hyper latent residual as components of the set of flow parameters and the set of residual parameters.
  • the error mitigation process may vary based on which of the set of parameters has or has not been received.
  • the error mitigation process may be applied to a sub-section of at least one of the set of flow and residual parameters. For example, if the error is identified in one of the hyper or hyper-hyper latents, the values with an error may only correspond to a sub-section of the residual latent or the flow latent. In this case, the examples of the error mitigation process described above may only be applied to the corresponding sub-section of the latent.
  • the decoding process performed may be varied based on which of the set of flow and residual parameters is received.
  • a set of neural networks acting as a decoder may have been separately trained for each of the different error cases that are possible based on which combination of the flow parameters and residual parameters is received without error.
  • the corresponding neural network to be used as a decoder may be selected.
  • This process of selecting the decoder from a set of possible decoders may be performed separately in each of the main, hyper and hyper-hyper pipeline for each trained neural network that is used as a decoder in that pipeline.
  • the methods described above may be used without any specific training of the AI compression pipeline. This means the AI compression pipeline may be trained without an error mitigation process as described above being performed. The error mitigation process may then be introduced to the trained AI compression pipeline when the pipeline is in use.
  • an auxiliary latent represents additional information that may be transmitted along with the latent representation, and may contain information that is redundant with the latent representation.
  • the auxiliary latent may contain values based on the values of the latent representation.
  • one or more of the values of the auxiliary latent may be based on a linear or non-linear combination of one or more values of the latent representation.
  • one or more of the values of the auxiliary latent may have been determined by the sum or the subtraction of a plurality of values of the latent representation.
  • Each value of the auxiliary latent may be based on plurality of values of the latent representation. In this case, the size or dimensions of the auxiliary latent may be reduced compared to that of the corresponding latent representation.
  • the auxiliary latent may be derived by applying a kernel to the latent representation.
  • the kernel may be a non-overlapping kernel to the latent representation that applies a dimensionality reduction.
  • the kernel may sum all values across the rows of the latent representation for one channel and one column.
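  • As a concrete illustration of such a summing kernel, assuming a latent laid out as (channels, height, width), an auxiliary latent of per-column sums can be computed as follows; the layout is an assumption.

```python
# Sketch (assumed layout: latent of shape (C, H, W)): an auxiliary latent formed by
# summing all values across the rows of each column, per channel, giving shape (C, W).
import numpy as np

def auxiliary_latent(latent: np.ndarray) -> np.ndarray:
    return latent.sum(axis=1)   # collapse the row (height) dimension
```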
  • There may be redundancy between different values of the auxiliary latent. This, for example, may mean that one or more of the values of the latent representation are each used to determine more than one of the values of the auxiliary latent. This means that information on each value of the latent representation is contained in more than one value of the auxiliary latent. Alternatively, there may be no redundancy in the values of the auxiliary latent. This may mean that one or more of the values of the latent representation is each only used in the determination of one of the values of the auxiliary latent.
  • a plurality of auxiliary latents may be obtained from one latent representation. Redundancy may exist between the values of the plurality of auxiliary latents in the same way as for a single auxiliary latent.
  • the auxiliary latents may be derived using the same kernel as discussed above, but where the application of the kernel to the latent representation is offset for each iteration.
  • the auxiliary latent may be entropy encoded and transmitted together with the corresponding latent representation during operation of the AI compression pipeline.
  • the bitstream therefore comprises a section corresponding to the latent representation and a section corresponding to the auxiliary latent.
  • the auxiliary latent may undergo a further compression process before transmission.
  • the auxiliary latent may undergo an algorithm based compression process.
  • the auxiliary latent may be encoded by a neural network acting as an encoder to obtain a hyper auxiliary latent.
  • This hyper auxiliary latent may be transmitted and decoded by a neural network acting as a decoder to retrieve the auxiliary latent at the recipient end.
  • the compression of the auxiliary latent may be a lossy or lossless compression process.
  • the auxiliary latent may be used to verify the values of the latent representation to ensure that no error has been introduced into the values of the latent representation. This may mean analyzing the values of the latent representation using the values of the auxiliary latent to determine that the values of the latent representation are correct. This analysis may be based on the relationship between the values of the auxiliary latent and the values of the latent representation that was used during the encoding process.
  • the verification step may comprise solving the equation defined by the linear relationship to determine that the values of the latent representation are correct.
  • the verification process may involve the solving of simultaneous equations based on the multiple linear relationships between each of the values of the auxiliary latent and value of the latent representation.
  • the auxiliary latent may further be used to correct the value of the latent representation that is determined to have an error.
  • the system of linear equations defined by the values of the auxiliary latent may be solved to derive the correct value of the latent representation.
  • the value determined to have an error then may be replaced with the correct value.
  • a packet loss during transmission of the latent representation may result in a row of values in a channel of the latent representation being lost. If an auxiliary latent has been received which comprises values of the sum of the columns of that channel of the latent representation, each of the lost values of the latent representation may be recovered by solving the system of equations defined by the auxiliary latent that include the missing values.
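  • The following sketch works through the example above under assumed shapes: one row of a channel is lost, the auxiliary latent holds the per-column sums of that channel, and each lost value is recovered as the column sum minus the surviving values in that column.

```python
# Sketch (assumed case): a whole row of one latent channel is lost and the auxiliary
# latent stores the per-column sums of that channel; solving the trivial system of
# equations recovers each lost value exactly.
import numpy as np

def recover_lost_row(channel: np.ndarray, lost_row: int, column_sums: np.ndarray) -> np.ndarray:
    surviving = np.delete(channel, lost_row, axis=0).sum(axis=0)
    repaired = channel.copy()
    repaired[lost_row] = column_sums - surviving
    return repaired

# Example: a channel of shape (H, W) with row 5 lost in transit.
channel = np.random.randint(-4, 5, size=(16, 16)).astype(np.float32)
sums = channel.sum(axis=0)          # transmitted auxiliary latent
corrupted = channel.copy()
corrupted[5] = 0                    # values lost in transit
restored = recover_lost_row(corrupted, 5, sums)
assert np.allclose(restored, channel)
```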
  • Further information available at the decoding step may be used when determining the replacement value.
  • information provided by the hyper network at the decoding steps may additionally be used to determine a replacement value.
  • the hyper network provides information about the probability distribution of the values that may be found in the latent representation.
  • the information available about the probability distribution may be used to determine the replacement value.
  • the mean and the standard deviation of the probability distribution may be used in combination with the available information in the auxiliary latent to arrive at a more accurate estimate of the replacement value than would otherwise be possible.
  • the parameters provided by the hyper-network that may be used to entropy encode and entropy decode the latent representation for transmission may also be used to entropy encode and entropy decode the auxiliary latent. This is because the dependence of the values of the auxiliary latent on the values of the latent representation results in a relationship between the probability distribution associated with the values of the latent representation and the probability distribution associated with the values of the auxiliary latent, namely that the same parameters can be used for each.
  • Applying the process described above to replace values determined to have an error in the latent representation with corrected values derived from the auxiliary latent may also be used to recover the lost packet corresponding to the error.
  • the bitstream may be entropy decoded to obtain the latent representation. This means that the individual packets making up the bitstream are entropy decoded in turn in order to retrieve the latent representation.
  • the entropy decoding process may be reversed to retrieve the lost packet.
  • the entropy decoding process may then be continued using the retrieved packet to correctly obtain the remainder of the latent representation.
  • the presence of the auxiliary latent in the bitstream may require additional bitrate to transmit an image or frame.
  • the decision to include an auxiliary latent in the compression and transmission of a particular image or frame may be made based upon a predetermined setting.
  • the AI based compression pipeline may be set in a non-error state where the auxiliary latent is not determined or transmitted.
  • the pipeline may be set in an error resilient state where the auxiliary latent is derived and transmitted.
  • the number or redundancy of the auxiliary latents derived may vary based upon the predetermined setting. For example, the number of auxiliary latents that are derived and transmitted may be increased if the increased error resiliency due to the increased number of auxiliary latents is determined to be more important than the increased bitrate needed to transmit the auxiliary latents.
  • I-frames and P-frames are discussed above.
  • a video comprising one or more I and P frames is compressed, transmitted and decompressed using an AI based compression pipeline such as the example given above.
  • no information from previously decoded frames (sometimes referred to as temporal context) is used to compress, transmit and decompress an I-frame.
  • information from previously decoded frames is used to compress, transmit and decompress a P-frame.
  • each frame may be considered a hybrid frame comprising both I and P components.
  • a particular region or sub-section of the input frame may be defined.
  • the sub-section comprises one or more pixels of the input frame.
  • This region or sub-section may be compressed, transmitted and decompressed without using temporal context. This means that, where a flow is determined between the input frame and a previous frame, any information that may be derived from that flow relating to the sub-section of the frame is not used during the decoding of the frame to obtain the output frame.
  • the sub-section of the frame may be considered equivalent to an I-frame.
  • At least some or all of the remaining section of the frame may be compressed, transmitted and decompressed using temporal context, for example a flow, as discussed above in the example of an AI based video compression pipeline.
  • the remaining section of the frame may therefore be considered equivalent to a P-frame.
  • This approach may be applied in an AI based video compression pipeline as described above.
  • a sub-section of the input frame to an AI based video compression pipeline may be identified before encoding the input frame using a trained neural network.
  • this sub-section may be decoded without the use of any information based on previously decoded frames, for example a flow between the input frame and a previous frame or a warped frame derived from a flow and a previously decoded frame as discussed above.
  • the flow information corresponding to the location of the sub-section of the input frame may not be used when decoding the frame.
  • the region of the warped or predicted frame corresponding to the location of the sub-section may be set to zero prior to being used to obtain the output frame.
  • this may mean that all of the information required to obtain the sub-section of the output frame is contained within the residual.
  • a mask may be used to identify the location of the sub-section.
  • the mask may be a tensor of equivalent dimensions to the input frame.
  • a section of the mask at a location corresponding to the sub-section of the input frame may contain values to indicate the location of the sub-section.
  • the mask may have zero values at the location corresponding to the sub-section and unity values at every other location. Such an arrangement may allow the mask to be applied to the warped frame or any other output to remove information at a location corresponding to the location of the sub-section.
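  • A minimal sketch of such a mask and its application to the warped prediction, assuming a row-aligned sub-section and array-valued frames, is:

```python
# Sketch (assumed shapes): a mask that is zero over the intra-coded sub-section and
# one elsewhere, applied to the warped/predicted frame so the sub-section of the
# output is reconstructed from the residual alone.
import numpy as np

def build_mask(height, width, rows):
    """Zero out the given rows (the sub-section); keep everything else."""
    mask = np.ones((height, width), dtype=np.float32)
    mask[rows, :] = 0.0
    return mask

def combine(warped, residual, mask):
    # Where mask == 0 the warped prediction is discarded, so the residual carries all
    # of the information for that sub-section.
    return warped * mask + residual
```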
  • the mask may additionally be used as an input to one or more of the neural networks operating in the Al based video compression pipeline.
  • the mask may be an additional input to the trained neural network acting as an encoder of the input frame to obtain a latent representation, or any other trained neural network on the encoding side of the pipeline.
  • the mask may be transmitted along with the bitstream to the receiving computer system.
  • the mask may be used as an input to the neural network which decodes the latent representation to obtain the output frame, or any other trained neural network on the decoding side of the pipeline.
  • the mask may be additionally compressed, transmitted and decompressed, for example using an AI based compression pipeline such as that discussed above.
  • the identified sub-section of the input frame and at least some or all of the remaining section of the input frame may be transmitted using separate pipelines of an AI based compression process.
  • the identified sub-section of the input frame may be compressed, transmitted and decompressed using an I-frame part of an AI based compression process as described above.
  • the at least some or all of the remaining section of the input frame may be compressed, transmitted and decompressed using the P-frame part of an AI based compression process as described above.
  • the output sub-section obtained from the I-frame part and the output section obtained from the P-frame part may be combined after decompression to obtain a final output frame.
  • the mask described above may be used to identify the sections of each frame to be assigned to each part of the AI based compression pipeline.
  • the mask may also be used during the step of combining the output sections to obtain the output frame.
  • the mask may be used to determine the location of each of the output sections within the output frame.
  • the location of the subsection of the frame may vary between different frames of a video.
  • a group of frames of a video may be referred to as a batch.
  • the location of the sub-section may change between each frame such that the entire area of the frame has been located within the sub-section at least once in the batch of frames. Compression, transmission and decompression of a batch of frames in this way may result in better compression performance as discussed above. This is because, as compression of frames without temporal context (I-frames) typically requires a higher bit-rate than compression with temporal context (P-frames), determining a sub-section of each frame to not use temporal context may result in a stable average bitrate per frame, which leads to better performance.
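  • One hypothetical way to cycle the sub-section through a batch of frames is a horizontal stripe that moves down by one stripe height per frame, so every row is intra-coded at least once per batch; the scheme below is illustrative only.

```python
# Sketch (assumed scheme): cycle a horizontal stripe through the frame so every row
# falls inside the intra-coded sub-section at least once per batch of frames.
import numpy as np

def stripe_masks(height, width, batch_size):
    stripe = int(np.ceil(height / batch_size))
    masks = []
    for i in range(batch_size):
        mask = np.ones((height, width), dtype=np.float32)
        mask[i * stripe:(i + 1) * stripe, :] = 0.0   # sub-section for frame i
        masks.append(mask)
    return masks
```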
  • the above description refers to the compression, transmission and decompression of frames.
  • the discussion may also apply to the compression, transmission and decompression of residuals of frames as described above.
  • the shape or area of the sub-section of the frame may take different forms.
  • the sub-section may extend across the entire height or the entire width of the input frame. When the location of the sub-section changes across different frames of a group of frames, the location of the sub-section may scan horizontally or vertically across the frame.
  • the area defined by the sub-section may extend diagonally across at least part of the frame.
  • one or more input frames may be divided into a plurality of blocks.
  • Each block may be the same size or one or more of the blocks may vary in size for one or more frames.
  • the sub-section of the frame may be defined by an area in one or more of the defined blocks.
  • the arrangement of the area of the sub-section in each block may be the same.
  • the area of the sub-section in each block may be one or more of the horizontal, vertical and diagonal shapes discussed above, and may additionally extend across each block in the same manner.
  • the received latent representation may be missing sections of data.
  • the presence of the sub-section may allow lost data to be retrieved. This is because the bitstream corresponding to the sub-section and the bitstream corresponding to the remaining section of the frame may be located within different regions of the bitstream corresponding to the entire frame. For example, if the lost data is entirely located within the sub-section, information from the remaining section may be used to retrieve the information in the sub-section. Alternatively, if the lost data is located entirely in the remaining section, information from within the sub-section may be used to retrieve the lost data.
  • the lost data may be retrieved by a number of methods used for the prediction of values within a frame. For example, interpolation may be performed using information from the sub-section or the remaining section to obtain replacement values which are predictions of the original values that have been lost.
  • the replacement values may be inserted into the output frame to replace the lost values.
  • the location of the sub-section as described above may be selected to increase the effectiveness of this error replacement.
  • the location of the sub-section may be distributed throughout the frame such that one or more pixels of the sub-section are not adjacent. Interleaving the sub-section in this way may increase the chance that a more accurate replacement value can successfully be derived.
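  • As an illustration, assuming the sub-section is interleaved as every other row and those rows are lost, each missing pixel could be predicted by averaging the received rows above and below it; this simple averaging is an assumption, not the specific prediction method of the pipeline.

```python
# Sketch (assumed interleaving: the lost sub-section occupies every other row): each
# missing pixel is predicted by averaging the correctly received rows above and below.
import numpy as np

def interpolate_missing_rows(frame: np.ndarray, missing_rows) -> np.ndarray:
    out = frame.copy()
    h = frame.shape[0]
    for r in missing_rows:
        above = frame[r - 1] if r > 0 else frame[r + 1]
        below = frame[r + 1] if r < h - 1 else frame[r - 1]
        out[r] = 0.5 * (above + below)
    return out
```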
  • the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus.
  • the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.
  • data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program may, but need not, correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • Computers suitable for the execution of a computer program can, by way of example, be based on general or special purpose microprocessors or both, or any other kind of central processing unit.
  • a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.


Abstract

A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.

Description

Method and data processing system for lossy image or video encoding, transmission and decoding
BACKGROUND
This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted.
To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that are not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations.
A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest is inferred from information derived from temporal correlations.
In the realm of lossy video compression in particular, the redundant temporally correlated information in a video sequence is known as inter-frame redundancy. One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning its encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality.
I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video.
P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since B-frames are not used as references for other frames.
Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
An example of an AI based image compression process comprising a hyper-network is described in Balle, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv: 1802.01436 (2018), which is hereby incorporated by reference.
An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference.
A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Balle, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference.
SUMMARY
According to the present invention, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
Prior to the step of transmitting the latent representation, the latent representation may be divided into a plurality of sub-latents; and the plurality of sub-latents are transmitted to the second computer system; wherein the step of identifying one or more values of the latent representation that have been affected by an error comprises identifying one or more of the plurality of sub-latents containing the one or more values that have been affected by an error; and the step of replacing one or more of the identified values of the latent representation with a replacement value comprises replacing each of the values of the identified sub-latents with the replacement value. The replacement value may be selected from a set of values that correspond to the possible values of a probability distribution associated with the latent representation.
The replacement value may be the modal value of the probability distribution associated with the latent representation. The replacement value may be a zero value.
The replacement value may not be a possible value of a probability distribution associated with the latent representation.
When the method is used for compression, transmission and decompression of a video, the replacement value may be based on at least one previously decoded frame of the video. The replacement value may correspond to a value of a previously received latent representation corresponding to the at least one previously decoded frame of the video; wherein the location of the value of the previously received latent representation corresponds to the location of the identified value of the latent representation.
A flow may be applied to the previously received latent representation to obtain the replacement value.
The replacement value may be obtained from the output of a trained neural network that receives at least one of the previously decoded frames of the video and/or the corresponding previously received latent representation as an input. When the method is used for compression, transmission and decompression of a video, the output frame obtained from the latent representation comprising at least one replacement value may be used to compress, transmit and decompress a further frame of the video.
According to the present invention, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; encoding the latent representation using a second trained neural network to produce a hyper-latent representation; entropy encoding the latent representation and the hyper-latent representation to obtain a bitstream; transmitting the bitstream to a second computer system; entropy decoding the bitstream to obtain the hyper-latent representation; identifying one or more values of the hyper-latent representation that have been affected by an error; decoding the hyper-latent representation using a third trained neural network to produce an output; entropy decoding the bitstream using the output of the third trained neural network to obtain the latent representation; replacing one or more of the values of the latent representation that correspond to the identified values of the hyper-latent representation with a replacement value; and decoding the latent representation using a fourth trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; replacing one or more of the values of the latent representation with a replacement value; and decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
According to the present invention, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation transmitted by a first computer system at a second computer system, the latent representation corresponding to an input image; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a first trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention, there is provided a method for lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first trained neural network to produce a set of residual parameters; encoding the flow using a second trained neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
The set of residual parameters may comprise a latent residual and the set of flow parameters may comprise a latent flow and the method may further comprise the steps of: decoding the latent residual using a third trained neural network to obtain an output residual; and decoding the latent flow using a fourth trained neural network to obtain an output flow; wherein the output residual and the output flow are used to obtain the output frame.
When the error is identified in the latent residual, the error mitigation process may comprise applying the output flow to a previously decoded output frame corresponding to the previously decoded frame of the video to obtain the output frame.
When the error is identified in the latent residual, the error mitigation process may comprise obtaining the output frame as the output of a function that receives the output flow and one or more previously decoded output frames as an input.
When the error is identified in the latent flow, the error mitigation process may comprise applying a previously decoded output flow corresponding to the previously decoded frame to obtain the output frame. When the error is identified in the latent flow, the error mitigation process may comprise applying an estimated output flow to obtain the output frame, wherein the estimated output flow is obtained based on a plurality of previously decoded output flows.
The method may further comprise the steps of: encoding the latent residual using a fifth trained neural network to produce a hyper-latent residual, wherein the hyper-latent residual is included in the set of residual parameters; encoding the latent flow using a sixth trained neural network to produce a hyper-latent flow, wherein the hyper-latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-latent residual using a seventh trained neural network to obtain an output and using the output to obtain the latent residual; and decoding the hyper-latent flow using an eighth trained neural network to obtain an output and using the output to obtain the latent flow.
When the error is identified in the hyper-latent residual, the error mitigation process may comprise applying the hyper-latent flow and the latent flow to obtain the output frame.
When the error is identified in the hyper-latent flow, the error mitigation process may comprise applying the hyper-latent residual and the latent residual to obtain the output frame.
The method may further comprise the steps of: encoding the hyper-latent residual using a ninth trained neural network to produce a hyper-hyper-latent residual, wherein the hyper-hyper-latent residual is included in the set of residual parameters; encoding the hyper-latent flow using a tenth trained neural network to produce a hyper-hyper-latent flow, wherein the hyper-hyper-latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-hyper-latent residual using an eleventh trained neural network to obtain an output and using the output to obtain the hyper-latent residual; and decoding the hyper-hyper-latent flow using a twelfth trained neural network to obtain an output and using the output to obtain the hyper-latent flow. When the error is identified in the hyper-hyper-latent residual, the error mitigation process may comprise applying the hyper-hyper-latent flow, the hyper-latent flow and the latent flow to obtain the output frame.
When the error is identified in the hyper-hyper-latent flow, the error mitigation process may comprise applying the hyper-hyper-latent residual, the hyper-latent residual and the latent residual to obtain the output frame.
When the error is identified in a plurality of the latent residual, the hyper-latent residual, the hyper-hyper-latent residual, the latent flow, the hyper-latent flow and the hyper-hyper-latent flow, the error mitigation process may comprise applying a process based on the other of the latent residual, the hyper-latent residual, the hyper-hyper-latent residual, the latent flow, the hyper-latent flow and the hyper-hyper-latent flow not identified as containing an error to obtain the output frame.
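Purely as an illustrative sketch of the error mitigation logic described above, the following Python function shows how a decoder might fall back to the motion-compensated prediction when the residual is lost, or to a previously decoded flow when the flow is lost. The function and parameter names, and the combination of the warped frame and residual by addition, are assumptions made for the example.

```python
def mitigate_and_reconstruct(prev_frame, output_flow, output_residual, warp_fn,
                             flow_errored=False, residual_errored=False,
                             prev_output_flow=None):
    """Obtain an output frame from flow and residual, applying simple error mitigation."""
    if flow_errored and prev_output_flow is not None:
        # Flow affected by an error: reuse (or estimate from) previously decoded flow(s).
        output_flow = prev_output_flow
    warped = warp_fn(prev_frame, output_flow)   # motion-compensated prediction
    if residual_errored:
        # Residual affected by an error: use the warped previous frame alone.
        return warped
    return warped + output_residual
```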
The error mitigation process may be applied to a sub-section of at least one of the set of residual parameters and the set of flow parameters. The error mitigation process may be applied to a sub-section of at least one of the latent residual or the latent flow.
At the second computer system, each of the third, fourth, seventh, eighth, eleventh and twelfth trained neural networks may be selected from a corresponding set of trained neural networks before decoding; wherein each selection is based on the error mitigation process.
According to the present invention, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first neural network to produce a set of residual parameters; encoding the flow using a second neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
According to the present invention, there is provided a method for lossy video receipt and decoding, the method comprising the steps of: receiving a set of residual parameters and a set of flow parameters transmitted by a first computer system at a second computer system, the set of residual parameters and the set of flow parameters corresponding to an input frame; identifying an error in at least one of the set of residual parameters and the set of flow parameters; and using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
According to the present invention, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent and producing an auxiliary latent based on the extracted values; transmitting the latent representation and the auxiliary latent to a second computer system; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
The method may further comprise the step of compressing the auxiliary latent prior to transmitting the auxiliary latent to the second computer system.
The compression may be lossless or lossy compression. The compression may comprise encoding the auxiliary latent using a third trained neural network to obtain an auxiliary hyper-latent; and at the second computer system, decoding the auxiliary hyper-latent using a fourth trained neural network to obtain an output auxiliary latent; wherein the output auxiliary latent is used to verify the latent representation. The auxiliary latent may comprise a plurality of values, wherein each of the plurality of values is based on a plurality of values of the latent representation.
At least two of the plurality of values of the auxiliary latent may be based on the same value of the latent representation.
Each of the values of the latent representation may be used to determine only one of the plurality of values of the auxiliary latent.
Each of the plurality of values of the auxiliary latent may be based on a linear or non-linear combination of the plurality of the values of the latent representation.
The method may further comprise the steps of: at the first computer system, entropy encoding the latent representation before transmission; encoding the latent representation using a fifth trained neural network to produce a hyper-latent representation; transmitting the hyper-latent representation to the second computer system; decoding the hyper-latent representation using a sixth trained neural network to obtain an output; and entropy decoding the latent representation, wherein the output of the sixth trained neural network is used during the entropy decoding. The output of the sixth trained neural network may comprise one or more of the mean and the standard deviation of the probability distribution of the values of the latent representation.
The method may further comprise the steps of: at the first computer system, entropy encoding the auxiliary latent before transmission; and at the second computer system, entropy decoding the auxiliary latent, wherein the output of the sixth trained neural network is used during the entropy decoding.
The method may further comprise the step of, when an error is detected in at least one of the values of the latent representation in the verification step, replacing the error values with a replacement value based on at least one of the other values of the latent representation and at least one value of the auxiliary latent.
The replacement value may be additionally based on the output of the sixth trained neural network.
The replacement value may be selected from a probability distribution based on at least one of the mean and the standard deviation of the probability distribution of the values of the latent representation.
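As a purely illustrative sketch of the auxiliary latent described above, the following Python (PyTorch) functions form each auxiliary value as a linear combination (here, a sum) over a block of latent values, and use the auxiliary latent to flag blocks whose values no longer agree. The block size, the tolerance and the function names are assumptions made for the example.

```python
import torch

def make_auxiliary_latent(latent, block=4):
    """Each auxiliary value is a linear combination (a sum) of a block of latent values.

    Assumes the latent has shape (C, H, W) with H and W divisible by `block`."""
    c, h, w = latent.shape
    return latent.reshape(c, h // block, block, w // block, block).sum(dim=(2, 4))

def verify_latent(latent, auxiliary, block=4, tol=1e-3):
    """Recompute the auxiliary latent and flag blocks that disagree with the received one."""
    recomputed = make_auxiliary_latent(latent, block)
    return (recomputed - auxiliary).abs() > tol   # True where an error is detected
```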
The steps of the production and transmission of the auxiliary latent and verification using the auxiliary latent may be performed or not performed based on a predetermined setting.
A plurality of auxiliary latents may be produced, transmitted and used in the verification step. According to the present invention, there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: encoding the input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent and producing an auxiliary latent based on the extracted values; transmitting the latent representation and the auxiliary latent.
According to the present invention, there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation and an auxiliary latent transmitted by a first computer system at a second computer system, the latent representation and the auxiliary latent corresponding to an input image; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention, there is provided a data processing system configured to perform any of the methods above. According to the present invention, there is provided a data processing apparatus configured to perform any of the methods above.
According to the present invention, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the methods above. According to the present invention, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the methods above.
According to the present invention there is provided a method for lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame.
The method may further comprise the step of: determining a flow between the input frame and a previous frame; wherein the information based on the previously decoded frame comprises the flow; and wherein the motion compensation in the location corresponding to the sub-section is set to zero prior to using the flow to obtain the output frame.
The method may further comprise the step of: obtaining a mask corresponding to the location of the sub-section. The mask may be an additional input to the first trained neural network when encoding the input frame to obtain the latent representation.
The mask may be additionally transmitted to the second computer system; and the mask may be an additional input to the second trained neural network when decoding the latent representation to produce the output frame.
The sub-section of the input frame may be encoded using the first trained neural network and the latent representation may be decoded using the second trained neural network to obtain an output sub-section; and the method may further comprise the steps of: encoding the section of the frame not associated with the sub-section using a third trained neural network to obtain a second latent representation; additionally transmitting the second latent representation to the second computer system; and decoding the second latent representation using a fourth trained neural network to obtain an output section; wherein the information based on the previously decoded frame is applied to the output section; and the output sub-section and the output section are combined to obtain the output frame. The method may further comprise repeating the steps of the method for a further input frame, wherein the location of the sub-section of the further input frame is different to the location of the sub-section of the input frame.
The method may be repeated for a plurality of further input frames such that each frame location corresponds to at least one of the plurality of sub-sections of the input frames. The method may further comprise the steps of: encoding the flow using a fifth trained neural network to produce a latent flow; and at the second computer system, decoding the latent flow using a sixth trained neural network to retrieve the flow.
The height of the sub-section may be equal to the height of the input frame. The width of the sub-section may be equal to the width of the input frame.
The sub-section may extend diagonally across the input frame.
The sub-section may comprise pixels that are not adjacent.
The input frame may be divided into a plurality of blocks and the sub-section comprises at least one pixel associated with each block. The arrangement of pixels associated with the sub-section within each of the plurality of blocks may be the same.
The arrangement of pixels associated with the sub-section within at least one of the plurality of blocks may extend diagonally across the block.
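By way of illustration only, the following Python function sketches one way of obtaining a mask for a sub-section whose pixels extend diagonally within each block of the input frame, with an offset that may be cycled on successive frames so that every pixel location is eventually covered. The block size, the offset scheme and the function name are assumptions made for the example.

```python
import numpy as np

def diagonal_sub_section_mask(height, width, block=8, offset=0):
    """Boolean mask selecting a diagonal arrangement of pixels within each block.

    Incrementing `offset` for successive frames shifts the diagonal so that, over
    `block` frames, every pixel location belongs to the sub-section at least once.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    return ((ys + xs + offset) % block) == 0
```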
The method may further comprise the steps of: at the second computer system, identifying one or more values of the latent representation that have been affected by an error; identifying if the one or more values of the output image corresponding to the one or more values of the latent representation that have been affected by an error are located within the sub-section; replacing one or more of the one or more identified values of the output image with a replacement value, wherein the replacement value is based on one or more values from the sub-section if the identified values are not located within the sub-section.
The replacement value may be based on an interpolation of the one or more values from the sub-section.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input frames to produce a first trained neural network and a second trained neural network.
According to the present invention there is provided a method for lossy video encoding and transmission, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation.
According to the present invention there is provided a method for lossy video receipt and decoding, the method comprising the steps of: receiving a latent representation and a definition of a sub-section of an input frame transmitted by a first computer system; decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on a previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame.
According to the present invention there is provided a data processing system configured to perform the methods above.
According to the present invention there is provided a data processing apparatus configured to perform the methods above.
According to the present invention there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the methods above. According to the present invention there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the methods above.
BRIEF DESCRIPTION OF THE DRAWINGS
Aspects of the invention will now be described by way of examples, with reference to the following figures in which:
Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
Figure 4 illustrates an example of a video compression, transmission and decompression system.
DETAILED DESCRIPTION OF THE DRAWINGS
Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression.
In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H x W x C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image. Each H x W data point of the image represents a pixel value of the image at the corresponding location. Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video.
The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network.
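By way of illustration only, a distortion function based on the mean square error mentioned above might be written as follows in Python (PyTorch); the function name is an assumption made for the example.

```python
import torch

def mse_distortion(x, x_hat):
    """Mean square error between the input image x and the output image x_hat."""
    return torch.mean((x - x_hat) ** 2)
```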
Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
Al based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients dL/dy of the loss function. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, in which simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
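As an illustrative sketch only, a generic training loop reflecting the epochs, batch updates and learning rate described above might take the following form in Python (PyTorch). The use of the Adam optimiser and the particular argument names are assumptions made for the example.

```python
import torch

def train(model, data_loader, loss_fn, epochs=10, learning_rate=1e-4):
    """Iterate over the dataset for several epochs, updating parameters batch by batch."""
    optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for epoch in range(epochs):          # one epoch = one full pass over the training data
        for batch in data_loader:        # each batch triggers one parameter update
            output = model(batch)
            loss = loss_fn(output, batch)
            optimiser.zero_grad()
            loss.backward()              # back-propagate the gradient of the loss
            optimiser.step()             # step size governed by the learning rate
    return model
```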
The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and depend on, for example, a number of epochs, a plateauing of any further learning, or some other metric and is not considered to be essential in achieving the advantages described herein.
More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately.
In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ * R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
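A minimal sketch of this loss in Python is given below; the function and argument names, and the default value of the Lagrange multiplier, are assumptions made for the example.

```python
def rate_distortion_loss(distortion, rate, lagrange_multiplier=0.01):
    """Loss = D + lambda * R: the Lagrange multiplier weights rate against distortion."""
    return distortion + lagrange_multiplier * rate
```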
In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function f acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function.
In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be, for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network.
In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function g acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder.
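For illustration only, the encode, quantise and decode path of the pipeline of Figure 1 might be sketched as follows in Python (PyTorch), with the entropy encoding and decoding of the quantised latent omitted for brevity; the function names are assumptions made for the example.

```python
import torch

def compress_and_decompress(x, encoder, decoder):
    """Sketch of the pipeline of Figure 1 (entropy coding of y_hat omitted)."""
    y = encoder(x)              # latent representation of the input image
    y_hat = torch.round(y)      # quantisation by a rounding function
    x_hat = decoder(y_hat)      # output image, an approximation of x
    return x_hat, y_hat
```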
Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions close to, or at, the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions. In other cases, it allows the learning of more flexible latent representation distributions, which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
Something which is closely linked to the entropy encoding of the latent distribution, and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches, is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or another quantisation model may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
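The following Python (PyTorch) sketch illustrates, purely by way of example, the three quantisation behaviours discussed above: rounding for inference, additive uniform noise, and a straight-through estimator for training. The function name and the mode strings are assumptions made for the example.

```python
import torch

def quantise(y, mode="round"):
    """Quantisation surrogates: rounding, noise quantisation, or straight-through estimator."""
    if mode == "noise":
        # Additive uniform noise in [-0.5, 0.5): differentiable everywhere during training.
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    if mode == "ste":
        # Rounds on the forward pass; the gradient passes straight through on the backward pass.
        return y + (torch.round(y) - y).detach()
    return torch.round(y)  # inference-time rounding into unit-sized bins
```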
In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned, which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes.
Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving optimal rate and distortion trade off.
The system described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
The AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder and a trained neural network 125 acting as a hyper-decoder. An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent representation that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above.
In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-decoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150.
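Purely as an illustrative sketch, the role of the hyper-network in supplying entropy parameters (rather than a reconstruction) might look as follows in Python (PyTorch); the assumption that the hyper-decoder outputs a mean and a log-scale in separate channels, and the function names, are examples only.

```python
import torch

def hyper_entropy_parameters(y, hyper_encoder, hyper_decoder):
    """The hyper-decoder output parameterises the entropy model of the latent y."""
    z = hyper_encoder(y)                      # hyper-latent representation
    z_hat = torch.round(z)                    # quantised (entropy coding of z_hat omitted)
    params = hyper_decoder(z_hat)
    mean, log_scale = params.chunk(2, dim=1)  # assumed channel layout of the output
    scale = torch.exp(log_scale)              # standard deviation of the latent's distribution
    return mean, scale
```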
Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150,155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised.
To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between an input ground truth and an input fake.
When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6.
Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by decoder 120.
Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
As described above, I-frames do not rely on information from other frames, so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in x0 indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0, which in this case is an I-frame. The decoding step may be performed both locally, at the same location where the input image compression occurs, and at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline.
In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂t-1 from an earlier frame t-1. As described above, the previously reconstructed image x̂t-1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
At the encode side, previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed xt and the previously reconstructed image from an earlier frame x̂t-1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂t-1 and xt, which is indicative of inter-frame movement of pixels or pixel groups between x̂t-1 and xt. The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f. The reconstructed flow map f is applied to the previously reconstructed image x̂t-1 to generate a warped image x̂t-1,w. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image x̂t-1,w is a prediction of how the previously reconstructed image x̂t-1 might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of xt and x̂t-1.
As with the I-frame, the reconstructed flow map f and corresponding warped image x̂t-1,w may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
In the example of Figure 3, both the image being compressed xt and the warped image x̂t-1,w are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image xt and the warped image x̂t-1,w. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between xt and x̂t-1,w and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function.
Finally, on the decode side, the residual map r is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂t, which is a reconstruction of image xt and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂t can then be used to process the next frame. That is, it can be used to compress, transmit and decompress xt+1, and so on until an entire video stream or chunk of a video stream has been processed.
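By way of illustration only, bi-linear warping of the previously reconstructed frame by a dense flow map, followed by addition of the residual, might be sketched as follows in Python (PyTorch). The tensor layouts, the use of grid_sample for the warp and the combination by addition are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def warp(prev_frame, flow):
    """Bi-linearly warp prev_frame (N, C, H, W) by flow (N, 2, H, W) of per-pixel displacements."""
    n, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W), x then y
    grid = base + flow
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0              # normalise to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(prev_frame, grid.permute(0, 2, 3, 1), align_corners=True)

def reconstruct_p_frame(prev_frame, flow, residual):
    """Apply the reconstructed flow map to the previous frame and add the residual map."""
    return warp(prev_frame, flow) + residual
```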
Thus, for a block of video frames comprising an I-frame and n subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable.
Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689) which is hereby incorporated by reference.
The above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂t-1) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively. That is, the compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU.
The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images of a scene, referred to hereafter as a video stream 404. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based communications 409.
The second device 402 receives the communicated bitstream 408b, which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames, which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
It is envisaged that the system 400 of Figure 4 may be used for live streaming of a 1080p video stream at 30fps, which requires the cumulative latency of the encode and decode sides to be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above.
Replacement values
As discussed above in the description of an AI compression process, prior to the transmission of the latent representation or quantized latent across the communication network, the latent representation is entropy encoded to produce the bitstream. If an error occurs during transmission of the bitstream between two computer systems, such as the loss of a packet, at least a part of the section of the bitstream that comprised the lost packet may no longer be able to be entropy decoded. This means that one or more values of the latent representation will be affected by the error in transmission. For example, the one or more values of the latent representation may not be recovered from the entropy decoding process.
Such a latent representation may still be decoded by a neural network acting as a decoder to obtain an output image or frame as discussed above. However, the distortion between the retrieved output image or frame and the corresponding input image or frame may be increased.
As an alternative, before the decoding of the latent representation to obtain an output image or frame, one or more of the identified values in the latent representation may be replaced by a replacement value. After replacement of the identified values with the replacement value, the latent representation may be decoded by the neural network acting as a decoder as described above. This may decrease the distortion of the retrieved output image compared to the input image.
Due to the properties of the range encoding and decoding algorithm, at least some or all of the data in the bitstream after the point at which the error occurred may be lost. In this case, every value of the latent representation corresponding to this section of the bitstream may be identified as containing the error. One or more of these values may be replaced by a replacement value before decoding by the neural network acting as a decoder to obtain the output image or frame.
Alternatively, before transmission, the latent representation may be split into a plurality of sub-sections and each sub-section may be separately entropy encoded to produce the bitstream 130. This results in a bitstream with sections corresponding to each of the plurality of sub-sections of the latent representation. The tensor defining the latent representation can be split in any fashion to generate the corresponding sections of the bitstream. For example, splitting along the column or channel dimensions is equally acceptable. Alternatively, the latent may be flattened into a one-dimensional tensor and split into arbitrary chunks. If an error occurs during transmission of the bitstream, such as the loss of a packet, only the corresponding sub-section of the latent representation is affected. This means that, if such an error occurs, only one or more of the sub-sections will be missing and therefore only one or more parts of the retrieved latent representation will be missing. Each of the identified values of the one or more sub-sections may therefore be replaced with the replacement value.
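As an illustrative sketch, assuming the latent is held as a NumPy array, flattening and splitting it into chunks that are entropy encoded independently (so that a packet loss only invalidates one chunk) may look as follows; the function names are illustrative only:

import numpy as np

def split_latent(latent, n_chunks):
    # flatten the latent tensor and split it into roughly equal chunks;
    # each chunk is entropy encoded into its own section of the bitstream
    return np.array_split(latent.reshape(-1), n_chunks)

def reassemble_latent(chunks, shape):
    # inverse operation on the decode side, once each section has been
    # entropy decoded (or replaced, if its packets were lost)
    return np.concatenate(chunks).reshape(shape)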
The replacement value may be selected from a number of possible different replacement values depending on the nature of the error and the training of the networks acting as encoders and decoders in the AI compression pipeline. For example, the input image or frame and the quantization process used during encoding may define a probability distribution of a respective latent representation, as discussed above. The probability distribution defines all of the possible values that the values of the latent representation may take. The replacement value used may be selected from the possible values of the probability distribution. For example, the replacement value may be a random sample of the probability distribution defining the latent representation.
The same replacement value may be used for each value identified as containing an error in the latent representation. Alternatively, different replacement values may be used. The value selected from the probability distribution may be the modal or most likely value of the probability distribution. The replacement value may be a zero value.
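A hedged sketch of this replacement step, assuming the entropy model supplies a per-value array of modal values (for example the means of a Gaussian model), might be:

def replace_errored_values(latent, error_mask, mode=None):
    # error_mask: boolean array marking values affected by the transmission error
    out = latent.copy()
    if mode is None:
        out[error_mask] = 0.0                 # zero replacement
    else:
        out[error_mask] = mode[error_mask]    # modal (most likely) value of the entropy model
    return out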
Alternatively, the replacement value may be selected such that it does not correspond to any possible value of the probability distribution associated with the latent representation. In this case, the replacement value may be considered to be an error indicator. If the AI compression pipeline is trained with this method, the neural network acting as a decoder may interpret the replacement value as an error indicator, resulting in a certain process being performed by the neural network acting as a decoder to obtain the output image or frame.
When the AI compression pipeline is being used for compression, transmission and decompression of a video, previously decompressed frames of the video will be available at the computer system at which decoding of the current frame is being performed. In this case, the replacement value used may be based on one or more of the previously decoded frames. For example, the location of the error values within the latent representation may be identified, and the values at the same location in one or more previously received latent representations corresponding to the previous frames may be used to determine the replacement value. For example, the values of the latent representation of the previously decoded frame may be used as the replacement values at corresponding locations. A flow based on the current frame and a previous frame may be applied to the previous frame or previous latent representation before the replacement value is obtained.
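A sketch of replacing lost values with the co-located values of the previous frame's latent, optionally after motion-compensating that latent, is given below; warp_latent is a hypothetical helper standing for whatever warping operator is used, and the names are illustrative only:

def replace_from_previous_latent(latent, error_mask, prev_latent, flow=None, warp_latent=None):
    out = latent.copy()
    reference = prev_latent
    if flow is not None and warp_latent is not None:
        # optionally motion-compensate the previous frame's latent before copying values
        reference = warp_latent(prev_latent, flow)
    out[error_mask] = reference[error_mask]    # copy co-located values into the errored positions
    return out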
Alternatively, an AI based process may be used to obtain the replacement value. For example, a neural network may receive one or more previously received frames or corresponding latent representations and optionally a mask indicating the location of the values identified as having an error in the current latent representation. The output of this neural network may then be used as the replacement values in the processes described above.
The output image or frame obtained from a latent representation containing replacement values may be used to obtain future frames of the video in a decompression process. As discussed above, an AI compression pipeline may comprise a hyper-network where the latent representation is encoded by a neural network acting as an encoder to obtain a hyper-latent representation. This hyper-latent representation may be decoded by a neural network acting as a decoder to obtain an output that may be used in the decoding process in various ways. A packet loss during transmission of the bitstream may result in at least a section of the received hyper-latent representation comprising an error value. In the case where the hyper-latent representation is providing information to be used to entropy decode the latent representation, the section of the hyper-latent representation identified as containing an error will correspond to an area of the corresponding entropy decoded latent representation. In this case, the corresponding values of the latent representation may be replaced with replacement values.
The method described above may be used without any specific training of the AI compression pipeline. This means the AI compression pipeline may be trained without errors or replacement values being introduced. The process of identifying errors and inserting replacement values may then be introduced when the trained AI compression pipeline is in use.
Alternatively, the process described above may be introduced prior to training of the AI compression pipeline taking place. In this case, errors may be artificially introduced into latent representations and corrected by replacement values during the training process of the AI compression pipeline described in detail above.
Bitstream Swap
As discussed above, AI-based compression pipelines may incorporate flow processes when compressing, transmitting and decompressing a video comprising multiple frames. In an example of such a process, the compression process may be divided into two parts. In a first part, a flow may be determined between the current frame to be compressed and a previous frame. This flow represents the relative motion between the present frame and the previous frame. In a second part, a residual between the current frame and the previous frame may be determined. The residual represents the change between the current frame and the previous frame that is not represented by the flow between the frames. The residual may be determined by subtracting the present frame from a warped frame obtained by applying the determined flow to the previous frame.
In such a process, the determined flow and residual may each be encoded by a corresponding neural network acting as an encoder to obtain a latent flow and a latent residual. These latents may subsequently be quantized, entropy encoded and transmitted as a bitstream in a similar manner to that discussed above for a single image or frame. Each of the flow and residual may then be decoded by a corresponding neural network acting as a decoder at the receiver end. The flow may be applied to a previously decoded frame and then the residual applied to obtain an output frame that is an approximation of the input frame.
The application of a hyper-network in an AI compression process for a single image or frame is discussed above. In a similar manner, a hyper-network may be added to each of the flow and residual compression pipelines discussed above, resulting in the generation of a hyper-latent flow and hyper-latent residual. Furthermore, a hyper-hyper network may be added to each of the hyper-network pipelines, resulting in the generation of a hyper-hyper-latent flow and hyper-hyper-latent residual. In each of these examples, the addition of each pipeline therefore results in a corresponding section of the bitstream for each latent. For example, if a hyper-network is present in both of the flow and residual pipelines, the bitstream corresponding to a single frame will comprise four sections: a latent flow section, a hyper-latent flow section, a latent residual section and a hyper-latent residual section. If a hyper-hyper network is present in each of the flow and residual pipelines, the bitstream will comprise six sections. The collection of the latents related to the flow may be referred to as a set of flow parameters. The collection of latents related to the residual may be referred to as a set of residual parameters.
If an error such as a packet loss occurs in the transmission of the bitstream, this may result in the loss of one or more of the sections of the bitstream described above. For example, a packet loss may cause an error in the flow latent for a particular frame, but the other sections, for example at least the residual latent, may be received without an error. When the hyper and hyper-hyper components of the set of flow and residual parameters are present, one or more of these may also be received without an error. When one or more of the set of flow or residual parameters is determined to contain an error, an error mitigation process may additionally be performed at the decoding stage based on which of the set of flow or residual parameters has been received. This error mitigation process may change based on which parameters have been received.
For example, in the case where an error is identified in the latent residual such that the latent residual may not be used in the decoding process, the error mitigation process may comprise applying the output flow to a previously decoded output frame corresponding to the previously decoded frame of the video to obtain the output frame. In this case, the output frame may have increased distortion compared to the case where the latent residual is present at the decoding stage, but the distortion of the output frame may still be lower than if no mitigation were performed. Alternatively, an estimated latent residual may be used during the decoding process. The estimated residual may be based on one or more previously received latent residuals. A neural network which receives a plurality of previously received latent residuals as an input may be used to obtain the estimated latent residual as an output.
In another example, in the case where the error is identified in the latent flow, the error mitigation process may comprise applying a previously decoded output flow corresponding to the previously decoded frame to obtain the output frame. Alternatively, an estimated flow may be used during the decoding process. The estimated flow may be based on a plurality of previously decoded output flows. For example, the change between the previous two output flows may be applied to the previous output flow to obtain an estimated flow. A neural network which receives the plurality of previously decoded output flows as an input may be used to obtain the estimated flow as an output.
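One simple instance of estimating a flow from previously decoded output flows is linear extrapolation, sketched below under the assumption that the two most recent output flows are retained at the decoder:

def estimate_flow(flow_prev, flow_prev2):
    # apply the change between the two most recently decoded flows to the latest one
    return flow_prev + (flow_prev - flow_prev2)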
Different error mitigation processes may be applied in a corresponding way if an error is identified in any one of the hyper-latent flow, the hyper-latent residual, the hyper-hyper-latent flow and the hyper-hyper-latent residual as components of the set of flow parameters and the set of residual parameters. The error mitigation process may vary based on which of the set of parameters has or has not been received. The error mitigation process may be applied to a sub-section of at least one of the set of flow and residual parameters. For example, if the error is identified in one of the hyper or hyper-hyper latents, the values with an error may only correspond to a sub-section of the residual latent or the flow latent. In this case, the examples of the error mitigation process described above may only be applied to the corresponding sub-section of the latent.
The decoding process performed may be varied based on which of the set of flow and residual parameters is received. For example, a set of neural networks acting as a decoder may have been separately trained for each of the different error cases that are possible based on which combination of the flow parameters and residual parameters is received without error. When a particular combination of latents comprising an error is detected and a corresponding error mitigation process is applied, the corresponding neural network to be used as a decoder may be selected. This process of selecting the decoder from a set of possible decoders may be performed separately in each of the main, hyper and hyper-hyper pipelines for each trained neural network that is used as a decoder in that pipeline. The methods described above may be used without any specific training of the AI compression pipeline. This means the AI compression pipeline may be trained without an error mitigation process as described above being performed. The error mitigation process may then be introduced to the trained AI compression pipeline when the pipeline is in use.
Auxiliary latent
An alternative method to increase the error resilience of an AI based compression pipeline such as the example discussed above is to introduce an auxiliary latent. The auxiliary latent represents additional information that may be transmitted along with the latent representation, and may contain information that is redundant with the latent representation. After the input image or frame has been encoded by the neural network acting as an encoder to obtain a latent representation, an additional latent that may be referred to as an auxiliary latent may be produced.
The auxiliary latent may contain values based on the values of the latent representation. For example, one or more of the values of the auxiliary latent may be based on a linear or non-linear combination of one or more values of the latent representation. In a specific example, one or more of the values of the auxiliary latent may have been determined by the sum or the subtraction of a plurality of values of the latent representation. Each value of the auxiliary latent may be based on a plurality of values of the latent representation. In this case, the size or dimensions of the auxiliary latent may be reduced compared to that of the corresponding latent representation.
The auxiliary latent may be derived by applying a kernel to the latent representation. The kernel may be applied to the latent representation in a non-overlapping manner and may apply a dimensionality reduction. For example, the kernel may sum all values across the rows of the latent representation for one channel and one column.
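As an illustrative sketch, assuming a latent laid out as channels-by-rows-by-columns, such a kernel reduces each column of each channel to a single sum:

def auxiliary_latent(latent):
    # latent: (C, H, W); each auxiliary value is the sum of one column of one channel,
    # i.e. a non-overlapping reduction across the row dimension
    return latent.sum(axis=1)    # shape (C, W)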
There may be redundancy between different values of the auxiliary latent. This, for example, may mean that one or more of the values of the latent representation are each used to determine more than one of the values of the auxiliary latent. This means that information on each value of the latent representation is contained in more than one value of the auxiliary latent. Alternatively, there may be no redundancy in the values of the auxiliary latent. This may mean that one or more of the values of the latent representation are each only used in the determination of one of the values of the auxiliary latent.
A plurality of auxiliary latents may be obtained from a single latent representation. Redundancy may exist between the values of the plurality of auxiliary latents in the same way as for a single auxiliary latent. When a plurality of auxiliary latents are provided, the auxiliary latents may be derived using the same kernel as discussed above, but where the application of the kernel to the latent representation is offset for each iteration.
The auxiliary latent may be entropy encoded and transmitted together with the corresponding latent representation during operation of the AI compression pipeline. In this case, the bitstream therefore comprises a section corresponding to the latent representation and a section corresponding to the auxiliary latent. Alternatively, the auxiliary latent may undergo a further compression process before transmission. For example, the auxiliary latent may undergo an algorithm-based compression process.
Alternatively, the auxiliary latent may be encoded by a neural network acting as an encoder to obtain a hyper auxiliary latent. This hyper auxiliary latent may be transmitted and decoded by a neural network acting as a decoder to retrieve the auxiliary latent at the recipient end. The compression of the auxiliary latent may be a lossy or lossless compression process. Upon receipt, the auxiliary latent may be used to verify the values of the latent representation to ensure that no error has been introduced into the values of the latent representation. This may mean analyzing the values of the latent representation using the values of the auxiliary latent to determine that the values of the latent representation are correct. This analysis may be based on the relationship between the values of the auxiliary latent and the values of the latent representation that was used during the encoding process.
For example, in the case where the values of the auxiliary latent are based on a linear combination of a plurality of values of the latent representation, the verification step may comprise solving the equation defined by the linear relationship to determine that the values of the latent representation are correct. In the case where each of the values of the latent representation is used to determine more than one of the values of the auxiliary latent during the encoding process, the verification process may involve the solving of simultaneous equations based on the multiple linear relationships between the values of the auxiliary latent and the values of the latent representation.
In the case where an error is determined to be present in one or more values of the received latent representation, the auxiliary latent may further be used to correct the value of the latent representation that is determined to have an error. For example, the system of linear equations defined by the values of the auxiliary latent may be solved to derive the correct value of the latent representation. The value determined to have an error may then be replaced with the correct value. For example, a packet loss during transmission of the latent representation may result in a row of values in a channel of the latent representation being lost. If an auxiliary latent has been received which comprises values of the sum of the columns of that channel of the latent representation, each of the lost values of the latent representation may be recovered by solving the system of equations defined by the auxiliary latent that include the missing values.
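Continuing the column-sum example above, and assuming exactly one row of one channel has been lost, the missing values may be recovered by subtracting the received values from the transmitted column sums; this is a minimal sketch with illustrative names:

import numpy as np

def recover_lost_row(latent, aux, channel, lost_row):
    # aux[channel, w] equals the sum over rows of latent[channel, :, w]
    received = np.delete(latent[channel], lost_row, axis=0)   # all rows except the lost one
    latent[channel, lost_row, :] = aux[channel] - received.sum(axis=0)
    return latent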
Further information available at the decoding step may be used when determining the replacement value. For example, in the case where a hyper-network is present in the AI compression pipeline, information provided by the hyper-network at the decoding step may additionally be used to determine a replacement value. For example, if the hyper-network provides information about the probability distribution of the values that may be found in the latent representation, the information available about the probability distribution may be used to determine the replacement value. For example, the mean and the standard deviation of the probability distribution may be used in combination with the available information in the auxiliary latent to arrive at a more accurate estimate of the replacement value than would otherwise be possible.
When a hyper-network is present, the parameters provided by the hyper-network that may be used to entropy encode and entropy decode the latent representation for transmission may also be used to entropy encode and entropy decode the auxiliary latent. This is because the dependence of the values of the auxiliary latent on the values of the latent representation results in a relationship between the probability distribution associated with the values of the latent representation and the probability distribution associated with the values of the auxiliary latent, namely that the same parameters can be used for each.
Applying the process described above to replace values determined to have an error in the latent representation with corrected values derived from the auxiliary latent may also be used to recover the lost packet corresponding to the error. As discussed above, on receipt of the bitstream at the receiver, the bitstream may be entropy decoded to obtain the latent representation. This means that the individual packets making up the bitstream are entropy decoded in turn in order to retrieve the latent representation.
Due to the nature of the entropy decoding process, every packet is required in order to correctly retrieve the latent representation. As discussed above, this means that if a packet is lost, the remainder of the latent representation encoded after this point in the bitstream may not be recovered, even if the corresponding packets are present.
By using the process described above to correct the values of the latent representation corresponding to the lost packet, the entropy decoding process may be reversed to retrieve the lost packet. The entropy decoding process may then be continued using the retrieved packet to correctly obtain the remainder of the latent representation.
The presence of the auxiliary latent in the bitstream may require additional bitrate to transmit an image or frame. The decision to include an auxiliary latent in the compression and transmission of a particular image or frame may be made based upon a predetermined setting. For example, the AI based compression pipeline may be set in a non-error state where the auxiliary latent is not determined or transmitted. Alternatively, the pipeline may be set in an error resilient state where the auxiliary latent is derived and transmitted. The number or redundancy of the auxiliary latents derived may vary based upon the predetermined setting. For example, the number of auxiliary latents that are derived and transmitted may be increased if the increased error resiliency due to the increased number of auxiliary latents is determined to be more important than the increased bitrate needed to transmit the auxiliary latents.
Cyclic I-frame
The concept of I-frames and P-frames is discussed above. When a video comprising one or more I and P frames is compressed, transmitted and decompressed using an AI based compression pipeline such as the example given above, no information from previously decoded frames (sometimes referred to as temporal context) is used to compress, transmit and decompress an I-frame. In contrast to this, information from previously decoded frames is used to compress, transmit and decompress a P-frame.
In an alternative method, each frame may be considered a hybrid frame comprising both I and P components. In this method, for each input frame transmitted by the AI based compression pipeline, a particular region or sub-section of the input frame may be defined. The sub-section comprises one or more pixels of the input frame. This region or sub-section may be compressed, transmitted and decompressed without using temporal context. This means that, where a flow is determined between the input frame and a previous frame, any information that may be derived from that flow relating to the sub-section of the frame is not used during the decoding of the frame to obtain the output frame. The sub-section of the frame may be considered equivalent to an I-frame. At least some or all of the remaining section of the frame may be compressed, transmitted and decompressed using temporal context, for example a flow, as discussed above in the example of an AI based video compression pipeline. The remaining section of the frame may therefore be considered equivalent to a P-frame.
This approach may be applied in an AI based video compression pipeline as described above. A sub-section of the input frame to an AI based video compression pipeline may be identified before encoding the input frame using a trained neural network. On decoding, this sub-section may be decoded without the use of any information based on previously decoded frames, for example a flow between the input frame and a previous frame or a warped frame derived from a flow and a previously decoded frame as discussed above.
In the example of a flow based compression method such as that described above, the flow information corresponding to the location of the sub-section of the input frame may not be used when decoding the frame. For example, when the flow is used together with the previously decoded frame to obtain a warped or predicted frame which is subsequently used to obtain an output frame, the region of the warped or predicted frame corresponding to the location of the sub-section may be set to zero prior to being used to obtain the output frame. In an arrangement which uses a warped frame together with a decoded residual such as that described above, this may mean that all of the information required to obtain the sub-section of the output frame is contained within the residual.
A mask may be used to identify the location of the sub-section. The mask may be a tensor of equivalent dimensions to the input frame. A section of the mask at a location corresponding to the sub-section of the input frame may contain values to indicate the location of the sub-section. For example, the mask may have zero values at the location corresponding to the sub-section and unity values at every other location. Such an arrangement may allow the mask to be applied to the warped frame or any other output to remove information at a location corresponding to the location of the sub-section.
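A minimal sketch of applying such a mask to the warped or predicted frame, so that the sub-section carries no temporal context before the residual is added, is given below; a two-dimensional mask broadcast over the channel dimension is assumed for brevity:

def mask_warped_frame(warped, mask):
    # warped: (H, W, C) warped or predicted frame
    # mask: zeros at the sub-section (decoded without temporal context), ones elsewhere
    return warped * mask[..., None]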
When a mask is used to define the location of the sub-section of the input frame as described above, the mask may additionally be used as an input to one or more of the neural networks operating in the AI based video compression pipeline. For example, the mask may be an additional input to the trained neural network acting as an encoder of the input frame to obtain a latent representation, or any other trained neural network on the encoding side of the pipeline. Alternatively or additionally, the mask may be transmitted along with the bitstream to the receiving computer system. The mask may be used as an input to the neural network which decodes the latent representation to obtain the output frame, or any other trained neural network on the decoding side of the pipeline. The mask may be additionally compressed, transmitted and decompressed, for example using an AI based compression pipeline such as that discussed above.
In an arrangement, the identified sub-section of the input frame and at least some or all of the remaining section of the input frame may be transmitted using separate pipelines of an AI based compression process. For example, the identified sub-section of the input frame may be compressed, transmitted and decompressed using an I-frame part of an AI based compression process as described above. The at least some or all of the remaining section of the input frame may be compressed, transmitted and decompressed using the P-frame part of an AI based compression process as described above. In this arrangement, the output sub-section obtained from the I-frame part and the output section obtained from the P-frame part may be combined after decompression to obtain a final output frame. The mask described above may be used to identify the sections of each frame to be assigned to each part of the AI based compression pipeline. The mask may also be used during the step of combining the output sections to obtain the output frame. For example, the mask may be used to determine the location of each of the output sections within the output frame.
The location of the sub-section of the frame may vary between different frames of a video. A group of frames of a video may be referred to as a batch. Across a batch of frames, the location of the sub-section may change between each frame such that the entire area of the frame has been located within the sub-section at least once in the batch of frames. Compression, transmission and decompression of a batch of frames in this way may result in better compression performance as discussed above. This is because, as compression of frames without temporal context (I-frames) typically requires a higher bitrate than compression with temporal context (P-frames), determining a sub-section of each frame to not use temporal context may result in a stable average bitrate per frame, which leads to better performance.
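One hedged example of scheduling the sub-section as a horizontal stripe that scans down the frame, assuming for simplicity that the stripe height divides the frame height so that every row falls inside the sub-section once per cycle, is sketched below with illustrative names:

import numpy as np

def cyclic_subsection_mask(frame_index, height, width, stripe_height):
    # returns a mask with a zero-valued stripe whose position advances each frame;
    # after height // stripe_height frames every row has been refreshed without temporal context
    mask = np.ones((height, width))
    start = (frame_index * stripe_height) % height
    mask[start:start + stripe_height, :] = 0.0
    return mask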
The above description refers to the compression, transmission and decompression of frames. In this context, the discussion may also apply to the compression, transmission and decompression of residuals of frames as described above. The shape or area of the sub-section of the frame may take different forms. For example, the sub-section may extend across the entire height or the entire width of the input frame. When the location of the sub-section changes across different frames of a group of frames, the location of the sub-section may horizontally or vertically scan across the frame. Alternatively or additionally, the area defined by the sub-section may extend diagonally across at least part of the frame.
Alternatively or additionally, one or more input frames may be divided into a plurality of blocks. Each block may be the same size or one or more of the blocks may vary in size for one or more frames. The sub-section of the frame may be defined by an area in one or more of the defined blocks. The arrangement of the area of the sub-section in each block may be the same. For example, the area of the sub-section in each block may take one or more of the horizontal, vertical and diagonal shapes discussed above, and may additionally extend across each block in the same manner.
During transmission of a bitstream, errors can occur which result in one or more packets making up the bitstream not being received. In this case, the received latent representation may be missing sections of data. When the method discussed above is used in an AI based compression pipeline, the presence of the sub-section may allow lost data to be retrieved. This is because the bitstream corresponding to the sub-section and the bitstream corresponding to the remaining section of the frame may be located within different regions of the bitstream corresponding to the entire frame. For example, if the lost data is entirely located within the sub-section, information from the remaining section may be used to retrieve the information in the sub-section. Alternatively, if the lost data is located entirely in the remaining section, information from within the sub-section may be used to retrieve the lost data.
The lost data may be retrieved by a number of methods used for the prediction of values within a frame. For example, interpolation may be performed using information from the sub-section or the remaining section to obtain replacement values which are predictions of the original values that have been lost. The replacement values may be inserted into the output frame to replace the lost values.
The location of the sub-section as described above may be selected to increase the effectiveness of this error replacement. For example, the location of the sub-section may be distributed throughout the frame such that one or more pixels of the sub-section are not adjacent. Interleaving the sub-section in this way may increase the chance that a more accurate replacement value can successfully be derived.
The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.
The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Computers suitable for the execution of a computer program can be based, by way of example, on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Claims

1. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
2. The method of claim 1, wherein prior to the step of transmitting the latent representation, the latent representation is divided into a plurality of sub-latents; and the plurality of sub-latents are transmitted to the second computer system; wherein the step of identifying one or more values of the latent representation that have been affected by an error comprises identifying one or more of the plurality of sub-latents containing the one or more values that have been affected by an error; and the step of replacing one or more of the identified values of the latent representation with a replacement value comprises replacing each of the values of the identified sub-latents with the replacement value.
3. The method of claims 1 or 2, wherein the replacement value is selected from a set of values that correspond to the possible values of a probability distribution associated with the latent representation.
4. The method of any one of claims 1 to 3, wherein the replacement value is the modal value of the probability distribution associated with the latent representation.
5. The method of any one of claims 1 to 4, wherein the replacement value is a zero value.
6. The method of claims 1 or 2, wherein the replacement value is not a possible value of a probability distribution associated with the latent representation.
7. The method of any one of claims 1 to 6, wherein when the method is used for compression, transmission and decompression of a video, the replacement value is based on at least one previously decoded frame of the video.
8. The method of claim 7, wherein the replacement value corresponds to a value of a previously received latent representation corresponding to the at least one previously decoded frame of the video; wherein the location of the value of the previously received latent representation corresponds to the location of the identified value of the latent representation.
9. The method of claim 8, wherein a flow is applied to the previously received latent representation to obtain the replacement value.
10. The method of claim 7, wherein the replacement value is obtained from the output of a trained neural network that receives at least one of the previously decoded frames of the video and/or the corresponding previously received latent representation as an input.
11. The method of any one of claims 1 to 10 wherein when the method is used for compression, transmission and decompression of a video, the output frame obtained from the latent representation comprising at least one replacement value is used to compress, transmit and decompress a further frame of the video.
12. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; encoding the latent representation using a second trained neural network to produce a hyper-latent representation; entropy encoding the latent representation and the hyper-latent representation to obtain a bitstream; transmitting the bitstream to a second computer system; entropy decoding the bitstream to obtain the hyper-latent representation; identifying one or more values of the hyper-latent representation that have been affected by an error; decoding the hyper-latent representation using a third trained neural network to produce an output; entropy decoding the bitstream using the output of the third trained neural network to obtain the latent representation; replacing one or more of the values of the latent representation that correspond to the identified values of the hyper-latent representation with a replacement value; and decoding the latent representation using a fourth trained neural network to produce an output image, wherein the output image is an approximation of the input image.
13. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; replacing one or more of the values of the latent representation with a replacement value; and decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
14. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation transmitted by a first computer system at a second computer system, the latent representation corresponding to an input image; identifying one or more values of the latent representation that have been affected by an error; replacing one or more of the identified values of the latent representation with a replacement value; and decoding the latent representation using a first trained neural network to produce an output image, wherein the output image is an approximation of the input image.
15. A data processing system configured to perform the method of any one of claims 1 to 13.
16. A data processing apparatus configured to perform the method of claim 14.
17. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 14.
18. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 14.
19. A method for lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first trained neural network to produce a set of residual parameters; encoding the flow using a second trained neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
20. The method of claim 19, wherein the set of residual parameters comprises a latent residual and the set of flow parameters comprises a latent flow and the method further comprises the steps of: decoding the latent residual using a third trained neural network to obtain an output residual; and decoding the latent flow using a fourth trained neural network to obtain an output flow; wherein the output residual and the output flow are used to obtain the output frame.
21. The method of claim 20, wherein when the error is identified in the latent residual, the error mitigation process comprises applying the output flow to a previously decoded output frame corresponding to the previously decoded frame of the video to obtain the output frame.
22. The method of claim 20, wherein when the error is identified in the latent residual, the error mitigation process comprises obtaining the output frame as the output of a function that receives the output flow and one or more previously decoded output frames as an input.
23. The method of claim 20, wherein when the error is identified in the latent flow, the error mitigation process comprises applying a previously decoded output flow corresponding to the previously decoded frame to obtain the output frame.
24. The method of claim 20, wherein when the error is identified in the latent flow, the error mitigation process comprises applying an estimated output flow to obtain the output frame, wherein the estimated flow is obtained based on a plurality of previously decoded output flows.
25. The method of any one of claims 20 to 24, further comprising the steps of: encoding the latent residual using a fifth trained neural network to produce a hyper-latent residual, wherein the hyper-latent residual is included in the set of residual parameters; encoding the latent flow using a sixth trained neural network to produce a hyper-latent flow, wherein the hyper- latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-latent residual using a seventh trained neural network to obtain an output and using the output to obtain the latent residual; and decoding the hyper-latent flow using an eighth trained neural network to obtain an output and using the output to obtain the latent flow.
26. The method of claim 25, wherein when the error is identified in the hyper-latent residual, the error mitigation process comprises applying the hyper-latent flow and the latent flow to obtain the output frame.
27. The method of claim 25, wherein when the error is identified in the hyper- latent flow, the error mitigation process comprises applying the hyper-latent residual and the latent residual to obtain the output frame.
28. The method of any one of claims 25 to 27, further comprising the steps of: encoding the hyper-latent residual using a ninth trained neural network to produce a hyper-hyper-latent residual, wherein the hyper-hyper-latent residual is included in the set of residual parameters; encoding the hyper-latent flow using a tenth trained neural network to produce a hyper- hyper-latent flow, wherein the hyper-hyper-latent flow is included in the set of flow parameters; at the second computer system, decoding the hyper-hyper-latent residual using an eleventh trained neural network to obtain an output and using the output to obtain the hyper-latent residual; and decoding the hyper-hyper-latent flow using a twelfth trained neural network to obtain an output and using the output to obtain the hyper-latent flow.
29. The method of claim 28, wherein when the error is identified in the hyper-hyper-latent residual, the error mitigation process comprises applying the hyper-hyper-latent flow, the hyper-latent flow and the latent flow to obtain the output frame.
30. The method of claim 28, wherein when the error is identified in the hyper-hyper-latent flow, the error mitigation process comprises applying the hyper-hyper-latent residual, the hyper-latent residual and the latent residual to obtain the output frame.
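Claims 25 to 30 add a hyper-latent and a hyper-hyper-latent level above each of the residual and flow latents. The sketch below shows one branch of such a three-level hierarchy with untrained stand-in convolutions; the layer shapes are assumptions chosen only so the example runs, and entropy coding is omitted.

```python
# Sketch of one branch (the residual branch) of the hyper / hyper-hyper hierarchy of claims 25-30.
import torch
import torch.nn as nn

enc        = nn.Conv2d(3, 8, 5, stride=4, padding=2)              # latent encoder
hyper_enc  = nn.Conv2d(8, 8, 5, stride=2, padding=2)              # "fifth": latent -> hyper-latent
hyper2_enc = nn.Conv2d(8, 8, 5, stride=2, padding=2)              # "ninth": hyper -> hyper-hyper
hyper2_dec = nn.ConvTranspose2d(8, 8, 4, stride=2, padding=1)     # "eleventh"
hyper_dec  = nn.ConvTranspose2d(8, 8, 4, stride=2, padding=1)     # "seventh"
dec        = nn.ConvTranspose2d(8, 3, 8, stride=4, padding=2)     # latent decoder

x  = torch.rand(1, 3, 64, 64)
y  = enc(x)               # latent residual             (in the set of residual parameters)
z  = hyper_enc(y)         # hyper-latent residual       (in the set of residual parameters)
zz = hyper2_enc(z)        # hyper-hyper-latent residual (in the set of residual parameters)

# Decoder side: each level's decoded output is used to obtain the level below it.
z_context = hyper2_dec(zz)   # output used to obtain the hyper-latent residual
y_context = hyper_dec(z)     # output used to obtain the latent residual
x_hat     = dec(y)           # output residual
print(y.shape, z.shape, zz.shape, z_context.shape, y_context.shape, x_hat.shape)
```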
31. The method of claim 28, wherein when the error is identified in a plurality of the latent residual, the hyper-latent residual, the hyper-hyper-latent residual, the latent flow, the hyper-latent flow and the hyper-hyper-latent flow, the error mitigation process comprises applying a process based on those of the latent residual, the hyper-latent residual, the hyper-hyper-latent residual, the latent flow, the hyper-latent flow and the hyper-hyper-latent flow not identified as containing an error to obtain the output frame.
32. The method of any one of claims 19 to 31, wherein the error mitigation process is applied to a sub-section of at least one of the set of residual parameters and the set of flow parameters.
33. The method of claim 32 when dependent on claim 20, wherein the error mitigation process is applied to a sub-section of at least one of the latent residual or the latent flow.
34. The method of any one of claims 20 to 33, wherein, at the second computer system, each of the third, fourth, seventh, eighth, eleventh and twelfth trained neural networks is selected from a corresponding set of trained neural networks before decoding; wherein each selection is based on the error mitigation process.
35. A method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; obtaining a flow between the input frame and a previously decoded frame of the video; obtaining a residual based on the input frame and the flow; encoding the residual using a first neural network to produce a set of residual parameters; encoding the flow using a second neural network to produce a set of flow parameters; transmitting the set of residual parameters and the set of flow parameters to a second computer system; identifying an error in at least one of the set of residual parameters and the set of flow parameters; using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input frames to produce a first trained neural network and a second trained neural network.
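A minimal sketch of the training loop of claim 35 is given below, assuming stand-in encoder and decoder networks and omitting motion compensation and the error mitigation path; the auxiliary flow-reconstruction term is an assumption added only so both encoders receive gradients.

```python
# Sketch of the training loop of claim 35 (hypothetical stand-in networks).
import torch
import torch.nn as nn

first_net  = nn.Conv2d(3, 8, 5, stride=4, padding=2)             # residual encoder being trained
second_net = nn.Conv2d(2, 8, 5, stride=4, padding=2)             # flow encoder being trained
res_dec    = nn.ConvTranspose2d(8, 3, 8, stride=4, padding=2)
flow_dec   = nn.ConvTranspose2d(8, 2, 8, stride=4, padding=2)

opt = torch.optim.Adam(list(first_net.parameters()) + list(second_net.parameters()), lr=1e-4)
frames = [torch.rand(1, 3, 64, 64) for _ in range(4)]             # first set of input frames
prev = torch.rand(1, 3, 64, 64)                                   # previously decoded frame

for x in frames:
    flow = torch.zeros(1, 2, 64, 64)                              # flow estimation not sketched
    residual = x - prev                                           # zero-motion residual for the sketch
    res_params, flow_params = first_net(residual), second_net(flow)
    x_hat = prev + res_dec(res_params)                            # output frame (approximation of x)
    loss = torch.mean((x_hat - x) ** 2)                           # function of output/input difference
    loss = loss + torch.mean((flow_dec(flow_params) - flow) ** 2) # auxiliary term for the flow branch
    opt.zero_grad(); loss.backward(); opt.step()                  # update the first and second networks
```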
36. A method for lossy video receipt and decoding, the method comprising the steps of: receiving a set of residual parameters and a set of flow parameters transmitted by a first computer system at a second computer system, the set of residual parameters and the set of flow parameters corresponding to an input frame; identifying an error in at least one of the set of residual parameters and the set of flow parameters; and using the set of residual parameters, the set of flow parameters and an error mitigation process based on the identified error to obtain an output frame, wherein the output frame is an approximation of the input frame.
37. A data processing system configured to perform the method of any one of claims 20 to 35.
38. A data processing apparatus configured to perform the method of claim 36.
39. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claim 36.
40. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 36.
41. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent representation and producing an auxiliary latent based on the extracted values; transmitting the latent representation and the auxiliary latent to a second computer system; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
42. The method of claim 41, further comprising the step of compressing the auxiliary latent prior to transmitting the auxiliary latent to the second computer system.
43. The method of claim 42, wherein the compression is lossless or lossy compression.
44. The method of claim 42 or 43, wherein the compression comprises encoding the auxiliary latent using a third trained neural network to obtain an auxiliary hyper-latent; and at the second computer system, decoding the auxiliary hyper-latent using a fourth trained neural network to obtain an output auxiliary latent; wherein the output auxiliary latent is used to verify the latent representation.
45. The method of any one of claims 41 to 44, wherein the auxiliary latent comprises a plurality of values, wherein each of the plurality of values is based on a plurality of values of the latent representation.
46. The method of claim 45, wherein at least two of the plurality of values of the auxiliary latent are based on the same value of the latent representation.
47. The method of claim 45, wherein each of the values of the latent representation is used to determine only one of the plurality of values of the auxiliary latent.
48. The method of any one of claims 45 to 47, wherein each of the plurality of values of the auxiliary latent is based on a linear or non-linear combination of the plurality of the values of the latent representation.
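Claims 45 to 48 allow the auxiliary latent to be built from (linear or non-linear) combinations of latent values. One simple instance consistent with claims 47 and 48 is a per-block sum, sketched below with hypothetical helper names; the block size is an arbitrary assumption.

```python
# Sketch of an auxiliary latent built from non-overlapping blocks of the latent:
# each auxiliary value is a linear combination (here a sum) of several latent values,
# and each latent value contributes to exactly one auxiliary value.
import numpy as np

def make_auxiliary(latent, block=4):
    """Auxiliary latent: one value per `block x block` patch of the latent."""
    c, h, w = latent.shape
    return latent.reshape(c, h // block, block, w // block, block).sum(axis=(2, 4))

def verify(received_latent, received_aux, block=4, tol=1e-6):
    """Return a boolean map of auxiliary positions whose patch fails verification."""
    return np.abs(make_auxiliary(received_latent, block) - received_aux) > tol

latent = np.random.randn(8, 16, 16).astype(np.float32)
aux = make_auxiliary(latent)                      # produced at the first computer system

corrupted = latent.copy()
corrupted[0, 5, 5] += 3.0                         # simulated transmission error
bad = verify(corrupted, aux)
print(bad.any(), np.argwhere(bad))                # flags the patch containing (5, 5)
```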
49. The method of any one of claims 41 to 48, further comprising the steps of: at the first computer system, entropy encoding the latent representation before transmission; encoding the latent representation using a fifth trained neural network to produce a hyper-latent representation; transmitting the hyper-latent representation to the second computer system; decoding the hyper-latent representation using a sixth trained neural network to obtain an output; and entropy decoding the latent representation, wherein the output of the sixth trained neural network is used during the entropy decoding.
50. The method of claim 49, wherein the output of the sixth trained neural network comprises one or more of the mean and the standard deviation of the probability distribution of the values of the latent representation.
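Claims 49 and 50 use a hyper-encoder/decoder pair whose output parameterises the entropy model of the latent. A minimal sketch with untrained stand-in layers is shown below; predicting the mean and a log standard deviation is an assumption, and the entropy coder itself is omitted.

```python
# Sketch of claims 49-50: the hyper-network output supplies the mean and standard
# deviation used when entropy coding the latent representation.
import torch
import torch.nn as nn

enc       = nn.Conv2d(3, 8, 5, stride=4, padding=2)
hyper_enc = nn.Conv2d(8, 8, 5, stride=2, padding=2)                # "fifth" trained network
hyper_dec = nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1)      # "sixth" trained network

x = torch.rand(1, 3, 64, 64)
y = enc(x)                                    # latent representation
z = hyper_enc(y)                              # hyper-latent representation, also transmitted
mu, log_sigma = hyper_dec(z).chunk(2, dim=1)  # claim 50: per-value mean and (log) std
sigma = torch.exp(log_sigma)
# mu and sigma parameterise the probability model used to entropy encode/decode y,
# and are also available for the repair step of claims 52-54.
```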
51. The method of claim 49 or 50, further comprising the steps of: at the first computer system, entropy encoding the auxiliary latent before transmission; and at the second computer system, entropy decoding the auxiliary latent after transmission, wherein the output of the sixth trained neural network is used during the entropy decoding.
52. The method of any one of claims 41 to 51, further comprising the step of, when an error is detected in at least one of the values of the latent representation in the verification step, replacing the error values with a replacement value based on at least one of the other values of the latent representation and at least one value of the auxiliary latent.
53. The method of claim 52 when dependent on claim 49 or 50, wherein the replacement value is additionally based on the output of the sixth trained neural network.
54. The method of claim 53, wherein the replacement value is selected from a probability distribution based on at least one of the mean and the standard deviation of the probability distribution of the values of the latent representation.
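Claims 52 to 54 replace latent values flagged as erroneous using the entropy model's statistics. A minimal sketch is given below; the choice between copying the mean and sampling from a normal distribution, and all names, are illustrative assumptions.

```python
# Sketch of the replacement step of claims 52-54: flagged values are replaced by the
# predicted mean (claim 53) or by a draw from the predicted distribution (claim 54).
import numpy as np

def repair(latent, error_mask, mu, sigma, sample=False, rng=None):
    """Replace flagged latent values with mu, or with a draw from N(mu, sigma)."""
    rng = rng or np.random.default_rng(0)
    repaired = latent.copy()
    if sample:
        repaired[error_mask] = rng.normal(mu[error_mask], sigma[error_mask])
    else:
        repaired[error_mask] = mu[error_mask]
    return repaired

latent = np.random.randn(8, 16, 16)
mu, sigma = np.zeros_like(latent), np.ones_like(latent)   # from the sixth trained network
mask = np.zeros_like(latent, dtype=bool)
mask[0, 5, 5] = True                                       # flagged by the verification step
fixed = repair(latent, mask, mu, sigma, sample=True)
```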
55. The method of any one of claims 41 to 54, wherein the steps of the production and transmission of the auxiliary latent and verification using the auxiliary latent may be performed or not performed based on a predetermined setting.
56. The method of any one of claims 41 to 55, wherein a plurality of auxiliary latents are produced, transmitted and used in the verification step.
57. A method for lossy image or video encoding and transmission, the method comprising the steps of: encoding an input image using a first trained neural network to produce a latent representation; extracting a plurality of values from the latent representation and producing an auxiliary latent based on the extracted values; and transmitting the latent representation and the auxiliary latent.
58. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation and an auxiliary latent transmitted by a first computer system at a second computer system, the latent representation and the auxiliary latent corresponding to an input image; verifying the latent representation based on the values of the auxiliary latent; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
59. A data processing system configured to perform the method of any one of claims 41 to 56.
60. A data processing apparatus configured to perform the method of claims 57 or 58.
61. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claims 57 or 58.
62. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claims 57 or 58.
63. A method for lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame.
64. The method of claim 63, further comprising the step of: determining a flow between the input frame and a previous frame; wherein the information based on the previously decoded frame comprises the flow; and wherein the motion compensation in the location corresponding to the sub-section is set to zero prior to using the flow to obtain the output frame.
65. The method of claim 63 or 64, further comprising the step of: obtaining a mask corresponding to the location of the sub-section.
66. The method of claim 65, wherein the mask is an additional input to the first trained neural network when encoding the input frame to obtain the latent representation.
67. The method of claim 65 or claim 66, wherein the mask is additionally transmitted to the second computer system; and the mask is an additional input to the second trained neural network when decoding the latent representation to produce the output frame.
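Claims 63 to 67 define a refresh sub-section in which no previously decoded information is used, with the flow zeroed there and the mask supplied as an extra network input. The sketch below illustrates one such arrangement; the stripe-shaped mask, layer shapes and the simplistic prediction step are assumptions.

```python
# Sketch of claims 64-67 (hypothetical names): zeroed motion inside the sub-section,
# mask concatenated as an extra channel at the encoder and the decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

h, w = 64, 64
mask = torch.zeros(1, 1, h, w)
mask[..., 16:24] = 1.0                                   # full-height stripe sub-section (cf. claim 72)

x_t  = torch.rand(1, 3, h, w)                            # input frame
prev = torch.rand(1, 3, h, w)                            # previously decoded frame
flow = torch.rand(1, 2, h, w)
flow = flow * (1.0 - mask)                               # claim 64: zero motion inside the sub-section
# (a real decoder would motion-compensate `prev` with this flow; omitted here)

encoder = nn.Conv2d(3 + 1, 8, 5, stride=4, padding=2)    # claim 66: mask as an extra encoder input
latent = encoder(torch.cat([x_t, mask], dim=1))

decoder = nn.ConvTranspose2d(8 + 1, 3, 8, stride=4, padding=2)   # claim 67: mask fed to the decoder too
mask_small = F.avg_pool2d(mask, 4)                        # match the latent resolution
prediction = prev * (1.0 - mask)          # claim 63: no previously decoded information in the sub-section
output_frame = prediction + decoder(torch.cat([latent, mask_small], dim=1))
```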
68. The method of claim 63, wherein the sub-section of the input frame is encoded using the first trained neural network and the latent representation is decoded using the second trained neural network to obtain an output sub-section; and the method further comprises the steps of: encoding the section of the frame not associated with the sub-section using a third trained neural network to obtain a second latent representation; additionally transmitting the second latent representation to the second computer system; and decoding the second latent representation using a fourth trained neural network to obtain an output section; wherein the information based on the previously decoded frame is applied to the output section; and the output sub-section and the output section are combined to obtain the output frame.
69. The method of any one of claims 63 to 68, further comprising repeating the steps of the method for a further input frame, wherein the location of the sub-section of the further input frame is different to the location of the sub-section of the input frame.
70. The method of claim 69, wherein the method is repeated for a plurality of further input frames such that each frame location corresponds to at least one of the plurality of sub-sections of the input frames.
71. The method of any one of claims 64 to 70, further comprising the steps of: encoding the flow using a fifth trained neural network to produce a latent flow; and at the second computer system, decoding the latent flow using a sixth trained neural network to retrieve the flow.
72. The method of any one of claims 63 to 71, wherein the height of the sub-section is equal to the height of the input frame.
73. The method of any one of claims 63 to 72, wherein the width of the sub-section is equal to the width of the input frame.
74. The method of any one of claims 63 to 73, wherein the sub-section extends diagonally across the input frame.
75. The method of any one of claims 63 to 74, wherein the sub-section comprises pixels that are not adjacent.
76. The method of any one of claims 63 to 75, wherein the input frame is divided into a plurality of blocks and the sub-section comprises at least one pixel associated with each block.
77. The method of claim 76, wherein the arrangement of pixels associated with the sub-section within each of the plurality of blocks is the same.
78. The method of claim 76 or claim 77, wherein the arrangement of pixels associated with the sub-section within at least one of the plurality of blocks extends diagonally across the block.
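Claims 69, 70 and 74 to 78 describe a sub-section pattern that repeats per block, runs diagonally, and is shifted from frame to frame until every pixel has been covered. One mask generator consistent with that description is sketched below; the block size and the wrapped-diagonal formula are assumptions.

```python
# Sketch of the block-wise diagonal refresh pattern of claims 74-78, cycled per
# frame as in claims 69-70 so that every pixel is covered after `block` frames.
import numpy as np

def diagonal_mask(h, w, block=8, frame_index=0):
    ys, xs = np.mgrid[0:h, 0:w]
    return ((ys + xs + frame_index) % block == 0)      # same pattern repeated in every block

masks = [diagonal_mask(32, 32, block=8, frame_index=t) for t in range(8)]
covered = np.any(masks, axis=0)
print(masks[0].mean(), covered.all())                   # 1/8 of pixels per frame; full coverage in 8 frames
```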
79. The method of any one of claims 63 to 78, further comprising the steps of: at the second computer system, identifying one or more values of the latent representation that have been affected by an error; identifying if the one or more values of the output frame corresponding to the one or more values of the latent representation that have been affected by an error are located within the sub-section; replacing one or more of the one or more identified values of the output frame with a replacement value, wherein the replacement value is based on one or more values from the sub-section if the identified values are not located within the sub-section.
80. The method of claim 79, wherein the replacement value is based on an interpolation of the one or more values from the sub-section.
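Claims 79 and 80 conceal error-affected pixels that lie outside the sub-section by interpolating from sub-section values. A minimal one-dimensional sketch is shown below; restricting the interpolation to the pixel's row is an assumption made for brevity.

```python
# Sketch of claims 79-80: a pixel hit by an error outside the refresh sub-section is
# replaced by interpolating the nearest sub-section pixels along its row.
import numpy as np

def conceal(frame, error_yx, subsection_cols):
    """Replace the pixel at error_yx by interpolation over the sub-section columns."""
    y, x = error_yx
    if x in subsection_cols:
        return frame                                   # inside the sub-section: keep the decoded value
    cols = np.asarray(sorted(subsection_cols))
    out = frame.copy()
    out[y, x] = np.interp(x, cols, frame[y, cols])     # claim 80: interpolation of sub-section values
    return out

frame = np.random.rand(32, 32)                          # grayscale frame for the sketch
subsection_cols = list(range(12, 16))                   # stripe sub-section columns
out = conceal(frame, (5, 20), subsection_cols)
```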
81. A method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first neural network to produce a latent representation; transmitting the latent representation and the flow to a second computer system; and decoding the latent representation using information based on a previously decoded frame and a second neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on the previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame; evaluating a function based on a difference between the output frame and the input frame; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input frames to produce a first trained neural network and a second trained neural network.
82. A method for lossy video encoding and transmission, the method comprising the steps of: receiving an input frame at a first computer system; defining a sub-section of the input frame; encoding at least a section of the input frame using a first trained neural network to produce a latent representation; transmitting the latent representation.
83. A method for lossy video receipt and decoding, the method comprising the steps of: receiving a latent representation and a definition of a sub-section of an input frame transmitted by a first computer system; decoding the latent representation using information based on a previously decoded frame and a second trained neural network to produce an output frame, wherein the output frame is an approximation of the input frame; wherein the information based on a previously decoded frame associated with the area corresponding to the sub-section of the input frame is not used to obtain the output frame.
84. A data processing system configured to perform the method of any one of claims 63 to 81.
85. A data processing apparatus configured to perform the method of claims 82 or 83.
86. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of claims 82 or 83.
87. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claims 82 or 83.
PCT/EP2024/065017 2023-06-02 2024-05-31 Method and data processing system for lossy image or video encoding, transmission and decoding Pending WO2024246275A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GBGB2308321.5A GB202308321D0 (en) 2023-06-02 2023-06-02 Method and data processing system for lossy image or video encoding, transmission and decoding
GB2308321.5 2023-06-02
GB2310882.2 2023-07-14
GBGB2310882.2A GB202310882D0 (en) 2023-07-14 2023-07-14 Method and data processing system for lossy image or video encoding, transmission and decoding

Publications (1)

Publication Number Publication Date
WO2024246275A1 true WO2024246275A1 (en) 2024-12-05

Family

ID=91581165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2024/065017 Pending WO2024246275A1 (en) 2023-06-02 2024-05-31 Method and data processing system for lossy image or video encoding, transmission and decoding

Country Status (1)

Country Link
WO (1) WO2024246275A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060013318A1 (en) * 2004-06-22 2006-01-19 Jennifer Webb Video error detection, recovery, and concealment
WO2021220008A1 (en) 2020-04-29 2021-11-04 Deep Render Ltd Image compression and decoding, video compression and decoding: methods and systems

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AGUSTSSON, E.; MINNEN, D.; JOHNSTON, N.; BALLÉ, J.; HWANG, S. J.; TODERICI, G.: "Scale-space flow for end-to-end optimized video compression", IN PROCEEDINGS OF THE IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2020, pages 8503 - 8512
BALLÉ, JOHANNES ET AL.: "Variational image compression with a scale hyperprior", ARXIV PREPRINT ARXIV:1802.01436, 2018
CHENG YIHUA ET AL: "Grace++: Loss-Resilient Real-Time Video Communication under High Network Latency", 21 May 2023 (2023-05-21), XP093187514, Retrieved from the Internet <URL:https://arxiv.org/abs/2305.12333v1> [retrieved on 20240718] *
SANKISA ARUN ET AL: "Video Error Concealment Using Deep Neural Networks", 2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 7 October 2018 (2018-10-07), pages 380 - 384, XP033454665, DOI: 10.1109/ICIP.2018.8451090 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24733106

Country of ref document: EP

Kind code of ref document: A1

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112025026519

Country of ref document: BR