WO2025119707A1 - Method and data processing system for lossy image or video encoding, transmission and decoding
- Publication number
- WO2025119707A1 (PCT/EP2024/083641)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- neural network
- input
- output
- input image
- Prior art date
- Legal status: Pending
Classifications
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- G06N3/045—Combinations of networks
- G06N3/047—Probabilistic or stochastic networks
- G06N3/048—Activation functions
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06N3/088—Non-supervised learning, e.g. competitive learning
- G06T9/002—Image coding using neural networks
- H04N19/117—Filters, e.g. for pre-processing or post-processing
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
Definitions
- This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
- the compression of image and video content can be lossless or lossy.
- in lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression.
- with lossless compression, however, there is a limit to the reduction in data quantity that can be achieved.
- in lossy compression, some information is lost from the image or video during the compression process.
- known compression techniques attempt to minimise the apparent loss of information by removing information whose absence results in changes to the decompressed image or video that are not particularly noticeable to the human visual system.
- JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
- known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression.
- I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types and thus carry the most data. However, their independence provides several benefits, including serving as the starting point for decompression and enabling random access, which is crucial for functionalities like fast-forwarding or rewinding the video.
- P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression.
- a P-frame represents the difference between itself and the closest preceding I- or P-frame.
- the process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames depend on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
- B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process.
- by predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, decoding errors do not propagate from B-frames since they are not used as references for other frames.
- Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image or video and the compressed and decompressed image or video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content.
- AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
- An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference.
- An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression.
- Figure 3 of which shows an architecture that calculates optical flow with a flow model, UFlow, and encodes the calculated optical flow with a flow encoder, Eflow.
- the method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
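- For illustration only, the following minimal PyTorch-style sketch shows how a single training step of this kind might be organised; the encoder and decoder modules, the bilinear resampling, the MSE-based differences and the optimiser are assumptions made for the example rather than features taken from the method itself.

```python
# Minimal sketch of one training step (PyTorch assumed; `encoder` and `decoder`
# stand in for the first and second neural networks).
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, optimizer, x, factor=2):
    # Downsample the input image (bilinear downsampling is one possible choice).
    x_down = F.interpolate(x, scale_factor=1.0 / factor, mode="bilinear",
                           align_corners=False)
    y = encoder(x_down)              # latent representation
    x_hat = decoder(y)               # output image approximating the downsampled input
    # Upsample the output image back to the input resolution.
    x_up = F.interpolate(x_hat, scale_factor=factor, mode="bilinear",
                         align_corners=False)
    # Evaluate a function based on differences between the image pairs.
    loss = F.mse_loss(x_hat, x_down) + F.mse_loss(x_up, x)
    optimizer.zero_grad()
    loss.backward()                  # gradients for the first and second networks
    optimizer.step()
    return loss.item()
```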
- the method described above may comprise a third neural network for upsampling, and wherein the method may include updating the parameters of said third neural network based on the evaluated function.
- the method described above may comprise a downsampler configured for either bilinear or bicubic downsampling.
- the method described above may comprise a Gaussian blur filter in the downsampler.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of steps to produce the first and second trained neural networks, without performing said upsampling and downsampling and without updating the parameters of the third neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network, for a second number of said steps.
- the method described above may comprise a fourth neural network in the downsampler, and may further include updating the parameters of said fourth neural network based on the evaluated function.
- the method described above may comprise an upsampler configured for either bilinear or bicubic upsampling.
- the method as described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks, without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps.
- the method described above may comprise entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length.
- the method described above may comprise determining the difference between one or more of the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image based on the output of a fifth neural network acting as a discriminator.
- the method of any as described above may comprise calculating the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image.
- the difference is expressed in terms of a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
- the method described above may comprise a term defining a visual perceptual metric.
- the method described above may comprise a visual perceptual metric, wherein the term defining the metric comprises an MS-SSIM metric.
- a method of training one or more neural networks comprising receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image, upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network, evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image, updating the parameters of the third neural network based on the evaluated function, and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
- a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding comprises receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image approximating the input image. Further steps involve evaluating a function based on differences between various images and updating the parameters of the fourth neural network based on the evaluated function.
- the method described above may comprise producing the previously downsampled input image by performing bilinear or bicubic downsampling on the input image. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding.
- the method comprises the steps of receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
- a method for lossy image or video encoding, transmission and decoding comprising the steps of receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation.
- the method described above may comprise performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
- the method as described above may optionally comprise performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
- the method as described above may optionally comprise a downsampling operation that includes applying one or more convolutional layers with a kernel size based on a downsampling factor.
- the input to the downsampling operation may comprise the input image.
- the input to the downsampling operation may comprise a tensor representation of the input image.
- the method described above may comprise a downsampling operation performed by applying one or more convolutional layers configured with a stride equal to the downsampling factor.
- the number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor.
- the method may include performing a first convolution and a second deconvolution. This may further involve performing additional upsampling steps and utilizing additional layers such as MaxPool and ReLU.
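- A minimal sketch of a strided-convolution downsampler of the kind described above is given below, assuming PyTorch; the exact relationship between the downsampling factor, kernel size and number of filters shown here is an illustrative assumption.

```python
import torch.nn as nn

def make_conv_downsampler(in_channels=3, factor=2):
    # Downsampling by strided convolution, with no space-to-depth operation:
    # the stride equals the downsampling factor and the kernel size is based on it.
    out_channels = in_channels * factor * factor   # assumed channel scaling
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
    )
```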
- the method described above may comprise the input including a latent representation.
- the method may comprise a tensor representation of the latent representation or the output image as part of its input.
- the method as described may include upsampling layers having strides determined by an upsampling factor.
- the method described above may further comprise applying an activation function after each convolutional layer in the upsampling operation.
- the method described above may include the upsampling layers being selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, alternated with the convolutional layers.
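- The corresponding upsampling operation might be sketched as follows (again assuming PyTorch); here nearest-neighbour upsampling layers are alternated with convolutional layers, with an activation applied after each convolution, and the layer and channel counts are illustrative.

```python
import torch.nn as nn

def make_upsampler(channels=3, factor=2):
    # Nearest-neighbour upsampling alternated with convolutions, an activation
    # after each convolutional layer.
    return nn.Sequential(
        nn.Upsample(scale_factor=factor, mode="nearest"),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(),
    )
```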
- a method for lossy image or video encoding, transmission and decoding comprising the steps of receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image.
- the estimating of optical flow information further comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value.
- the method comprises estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation.
- the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image.
- a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image.
- a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image comprise sparse distributions in a spatial domain.
- the method described above may comprise assigning weights with values distributed according to a sub-Gaussian distribution.
- the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and/or pixel radius across which said differences are estimated.
- the method described above may comprise performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation.
- the method described above may comprise estimating a difference between an output of the second convolution operation and an output of the first convolution operation.
- the difference may comprise an absolute difference.
- the difference defines a cost volume.
- the method described above may comprise using the optical flow information to warp a representation of the second image.
- the method may involve estimating a difference between the warped second image and the input image in order to create a residual representation of the input image relative to the warped second image.
- the method described above may comprise: (i) using a third neural network to encode the residual representation of the input image; (ii) transmitting the encoded residual representation of the input image to the second computer system; (iii) using a fourth neural network to decode the residual representation of the input image; and (iv) using the decoded residual representation of the input image to produce said output image.
- the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
- a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation.
- the first convolution operation is defined by a 1x1 kernel.
- the second convolution operation is defined by a 3x3 kernel.
- the third convolution operation is defined by a 1x1 kernel.
- the method described above may comprise performing the second convolution operation to entangle information associated with respective pixels of the representation of the input image with information associated with pixels adjacent corresponding pixels in the representation of the second image.
- the method described above may comprise a first, second, and where present third convolution operation, wherein these operations are performed without group convolutions.
- one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks.
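- One way the compressively encoded cost volume described above could be sketched in PyTorch is shown below; the number of projection channels, the weight bound and the decision to fix the random projection are illustrative assumptions rather than details taken from the method.

```python
import torch
import torch.nn as nn

class CompressiveCostVolume(nn.Module):
    # Sketch: a fixed 1x1 random projection (first convolution), a 3x3 convolution
    # that entangles neighbouring pixels (second convolution), and an absolute
    # difference between the two outputs defining the cost volume.
    def __init__(self, in_channels=3, proj_channels=32, bound=0.1):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, proj_channels, kernel_size=1, bias=False)
        nn.init.uniform_(self.proj.weight, -bound, bound)   # weights in [min, max]
        self.proj.weight.requires_grad_(False)               # fixed random projection
        self.neighbour = nn.Conv2d(proj_channels, proj_channels, kernel_size=3,
                                   padding=1, bias=False)

    def forward(self, input_frame, second_frame):
        p_in = self.proj(input_frame)     # 1x1 projection of the input image
        p_sec = self.proj(second_frame)   # 1x1 projection of the second image
        mixed = self.neighbour(p_sec)     # mix in adjacent pixels of the second image
        return torch.abs(p_in - mixed)    # absolute difference -> cost volume
```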
- a distribution of pixel values of the input image and of the second image are sparse and incoherent in a spatial domain and/or a transform of a spatial domain.
- a system configured to perform any of the above methods.
- a method for lossy image or video encoding and transmission includes receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system.
- estimating the optical flow information comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value.
- a method for lossy image or video decoding comprising receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image based on a compressively encoded cost volume; receiving optical flow information at a second computer system, wherein the optical flow information is indicative of a difference between a representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, which approximates the input image.
- an apparatus configured to perform any of the above methods.
- a method for estimating a difference between a first image and a second image comprises performing a first convolution operation on respective pixels of a representation of the first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by estimating a compressively encoded cost volume indicative of said differences.
- the method may include performing a second convolution operation on an output of the first convolution operation, and estimating a difference between an output of the second convolution operation and an output of the first convolution operation.
- the method described above may comprise performing a second convolution operation that entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent corresponding pixels in the representation of the second image.
- the method described above may include a first convolution operation where one or more filters are applied with weights having values randomly distributed between a minimum value and a maximum value.
- the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and pixel radius across which said differences are estimated.
- the method described above may comprise a difference comprising an absolute difference.
- the method described above may comprise defining a cost volume based on the difference.
- the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
- the method described may involve adjusting a kernel size of the second convolution operation to be larger than that of the first.
- the method described above may comprise a first convolution operation defined by a 1x1 kernel.
- the method described above may include the step whereby the second convolution operation is defined by a 3x3 kernel.
- the method described above may comprise the third convolution operation defined by a 1x1 kernel.
- the method described above may comprise storing a plurality of respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks.
- the method described above may comprise using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches.
- a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and third trained neural networks.
- the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks.
- at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable.
- the method comprises producing the second variables from the first variables by mapping the first variables to the second variables.
- the mapping is defined by an auxiliary function.
- the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function.
- at least one input of said inputs to the auxiliary function is also an output of the auxiliary function.
- the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space.
- the input space matches the output space.
- the auxiliary function is based on the third neural network.
- the third neural network comprises a residual decoder neural network.
- the at least one input to the auxiliary function that is also an output of the auxiliary function comprises said latent representation of the first image.
- the method comprises weighting the Jacobian penalty term.
- said weighting is based on a difference between the first image and the second image.
- said weighting is defined by a weighted norm based on a matrix associated with said rate of change.
- the method comprises estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change.
- approximating the norm of the matrix comprises making a single sample approximation.
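- A single-sample approximation of a Jacobian norm of this kind could be sketched as below (PyTorch assumed); the finite-difference probe direction and step size are illustrative choices, not details of the claimed method.

```python
import torch

def jacobian_penalty(aux_fn, x, eps=1e-3):
    # Probe the rate of change of aux_fn with a single random unit direction v:
    # (aux_fn(x + eps*v) - aux_fn(x)) / eps approximates the Jacobian-vector product.
    v = torch.randn_like(x)
    v = v / (v.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1))) + 1e-12)
    jvp = (aux_fn(x + eps * v) - aux_fn(x)) / eps
    return jvp.flatten(1).pow(2).sum(dim=1).mean()  # single-sample squared-norm estimate
```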
- the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps.
- said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images.
- a method of performing lossy image or video encoding, transmission and decoding comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to any of the methods described above.
- a method of performing lossy image or video encoding and transmission comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to any of the methods described above.
- a method of performing lossy image decoding comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to any of the methods described above.
- a method of performing lossy image or video decoding comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to any of the methods described above.
- a data processing apparatus configured to perform any of the above methods.
- a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the above methods.
- a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of the above methods.
- Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
- Figure 4 illustrates an example of a video compression, transmission and decompression system.
- Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 6 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 7 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 8 illustrates an example of an image or video compression, transmission and decompression pipeline.
- Figure 9 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline.
- Figure 10 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline.
- Figure 11 illustrates an example of how optical flow information may be calculated between two images.
- Figure 12a illustrates steps of a MAD cost volume calculation.
- Figure 12b illustrates steps of a MAD cost volume calculation.
- Figure 12c illustrates steps of a MAD cost volume calculation.
- Figure 13 illustrates steps of an RKADe cost volume calculation.
- Figure 14 illustrates an example of an image or video compression, transmission and decompression pipeline.
- DETAILED DESCRIPTION OF THE DRAWINGS
- Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information.
- Image and video information is an example of information that may be compressed.
- the file size required to store the information, particularly when referring to the compressed file produced by a compression process, may be referred to as the rate.
- compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file.
- Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x.
- the data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image.
- Each of the H × W data points of the image represents a pixel value of the image at the corresponding location.
- Each channel of the C channels of the image represents a different component of the image for each pixel, and these components are combined when the image file is displayed by a device.
- an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively.
- the image information is stored in the RGB colour space, which may also be referred to as a model or a format.
- colour spaces or formats include the CMYK and the YCbCr colour models.
- the channels of an image file are not limited to storing colour information and other information may be represented in the channels.
- as a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video.
- Each image making up a video may be referred to as a frame of the video.
- the output image may differ from the input image and may be represented by x̂.
- the difference between the input image and the output image may be referred to as distortion or a difference in image quality.
- the distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between input image and the output image in a numerical way.
- the distortion function may comprise a trained neural network.
- the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner.
- a relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
- AI based compression processes may involve the use of neural networks.
- a neural network is an operation that can be performed on an input to produce an output.
- a neural network may be made up of a plurality of layers.
- the first layer of the network receives the input.
- One or more operations may be performed on the input by the layer to produce an output of the first layer.
- the output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way.
- the output of the final layer is the output of the neural network.
- Each layer of the neural network may be divided into nodes.
- Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer.
- Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer.
- the one or more operations may include a convolution, a weight, a bias and an activation function.
- Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
- Each of the one or more operations is defined by one or more parameters that are associated with each operation.
- the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer.
- each of the values in the weight matrix is a parameter of the neural network.
- the convolution may be defined by a convolution matrix, also known as a kernel.
- one or more of the values in the convolution matrix may be a parameter of the neural network.
- the activation function may also be defined by values which may be parameters of the neural network.
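- As a purely illustrative example of the operations and parameters described above, the short PyTorch snippet below builds a single layer with a weight matrix, a bias and an activation function and counts its learnable parameters; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

# A single layer as described above: a weight matrix, a bias and an activation.
layer = nn.Sequential(nn.Linear(in_features=8, out_features=4), nn.ReLU())

x = torch.randn(1, 8)   # input received from the nodes of the previous layer
out = layer(x)          # each output node combines weighted inputs plus a bias

# The 4x8 weight matrix and the 4-element bias vector are the learnable parameters.
print(sum(p.numel() for p in layer.parameters()))  # -> 36
```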
- the parameters of the network may be varied during training of the network.
- Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
- a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known.
- the initial parameters of the neural network are randomized and the first training input is provided to the network.
- the output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced.
- This process is then repeated for a plurality of training inputs to train the network.
- the difference between the output of the network and the expected output may be defined by a loss function.
- the result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function.
- Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function with respect to the parameters θ.
- a plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network.
- a system of this type, in which simultaneous training is performed with back-propagation through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression.
- an end-to-end learned system learns for itself during training which combination of parameters best achieves the goal of minimising the loss function.
- training means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the above-described rate-distortion loss function.
- the training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm.
- the model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
- the training data may be divided into smaller subsets known as batches.
- the size of a batch, referred to as the batch size, may influence the training process.
- a smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources.
- a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
- the learnable parameters are updated by a specified amount each time, determined by the learning rate.
- the learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
- the training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data.
- Regularization techniques such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
- training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters.
- the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise.
- the point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend on, for example, a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately. In the case of AI based image or video compression, the loss function may be defined by the rate-distortion equation.
- the multiplier λ in the rate-distortion equation may be referred to as a Lagrange multiplier.
- the Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
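- For concreteness, the rate-distortion loss referred to above is often written in the form sketched below; the symbols are illustrative (R is the rate term estimated from the quantised latent ŷ, D is the distortion between the input image x and the output image x̂, and λ is the Lagrange multiplier), and the exact formulation used in any given embodiment may differ.

```latex
\mathcal{L} = R(\hat{y}) + \lambda \, D(x, \hat{x})
```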
- a training set of input images may be used.
- An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/~xypan/research/snr/Kodak.html).
- An example training set of input images is the IMAX image set.
- An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download).
- An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
- An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1.
- an input image 5 is provided.
- the input image 5 is provided to a trained neural network 110 characterized by a function f_θ acting as an encoder.
- the encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5.
- the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent.
- the quantisation process transforms the continuous latent representation into a discrete quantized latent.
- An example of a quantization process is a rounding function.
- the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130.
- the entropy encoding process may be for example, range or arithmetic encoding.
- the bitstream 130 may be transmitted across a communication network.
- the bitstream is entropy decoded in an entropy decoding process 160.
- the quantized latent is provided to another trained neural network 120 characterized by a function g_θ acting as a decoder, which decodes the quantized latent.
- the trained neural network 120 produces an output based on the quantized latent.
- the output may be the output image of the AI based compression process 100.
- the encoder-decoder system may be referred to as an autoencoder.
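- In simplified form, the encode/decode flow described above might be sketched as follows (PyTorch assumed); entropy encoding and decoding of the bitstream are abstracted away, and rounding stands in for the quantisation process.

```python
import torch

def compress(encoder, x):
    # Encoder side: produce and quantise the latent representation. The quantised
    # latent would then be entropy encoded (e.g. range or arithmetic coding)
    # into a bitstream for transmission.
    y = encoder(x)
    y_hat = torch.round(y)
    return y_hat

def decompress(decoder, y_hat):
    # Decoder side: the quantised latent recovered by entropy decoding the
    # bitstream is passed through the decoder to give the output image.
    return decoder(y_hat)
```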
- Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data.
- one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions.
- Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream.
- this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference. Something which is closely linked to the entropy encoding of the latent distribution, and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches, is the quantisation step.
- while a rounding function may be used to quantise a latent representation distribution into bins of given sizes, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back-propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system.
- a straight-through estimator (STE) quantisation model or other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
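- The two quantisation surrogates mentioned above can be sketched as follows (PyTorch assumed); these are standard formulations rather than the specific models used in any particular embodiment.

```python
import torch

def quantise_noise(y):
    # Additive-uniform-noise surrogate: differentiable everywhere, suitable for
    # evaluating the rate/entropy term during training.
    return y + (torch.rand_like(y) - 0.5)

def quantise_ste(y):
    # Straight-through estimator: rounds on the forward pass, but the gradient
    # flows through unchanged because the rounding offset is detached.
    return y + (torch.round(y) - y).detach()
```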
- end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function.
- the systems described above may be distributed across multiple locations and/or devices.
- the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server.
- the decoder 120 may be located on a separate device which may be referred to as a recipient device.
- the system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
- the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process.
- the hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder f_θ^h and a trained neural network 125 acting as a hyper-decoder g_θ^h.
- An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above.
- the neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110.
- the hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation.
- the hyper-latent is then quantized in a quantization process 145 characterised by Q_h to produce a quantized hyper-latent.
- the quantization process 145 characterised by Q_h may be the same as the quantisation process 140 characterised by Q discussed above.
- the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135.
- the bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent.
- the quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder.
- the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115.
- the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100.
- the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
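- A sketch of how the hyper-network might supply such entropy parameters is given below (PyTorch assumed); splitting the hyper-decoder output into a mean and a scale, and rounding as the quantisation, are illustrative assumptions.

```python
import torch

def entropy_parameters(hyper_encoder, hyper_decoder, y):
    # The hyper-latent is quantised and (conceptually) entropy encoded into a
    # side bitstream; the hyper-decoder output parameterises the probability
    # model used to entropy code the latent y.
    z = hyper_encoder(y)                  # hyper-latent representation
    z_hat = torch.round(z)                # quantised hyper-latent
    params = hyper_decoder(z_hat)
    mean, scale = params.chunk(2, dim=1)  # e.g. mean and standard deviation
    residual = y - mean                   # residual latent, optionally normalised by scale
    return mean, scale, residual
```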
- the residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent.
- the residual values may also be normalised.
- a training set of input images may be used as described above.
- the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step.
- the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
- the training process may further include a generative adversarial network (GAN).
- an additional neural network acting as a discriminator is included in the system.
- the discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake.
- the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input.
- a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
- the output image 6 may be provided to the discriminator.
- the output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
- the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process.
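- The adversarial part of such a training setup could be sketched as below (PyTorch assumed); a logit-producing discriminator and the standard binary cross-entropy formulation are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(discriminator, x, x_hat):
    # Discriminator loss: score the ground-truth input highly and the output image low.
    real = discriminator(x)
    fake = discriminator(x_hat.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    # Compression-pipeline loss term: encourage the output image to be scored as
    # ground truth; this term can be added to the distortion part of the loss.
    g_score = discriminator(x_hat)
    g_loss = F.binary_cross_entropy_with_logits(g_score, torch.ones_like(g_score))
    return d_loss, g_loss
```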
- Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
- after training, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination.
- Hallucination is the process of adding information in the output image 6 that was not present in the input image 5.
- hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120.
- the hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
- the frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process.
- frames which are decoded using no information from other frames may be referred to as I-frames.
- Frames which are decoded using information from past frames may be referred to as P-frames.
- Frames which are decoded using information from past frames and future frames may be referred to as B-frames.
- Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
- the images represented by each frame of a video may be related.
- a number of frames in a video may show the same scene.
- a number of different parts of the scene may be shown in more than one of the frames.
- objects or people in a scene may be shown in more than one of the frames.
- the background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame.
- the transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation.
- the flow may be represented by a vector.
- One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
- An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3.
- the process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
- I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2.
- an input image for the I-frame is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204.
- the bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image which in this case is an I-frame.
- the decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed I-frame to be available for later use by components of both the encoding and decoding sides of the pipeline.
- P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images from an earlier frame t-1.
- the previously reconstructed image is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
- previously reconstructed images may be used for generating flow maps containing information indicative of inter-frame movement of pixels between frames.
- both the image being compressed and the previously reconstructed image from an earlier frame are passed into a flow module part 206 of the pipeline.
- the flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from the image being compressed and the previously reconstructed image, which is indicative of inter-frame movement of pixels or pixel groups between the two frames.
- the latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208.
- the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map.
- the reconstructed flow map is applied to the previously reconstructed image to generate a warped image.
- any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used.
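- As an illustration only, a minimal sketch of bilinear warping of a previously reconstructed image by a dense flow map is given below; the tensor layout, the use of grid_sample and the normalisation details are assumptions for the example rather than features required by the method.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Bilinearly warp `image` (B, C, H, W) by a dense flow map `flow` (B, 2, H, W).

    flow[:, 0] and flow[:, 1] hold per-pixel horizontal and vertical
    displacements in pixels.
    """
    b, _, h, w = image.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(image.device)  # (1, 2, H, W)
    coords = grid + flow
    # Normalise coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(image, sample_grid, mode="bilinear", align_corners=True)
```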
- the warped image is a prediction of how the previously reconstructed image might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of the image being compressed and the previously reconstructed image.
- the reconstructed flow map and corresponding warped image may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
- both the image being compressed and the warped image are passed into a residual module part 210 of the pipeline.
- the residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image and the warped image.
- the latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted.
- the bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map from the decoded latent representation.
- a residual map may first be pre-calculated between the input image and the warped image, and the pre-calculated residual map may be passed into an autoencoder for compression only.
- the residual map is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image, which is a reconstruction of the input image and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image can then be used to process the next frame. That is, it can be used to compress, transmit and decompress the frame at position t+1, and so on until an entire video stream or chunk of a video stream has been processed.
- the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image.
- any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2.
- the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable.
- the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689).
- the above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. an I-frame) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
- Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400.
- the system 400 comprises a first device 401 and a second device 402.
- the first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices.
- the system 400 of Figure 4 performs inference on a CPU (or NPU or TPU as applicable) of the first and second devices respectively. That is, the compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402.
- the CPU of first and second devices 401, 402 may comprise, for example, a Qualcomm Snapdragon CPU.
- the first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene.
- the video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames.
- the blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401, for example via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
- the second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3.
- the output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402.
- the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU (or NPU or TPU) compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
- execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods.
- Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time.
- efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
- FIG. 5 illustrates an example of an image or video compression, transmission and decompres- sion pipeline 500.
- the pipeline illustrates a method of the present disclosure that corresponds to that described in relation to Figures 1 to 4. Like-numbered features correspond to those in Figures 1 to 4.
- the pipeline is further wrapped in a super-resolution wrapper. That is, the encoder is preceded by a downsampler 501, and the decoder is followed by an upsampler 502.
- the super-resolution wrapper 501, 502 is applied around the pipeline 500 during training, which comprises making the evaluated loss function be based on one or more terms from the list of: a difference between the output image and the input image, a difference between the output image and the downsampled input image, a difference between the upsampled output image and the input image, and/or a difference between the upsampled output image and the downsampled input image. Modifying the loss function in this way allows the pipeline to be "super-resolution" aware.
- the loss function comprises a term that is not just based on differences between the input to the encoder and the output of the decoder, but also or alternatively on the input and output of the downsampler 501 and/or the input and output of the upsampler 502.
- the weights of the neural networks of the pipeline 500, e.g. the encoder, the decoder, and/or the corresponding hyper-encoder and hyper-decoders, will be optimised to allow the encoder to produce a latent representation with a distribution that is optimally entropy encodable to hit low target bit rates while at the same time allowing the decoder to output images with distributions that can be optimally upsampled by the upsampler 502 to produce upsampled output images that are as close to the original input images as possible.
- making the neural compression pipeline 500 super-resolution aware in this way during training results in trained networks of the pipeline 500 that, when wrapped in the super-resolution down- and/or up-samplers 501, 502 during inference, produce output images that are closer to the input images than a network or networks that were not trained in a super-resolution aware manner.
- the method comprises receiving an input image at a first computer system, downsampling the input image with a downsampler 501 to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image (e.g. of the downsampled input image which in turn is an approximation of the original input image), upsampling the output image with an upsampler 502 to produce an upsampled output image, and evaluating a function (i.e. a loss function) based on one or more of the differences described above.
- the upsampler 502 may comprise a third neural network, and the method further comprises updating the parameters of the third neural network based on the evaluated function.
- the upsampler 602 is shown as a neural network.
- the upsampler neural network is responsible for upscaling the output image obtained from the second neural network back to its original size or resolution.
- the upsampler neural network may comprise a convolutional neural network architecture.
- the upsampler may be implemented using transposed convolutions or deconvolution layers. These layers perform the inverse operation of regular convolutions and can be used to increase the spatial resolution of an image.
- the upsampling process can be further enhanced using various techniques such as skip connections or residual connections.
- Skip connections allow for direct transmission of information from earlier layers in the network to later layers, bypassing some of the intermediate layers and thereby allowing the model to leverage detailed information present in the initial stages of processing. Residual connections add the output of a layer directly to the input of another layer, effectively performing addition or subtraction operations within the network.
- These techniques can improve the accuracy and stability of the neural network-based upsampler by allowing it to better capture fine details in the image. More specifically, a neural network based upsampler such as ⁇ can be trained together, e.g. in an end-to-end manner with the trainable neural networks of the neural compression pipeline 600, making the entire pipeline super-resolution aware. This is an important distinction compared to simply applying a super resolution wrapper to a neural or traditional compression pipeline.
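- As a non-limiting illustration of such an upsampler, a minimal sketch using a transposed convolution and a bilinear skip connection is given below; the 2x scale factor, the layer widths and the use of a bilinear skip path are assumptions for the example only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerNet(nn.Module):
    """Example 2x upsampler: a learned transposed-convolution path plus a
    bilinear skip connection carrying the low-frequency content."""

    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.head = nn.Conv2d(channels, hidden, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose2d(hidden, hidden, kernel_size=4, stride=2, padding=1)
        self.tail = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, x):
        skip = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        h = F.relu(self.head(x))
        h = F.relu(self.up(h))          # spatial dimensions doubled
        return self.tail(h) + skip      # skip/residual connection
```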
- the downsampler may be a traditional downsampler, for example it is envisaged that the downsampler may comprise a bilinear or bicubic downsampler.
- Bilinear and bicubic downsampling are exemplary methods used for image resizing. They involve reducing the resolution of an input image, for example by a factor of 2x2 (e.g., from 100x100 to 50x50). Further exemplary details of bilinear and bicubic downsampling are provided below.
- Bilinear Downsampling In bilinear downsampling, the algorithm approximates each output pixel value from the intensity values of the surrounding pixels in the input image. This method assumes that the pixel intensities vary smoothly across the image. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Set the output pixel value to be an average of the input pixels. 3. Repeat for each 2x2 block of the input image.
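- A minimal sketch of the 2x2 averaging operation described above is shown below; the equal weighting of the four input pixels and the even image dimensions are assumptions for the example.

```python
import numpy as np

def downsample_2x2_average(image):
    """Average each non-overlapping 2x2 block of a (H, W) or (H, W, C) image.

    H and W are assumed to be even for simplicity.
    """
    h, w = image.shape[:2]
    blocks = image.reshape(h // 2, 2, w // 2, 2, *image.shape[2:])
    # Each output pixel is the mean of its four input pixels A, B, C and D.
    return blocks.mean(axis=(1, 3))
```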
- Bicubic Downsampling In bicubic downsampling, the algorithm uses a third-order polynomial function to approximate the original pixel values based on the intensities of a neighborhood surrounding the current pixel. This method takes into account more details than bilinear interpolation but requires more computational resources.
- the basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Calculate the coefficients of the third-order polynomial function. 3. Calculate the output pixel values using the coefficients obtained in step 2. In the context of the compression pipeline 600 shown in Figure 6, either bilinear or bicubic downsampling methods can be used.
- the evaluated function (i.e. the loss function) may further be based on said differences comprising a structural similarity index measure (SSIM).
- SSIM is a quality metric that compares two images in terms of their structure and contrast. For example, when used here it evaluates the similarity between an output image and its corresponding input image or downsampled input image, upsampled output image, and/or any combination thereof.
- the method aims to optimize the neural networks for preserving the structural information in the images during the encoding and decoding process, thus improving the overall quality of the generated output images.
- since the human visual system is often able to perceive this higher level structural information, using SSIM allows the network to learn to optimise for this type of difference rather than a simpler mean squared error (MSE) loss.
- alternatively, MSE may be used as it is quicker and simpler to calculate and can accordingly speed up training times.
- the downsampling can be performed using a Gaussian blur filter. That is, it is envisaged that downsampling can be achieved through an implementation wherein the input image is filtered with a Gaussian blur.
- the Gaussian blur filter is a type of low-pass filter that smoothes out the image by reducing high-frequency details while preserving lower frequency information. This helps to reduce the complexity of the input image and makes it easier for the first, second and third neural networks to learn the underlying patterns in the data and may help the loss to converge during training.
- the downsampled input image produced is a smoother representation of the original input image, which can be used as an input for encoding using the first neural network. It is envisaged that end-to-end training is preferable as it makes the pipeline fully super-resolution aware. However, it can be challenging to get the loss to converge during training and/or for training to be stable when all neural networks are being optimised simultaneously.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. That is, the underlying compression pipeline 600 is trained first.
- the parameters of the first and second neural networks are then frozen, and the system proceeds with a secondary training phase in which it updates the third neural network's parameters.
- this split training approach can mitigate training instability.
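- For illustration only, a minimal sketch of such a split training schedule is given below; the optimiser, the step counts, the simple distortion loss and the omission of quantisation and entropy coding are assumptions made to keep the example short.

```python
import torch

def train_split(encoder, decoder, downsampler, upsampler, data,
                first_steps=100_000, second_steps=50_000, lr=1e-4):
    """Phase 1: train only the base compression pipeline.
    Phase 2: freeze it and train only the super-resolution wrapper."""
    base_params = list(encoder.parameters()) + list(decoder.parameters())
    opt1 = torch.optim.Adam(base_params, lr=lr)
    for _, x in zip(range(first_steps), data):
        x_hat = decoder(encoder(x))                   # quantisation/entropy coding omitted
        loss = torch.nn.functional.mse_loss(x_hat, x)
        opt1.zero_grad(); loss.backward(); opt1.step()

    for p in base_params:                             # freeze the base pipeline
        p.requires_grad_(False)

    opt2 = torch.optim.Adam(upsampler.parameters(), lr=lr)
    for _, x in zip(range(second_steps), data):
        x_hat = decoder(encoder(downsampler(x)))
        x_up = upsampler(x_hat)
        loss = torch.nn.functional.mse_loss(x_up, x)  # compare against the original input
        opt2.zero_grad(); loss.backward(); opt2.step()
```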
- the downsampler 701 may consist of a fourth neural network. Further, the training method also comprises updating the parameters of this fourth neural network based on the evaluated function.
- this approach provides the end-to-end system with an extra degree of freedom: it may produce downsampled input images that are not in any way visually pleasing or accurate representations of the input image, but which can be optimally encoded by the encoder into a latent representation that is distributed in a way that is efficiently entropy encodable and which can be decoded and subsequently upsampled into more accurate reconstructions of the input image than would otherwise be possible with a pipeline that is not super-resolution aware (in this case downsampler aware).
- the neural network downsampler can produce whatever output is needed by the neural compression pipeline to help achieve a target bitrate while maintaining final output image accuracy.
- One exemplary downsampler having a neural network architecture is a network comprising a plurality of convolutional layers with decreasing filter sizes and increasing strides, as this approach can effectively reduce the spatial dimensions of the input image while maintaining its overall structure.
- the CNN architecture typically consists of multiple layers, each layer being composed of several filters applied across the spatial dimensions of the input image.
- filters are learnable parameters that enable the network to extract various features from the image and recognize complex patterns or shapes within it.
- for a downsampler, we can start with an initial convolutional layer having a large filter size (e.g., 7x7) and a small stride (e.g., 2x2). This combination results in a significant reduction in spatial dimensions while still allowing the network to capture essential information from the input image.
- Subsequent layers can then use smaller filters (e.g., 3x3, 5x5) with larger strides (e.g., 2x2, 4x4), further reducing the size of the feature maps while also encouraging more localized receptive fields within the network.
- the choice of downsampler, and filter sizes and strides in a downsampler controls the balance between preserving important image details and efficiently processing the data.
- the downsampler may comprise multiple convolutional layers with decreasing filter sizes and increasing strides, followed by one or more max-pooling layers to further reduce the spatial dimensions of the input image. Reducing the number of layers and using small filters or kernels helps to speed up run time.
- a further illustrative downsampler may comprise a network architecture with a number of layers with a stride greater than 1. Every such layer will downsample by a factor of the stride.
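- As a non-limiting illustration of such an architecture, a sketch using stride-2 convolutions (each halving the spatial resolution) is given below; the specific kernel sizes, channel counts and overall 4x reduction factor are assumptions for the example.

```python
import torch.nn as nn

class ConvDownsampler(nn.Module):
    """Example 4x downsampler: two stride-2 convolutional layers, each
    reducing the spatial dimensions by a factor of 2."""

    def __init__(self, in_channels=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=7, stride=2, padding=3),  # H/2 x W/2
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),       # H/4 x W/4
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, kernel_size=3, stride=1, padding=1),  # back to image channels
        )

    def forward(self, x):
        return self.net(x)
```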
- the upsampler may comprise a bilinear or bicubic upsampler.
- Bilinear upsampling is a method for upsampling an image. It involves estimating output pixel values by linearly interpolating between the neighboring pixels in the input (lower resolution) image.
- An example implementation algorithm for bilinear upsampling may be as follows: 1. For each pixel in the output image, find its corresponding position in the input image. 2. Identify the four nearest input pixels around this position. 3. Set the output pixel value to a weighted average of these four pixels, weighted by their distances to the position. 4. Repeat steps 1-3 for every pixel in the output image.
- An example implementation algorithm for bicubic upsampling may be as follows: 1. For each pixel in the output image, find its corresponding pixel in the input image. 2. Find 16 nearby pixel coordinates around this central coordinate of the input image. These are typically referred to by their compass directions, for example northeast (NE), north-northeast (NNE), northwest (NW), southwest (SW), southeast (SE), south-southeast (SSE), south (SO), and south-southwest (SSW). 3. Fit a bicubic function with coefficients that comprise weighted sums of the values of the input pixels. 4. Repeat steps 1-3 for every pixel in the output image.
- bilinear and bicubic interpolation can produce reasonably good results when upsampling images, but the choice between them will depend on the specific use case and desired level of detail preservation.
- bilinear upsampling is envisaged to be preferred as it is faster and can reduce runtime while working in a pipeline 700 with a (typically slower) neural network based downsampler on the encode side.
- training a neural network upsampler in a fully end-to-end manner may introduce training instability and slow convergence. To address this, the training may be split into phases.
- the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps.
- training a neural network based downsampler in an end-to-end manner is further complicated because it is not straightforward what the downsampler's training objective should be, i.e. what its loss ought to be based on.
- the loss terms that include the output of the downsampler may compare the input image and the immediate output of the downsampler, or may be based on a previously downsampled version of the input image (e.g. created by a traditional downsampling method) to try to teach the downsampler to mimic a traditional downsampler, or on some other difference.
- both the upsampler and downsampler may be neural networks. This is shown in Figure 8 which corresponds to Figure 5 but where the upsamplers and downsamplers comprise neural networks. Like-numbered references indicate like features which are not repeated here for brevity.
- FIG. 8 illustratively shows a pipeline 800 comprising a neural network downsampler 801 and a neural network upsampler 802 wrapped around a neural compression pipeline. It is envisaged that these may be trained in an end-to-end fashion. More specifically, by making the loss function be based on comparisons between not just the input image and the final upsampled output image, but also between the outputs of the downsampler and the various other outputs of the pipeline, as well as optionally a previously downsampled image, the network learns to become super-resolution aware and outperforms networks where the training of the neural compression pipeline is not connected to the down- and up-samplers either through training or through the calculated comparisons of the terms of the loss function.
- the methods described above may comprise entropy encoding the latent representation into a bitstream with a specified length, wherein the function used in the method is also dependent on said bitstream length. That is, the loss function further comprises a rate term. Including the rate term in the loss function allows the networks to learn to optimise for bit rates (e.g. in bits per pixel) simultaneously with image reconstruction accuracy. Some or all of the loss terms may be scaled or weighted with respect to each other to focus the learning on any of the objectives as defined by the different loss terms.
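- An illustrative sketch of how distortion terms and a rate term might be weighted within a single loss is shown below; the particular terms, the weights and the bits-per-pixel rate proxy are assumptions, and in practice any subset of the differences listed above may be used.

```python
import torch.nn.functional as F

def total_loss(x, x_down, x_hat, x_up, bitstream_length_bits, num_pixels,
               w_full=1.0, w_low=0.5, w_rate=0.05):
    """Weighted rate-distortion loss for a super-resolution aware pipeline.

    x:      original input image
    x_down: downsampled input image
    x_hat:  decoder output (low resolution)
    x_up:   upsampled output image
    """
    distortion_low = F.mse_loss(x_hat, x_down)    # output vs downsampled input
    distortion_full = F.mse_loss(x_up, x)         # upsampled output vs original input
    rate = bitstream_length_bits / num_pixels     # bits per pixel (proxy for the rate term)
    return w_low * distortion_low + w_full * distortion_full + w_rate * rate
```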
- the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image may be determined based on the output of a fifth neural network acting as a discriminator.
- the output of the discriminator may be the differentiation between a ground truth image and a "fake" (i.e. compressed) reconstructed image during training.
- the discriminator loss term used in the training of the encoder/decoders of the AI compression pipeline is only a function of the compressed image.
- the training tries to encourage the neural networks to change in such a way that their output will be more realistic and like the ground truth image.
- Faithfulness to the ground truth image is taken care of by the distortion loss term (e.g. mean squared error) or other loss.
- the evaluation of the function based on these differences guides the process of updating the parameters of the first neural network (encoder) and the second neural network (decoder), leading to improved performance and better results over time. This approach can help to improve the overall quality of the generated output images.
- the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
- MSE measures the average squared difference between two images
- SSIM computes structural similarity based on luminance, contrast, and structure.
- the function may further comprise a term defining a visual perceptual metric that models how a human visual system may perceive differences.
- the term defining a visual perceptual metric may comprise an MS-SSIM loss. This loss function serves to gauge how effectively the network is approximating the input image with the output image. By iteratively minimizing this loss function through parameter updates in the neural networks, the trained neural network improves its ability to generate an output image that closely resembles the input image.
- the above described methods may be used in the context of any pre-trained neural compression network, and accordingly the present disclosure envisages a method where only the weights of the upsampler and/or downsampler are updated during training. Accordingly, such a method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image. Next, the output image is upsampled with an upsampler comprising a third neural network.
- the difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image is evaluated using a function, parameters of the third neural network are updated based on the evaluated function. These steps are then repeated using a first set of input images to produce a first trained neural network and a second trained neural network.
- the present disclosure envisages a method comprising receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image that approximates the input image, evaluating a function based on differences between the output image and other images, updating the parameters of the fourth neural network based on the evaluated function, and repeating these steps with a first set of input images to create trained versions of the first and second neural networks.
- producing the previously downsampled input image may be performed by either bilinear or bicubic downsampling techniques.
- the present disclosure also proposes using a network trained in accordance with the above-described methods. That is: receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
- a bitrate ladder refers to a set of predefined bitrates applied within an encoded file.
- bitrate ladder refers to a series of different bitrates that can be chosen to achieve the desired trade-off between video quality and file size.
- a balance is struck between two conflicting goals: achieving high visual quality while minimizing the file size.
- the process involves converting the raw data into a compressed format that requires less storage space. To accomplish this task in traditional compression, a number of known algorithms are used, often dictated by compression codec standards.
- H.264/AVC Advanced Video Coding
- the H.264 standard includes various profiles and levels that define the maximum bitrate and other parameters for a given video stream. Implementers of the standard typically target these profiles and levels to ensure their implementation is standard compliant. More specifically, when using these standards, implementers can choose from different preset bitrates. These predefined bitrates are often referred to as a "ladder" because they represent a series of steps or options available when choosing the optimal encoding settings for a given video file.
- bitrate ladders make use of the idea that the encode resolution can be varied a priori whereby the streaming provider has already pre-encoded its content at a plurality of different resolutions, which in turn facilitates giving an end user a choice of what quality setting to apply given some particular bitrate budget.
- the bitrate ladder approach works by progressively decreasing the bitrate from one level to another, allowing a given balance between quality and file size to be found. For example, starting at a higher-than-desired bitrate, you can gradually reduce it until you reach an acceptable level of visual degradation without sacrificing too much detail or clarity.
- the neural networks are typically trained to perform optimally at a given bitrate.
- covering all the predefined bitrates of a given bitrate ladder may require training separate neural networks for each predefined bitrate which can be burdensome and may result in the final codec library memory footprint being potentially very large.
- the above-described approaches may be used to make the base neural compression networks super-resolution aware, which allows a given base neural compression pipeline to be used not only for compression at its targeted bitrate, but also at other bitrates in the bitrate ladder by applying the super-resolution wrapper around the base models when desired.
- a research-stage approach that works well and fast on a GPU such as an NVIDIA 4090, A10 or A100 card is very unlikely to achieve the same performance on resource-constrained mobile device platforms such as laptops, tablets and smartphones.
- One area of AI-based video compression where this is particularly problematic is in the implementation of downsampling and upsampling algorithms. More particularly, one common component of such upsampling and downsampling algorithms is a process known as PixelShuffle and PixelUnshuffle. Both operations manipulate the arrangement of data in tensors (multi-dimensional arrays) that represent images. PixelShuffle is often used in super-resolution models. That is, in general terms, PixelShuffle increases the resolution of an input image by rearranging the elements of a tensor.
- the pseudocode below illustrates a PixelShuffle operation: Input Tensor: shape [batch, C·r², H, W], where: C is the number of channels (e.g., 3 for an RGB image), r is the upscale factor, H and W are the height and width of the tensor.
- Rearrangement of Data PixelShuffle rearranges elements in this tensor to form a new tensor of shape [batch, C, H·r, W·r]. Essentially, it "shuffles" the data from the channel dimension into the spatial dimensions (height and width). Upscaling: This operation effectively upscales the image by a factor of r, increasing the resolution.
- each pixel in the original image is rearranged to form a 2x2 block in the output image.
- PixelShuffle is used in the latter stages to upscale the low-resolution input to a high-resolution output. It’s a part of the sub-pixel convolution technique where the model first increases the number of channels with additional convolutions and then uses PixelShuffle to upscale the image spatially.
- PixelUnshuffle is the reverse operation of PixelShuffle. It’s used to decrease the spatial resolution of an image while increasing the number of channels.
- PixelUnshuffle operation Input Tensor: shape [batch, C, H, W].
- Rearrangement of Data PixelUnshuffle rearranges elements to form a new tensor of shape [batch, C·r², H/r, W/r]. It does so by taking spatial blocks of size r x r and stacking them depth-wise in the channel dimension.
- PixelUnshuffle can be used in tasks like feature extraction, where reducing spatial resolution while retaining information in the channel dimension might be beneficial. It’s also useful in certain generative models or autoencoders where manipulating spatial resolution at different stages of the network is required. More generally, PixelShuffle is used for upscaling an image by rearranging the channel data into the spatial dimensions, whereas PixelUnshuffle does the opposite, downscaling an image by rearranging spatial data into the channel dimension. PixelShuffle and PixelUnshuffle are specific implementations of "depth-to-space" and "space-to-depth" operations. These can be explained in their generalised form as follows.
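- The shape bookkeeping of these operations can be checked with the short sketch below, using the PixelShuffle and PixelUnshuffle modules available in common deep learning frameworks; the tensor sizes are arbitrary examples.

```python
import torch
import torch.nn as nn

r = 2                                  # upscale factor
x = torch.randn(1, 3 * r * r, 8, 8)    # [batch, C*r^2, H, W]

shuffle = nn.PixelShuffle(r)
unshuffle = nn.PixelUnshuffle(r)

y = shuffle(x)                         # depth-to-space: [1, 3, 16, 16]
z = unshuffle(y)                       # space-to-depth: back to [1, 12, 8, 8]

assert y.shape == (1, 3, 16, 16)
assert z.shape == (1, 12, 8, 8)
```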
- the pseudocode below illustrates a depth-to-space operation:
- Input Tensor The operation takes an input tensor of shape [batch, C·r², H, W], where C is the number of channels, r is the upscale factor, and H and W are the height and width.
- Rearrangement of Data It rearranges the elements of this tensor to form a new tensor of shape [batch, C, H·r, W·r]. This rearrangement involves redistributing the elements from the depth (channels) into the spatial dimensions (height and width).
- Upscaling Effect The result is an upscaling of the image or feature map by a factor of r, with each pixel in the original tensor contributing to a block of pixels in the output tensor.
- the pseudocode below illustrates a space-to-depth operation: Input Tensor: the input of a space-to-depth operation is a tensor of shape [batch, C, H, W].
- Rearrangement of Data Space-to-Depth rearranges elements to produce a new tensor of shape [batch, C·r², H/r, W/r]. It does this by taking blocks of pixels from the spatial dimensions and stacking them in the channel dimension.
- the above improvement is envisaged to be used in the context of, for example, super-resolution such as that described in concept 1, such as in the upsamplers and/or downsamplers in Figures 5 to 8. It is generally applicable to any instances where upsampling or downsampling might be performed in a neural compression pipeline.
- one or more layers in the first or second neural networks of Figures 1, 2, 11 or 14 may be configured to downsample or upsample an input.
- the flow module 206 in Figure 3 may comprise one or more layers or modules configured to downsample the input. This is one way to speed up runtime, as the flow often need not be estimated at as high a resolution as the input image: the quality of reconstructed images created with a low resolution flow can be similar to that of reconstructed images created with a high resolution flow. This downsampling, if performed using traditional space-to-depth, would be a bottleneck.
- the corresponding inverse upsampling may then be applied at the output of the flow module 206, again using convolutional operations rather than depth-to-space.
- a corresponding set of operations may be performed with the residual module of Figure 3, in any of the modules of the hypernetwork in Figure 2, and/or in the compression pipeline of Figure 1.
- An exemplary implementation of mimicking space-to-depth may be as follows. CONVOLUTIONAL LAYER SETUP: Kernel Size: The kernel size is envisaged to match the block size that is to be mimicked.
- the corresponding convolution kernel size would be r x r (2x2 in the above example).
- Stride The stride size is envisaged to equal the block size (r). This ensures that the convolutional filters move across the image in steps equal to the block size, effectively capturing the spatial blocks.
- Number of Filters It is envisaged that the number of filters is set to C·r², where C is the original number of channels.
- each filter produces an output that corresponds to one depth level in the Space-to-Depth transformation
- SEQUENTIAL CONVOLUTION LAYERS To fully replicate Space-to-Depth, it is envisaged that a number of convolutions may be used sequentially, e.g. by using a series of convolutional layers. This allows for the handling of cases where the channel increase (to C·r²) is significant. Each layer progressively accumulates more spatial information into the depth dimension. For completeness, it is also possible to replicate space-to-depth with a single strided convolution. Splitting it into multiple convolutions with activations between them provides additional expressive power, but isn’t needed to just replicate the functionality of space-to-depth.
- ACTIVATION FUNCTIONS It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially.
- An exemplary activation function may be ReLU, which introduces a non-linearity and helps in learning spatial patterns.
- CHANNEL REARRANGEMENT Finally, after these convolutional operations, the output channels may optionally be rearranged to match the order that a space-to-depth operation produces. Alternatively, this can be done at any point in the process, and need not be done "on the fly”.
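- A minimal sketch of replacing a space-to-depth operation with a single strided convolution is given below; the fixed kernel initialisation that exactly reproduces the rearrangement is one possible choice for illustration, and in practice the kernel would typically be left learnable.

```python
import torch
import torch.nn as nn

def space_to_depth_conv(channels, r):
    """A stride-r convolution with C*r^2 filters that reproduces a
    space-to-depth (PixelUnshuffle) rearrangement."""
    conv = nn.Conv2d(channels, channels * r * r, kernel_size=r, stride=r, bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        # Each output filter copies one (channel, dy, dx) position of the r x r block.
        for c in range(channels):
            for dy in range(r):
                for dx in range(r):
                    out_idx = c * r * r + dy * r + dx
                    conv.weight[out_idx, c, dy, dx] = 1.0
    return conv

x = torch.randn(1, 3, 8, 8)
conv = space_to_depth_conv(3, 2)
assert torch.allclose(conv(x), nn.PixelUnshuffle(2)(x))
```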
- An exemplary implementation of mimicking depth-to-space, i.e. upsampling, may be as follows.
- CONVOLUTIONAL LAYER SETUP Kernel Size: A kernel size is envisaged that aligns with the desired spatial expansion. For example, if the upscale factor is r, a larger kernel size (like 3x3 or larger) can be more effective in spreading out the information across a larger spatial area.
- the depth-to-space operation can be implemented with a single transposed convolution with stride equal to the upsampling factor. It is also possible to make a more expressive process by using larger kernel sizes and/or splitting a more extensive upsample into multiple stages. Stride: It is envisaged that the stride may be set to 1, ensuring a uniform spread of information.
- stride may be equal to the upsampling factor, as described above.
- Number of Filters This may be less than the original number of channels to reduce the channel dimension gradually. The exact number can vary depending on the architecture and desired output.
- SEQUENTIAL CONVOLUTION LAYERS Multiple convolutional layers may be advantageous, especially if the change from depth to spatial dimensions is significant. Each layer can gradually increase the spatial dimensions and reduce the depth.
- ACTIVATION FUNCTIONS It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially.
- An exemplary activation function may be ReLU, which can introduce non-linearity and help in learning spatial patterns.
- UPSAMPLING LAYERS Alongside the convolutional layers, upsampling layers (like nearest neighbor or bilinear upsampling) can be used to increase the spatial dimensions. These can be alternated with convolutional layers to progressively achieve the desired spatial expansion.
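- Similarly, a minimal sketch of replacing depth-to-space with a single transposed convolution whose stride equals the upsampling factor is given below; the identity-style kernel initialisation is again purely illustrative and the kernel would normally be learned.

```python
import torch
import torch.nn as nn

def depth_to_space_conv(channels, r):
    """A stride-r transposed convolution with C output filters that
    reproduces a depth-to-space (PixelShuffle) rearrangement."""
    conv = nn.ConvTranspose2d(channels * r * r, channels, kernel_size=r, stride=r, bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        for c in range(channels):
            for dy in range(r):
                for dx in range(r):
                    in_idx = c * r * r + dy * r + dx
                    conv.weight[in_idx, c, dy, dx] = 1.0
    return conv

x = torch.randn(1, 12, 8, 8)           # C*r^2 = 3*2*2 channels
conv = depth_to_space_conv(3, 2)
assert torch.allclose(conv(x), nn.PixelShuffle(2)(x))
```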
- any suitable kernel size, stride and filter numbers are envisaged, as are the number of optional sequential convolution layers, activation functions and other steps.
- Figure 9 shows an example sequence of layers of a neural network which takes an input image and downsamples it.
- the sequence of layers comprises a 3x3 Conv layer, a ReLU activation function, a space-to-depth (2x) operation, a 1x1 conv layer, another ReLU activation layer and finally a depth-to-space (3x) operation.
- This approach uses space-to-depth and depth-to-space.
- Figure 10 illustratively shows an example sequence of layers of a neural network or compression pipeline, for example one or more of the neural networks or compression pipelines shown in any of Figures 1 to 8 but where the space-to-depth and depth-to-space operations have been replaced by convolution operations.
- the depth-to-space replacement comprises a strided transposed convolution and the space-to-depth replacement comprises a strided convolution.
- This implementation substantially reduces the bottlenecks when running in resource-constrained environments such as on a CPU.
- Concept 3 Regional Kernel Absolute Deviation (RKADe) for Flow
- In image processing, Mean Absolute Difference (MAD) is a technique for detecting and numerically estimating differences between pixels and/or pixel patches, that is, differences between the values of one or more pixels in one image and the values of one or more pixels in another image.
- FIG. 11 illustrates an example of a flow module, in this case a network 1100, configured to estimate information indicative of a difference between an image at frame position t-1 and an image at frame position t, e.g. flow information.
- Figure 11 is provided as a non-limiting example of how such flow information may be calculated between two images.
- the flow module may be used in or together with the flow module part 206 ( Figure 3) of the compression pipeline.
- the network 1100 in Figure 11 comprises a set of layers 1101a, 1101b respectively for an image from frame position t-1 and an image from frame position t of a sequence of frames.
- the set of layers 1101a, 1101b may define one or more convolution operations and/or nonlinear activations (for example as described in concept 2 above) to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 x w/2 downsampling in a first layer, h/4 x w/4 downsampling in a second layer, h/8 x w/8 downsampling in a third layer, h/16 x w/16 downsampling in a fourth layer, h/32 x w/32 downsampling in a fifth layer, h/64 x w/64 downsampling in a sixth layer, and so on.
- a first cost volume 1102 is calculated at the coarsest level between the feature map pixels of the first image and the corresponding feature map pixels of the second image.
- Cost volumes define the matching cost of matching the pixels in one image with the pixels in a second image (which may be later in time, or earlier in time, for example due to the order in which B-frame processing typically occurs, which is not necessarily the chronological order of the frames).
- the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of similarity such as a vector or dot product, a cosine similarity, a mean absolute difference, or some other measure of similarity.
- This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the closeness metric is being calculated. This process is computationally expensive in floating point space but can be implemented efficiently in integer or fixed point space.
- a first flow 1103 can be estimated from the first cost volume 1102. This may be achieved using, for example a flow extractor network which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person.
- the same process is then repeated for the other levels of feature map coarseness to calculate a second cost volume 1104 and second flow 1105, and so on for the cost volumes and flows associated with each of the levels of coarseness until they have all been calculated, up to the final cost volume 1106 and flow 1107.
- the weights and/or biases of any activation layers in network 1100 are trainable parameters and can accordingly be updated during training either alone, or in an end to end manner with the rest of the compression pipeline.
- the trainable nature of these parameters provides the network 1100 with flexibility to produce feature maps at each level of spatial resolution (i.e. pyramid feature maps) and/or at the flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on).
- this allows the network 1100 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have been expected to be output by a dedicated flow encoder module.
- the flow of the previous level or levels of coarseness or resolution may be used to warp 1108, 1109, the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t - 1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details.
- the flow estimation output may be upsampled 1110, 1111 (for example using the methods of concept 2, or using any other upsampling method) first to match the coarseness resolution of the feature map to which the flow is being applied in the warping process.
- the outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof.
- the flow or representation thereof may then be transmitted in a bitstream and decoded by a flow decoder, the output of which may in turn be used to warp a previously decoded image for use in the residual encoder/decoder 1110 arrangement as shown in e.g. Figure 3.
- the cost volumes may be used to compute local (translational) alignment through patch-wise comparisons. For example, let X, Y ∈ R^(c×H×W) be two tensors each with c ∈ N channels and spatial dimensions (H, W) ∈ N².
- a cost volume may then be constructed as C(p, d) = ⟨X(p), Y(p+d)⟩ for a pixel position p and local offset d, where ⟨·,·⟩ is the inner product over the channel dimension; that is, the cost volume is constructed by computing the channel-wise correlation. MAD may be used in the calculation of cost volumes as follows.
- let P : R^(c×H×W) → R^(c×(2r+1)²×H×W) be a patch operator that associates to each pixel the (2r+1) x (2r+1) patch of values centred at that pixel, for an integer radius r ≥ 0. A MAD-based cost volume may then be formed by taking the mean absolute difference between corresponding patches of the two tensors.
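- For illustration only, a straightforward (non-optimised) sketch of such a MAD-based local cost volume is given below; as a simplification, each pixel of one feature map is compared channel-wise to the pixels of the other feature map within a local search window (comparing full patches is a straightforward extension), and the text below describes why naive implementations of this kind are inefficient on CPU.

```python
import torch
import torch.nn.functional as F

def mad_cost_volume(x, y, radius=3):
    """Mean absolute difference between each pixel of x and the pixels of y
    within a (2*radius+1)^2 search window.

    x, y: tensors of shape (B, C, H, W).
    Returns a cost volume of shape (B, (2*radius+1)^2, H, W).
    """
    b, c, h, w = x.shape
    y_pad = F.pad(y, (radius, radius, radius, radius))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            y_shift = y_pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((x - y_shift).abs().mean(dim=1))  # mean over channels
    return torch.stack(costs, dim=1)
```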
- the MAD-based cost volume estimation is more computationally efficient than other known cost volume estimation methods and accordingly synergistically helps to reduce run times of the flow estimation of an AI-based compression pipeline.
- the present inventors have found that implementing the operations used to perform MAD calculations at the machine level has a number of downsides, particularly when the MAD calculations are implemented using convolutions. Specifically, an input tensor of arbitrary dimensions is typically stored in non-contiguous blocks of memory.
- the values of the spatial dimensions H and W for one channel of an input tensor may be stored in a first block of memory while the values for another channel may be stored in a second block of memory and so on.
- when a MAD value is estimated that involves this multi-channel input tensor, the values of the elements for one channel are accessed from one of the memory blocks, the values of the elements for another channel are accessed from another of the memory blocks, and so on. This means the number of memory access operations can be very high even for relatively simple operations.
- the strided sum operation comprises depth or channel-wise grouped convolutions (i.e. convolutions applied in depth-wise groups, each group taking as inputs the values of the spatial dimensions H and W stored in non-contiguous memory blocks). That is, the stride is in the depth (i.e. channel) dimension, necessitating the access and retrieval of the values of the elements stored in separate memory blocks.
- if the grouping of the convolutions is instead in the spatial dimensions rather than the depth dimension, the non-contiguous memory block problem still arises but now in the depth dimension.
- FIG. 12a A schematic of this strided sum based implementation of a MAD cost volume calculation is shown in Figures 12a, 12b and 12c. Illustrated in Figure 12a is a toy representation of input tensors 1200a, 1200b respectively associated with a first image and a second image.
- Each input tensor has 3 channels: channel 1 (Ch1), channel 2 (Ch2) and channel 3 (Ch3).
- a first step repeat interleaving 1201 is performed on the first image input tensor 1200a to produce a first intermediate output.
- an unfold convolution operation 1202 is performed on the second image input tensor 1200b to produce a second intermediate output.
- the unfold convolution operations are performed as group convolutions and accordingly each group is assigned its own memory block.
- an absolute difference is estimated between the intermediate outputs of the first step and the second step to produce a third intermediate output.
- the elements of the third intermediate output are stored in said respective, different memory blocks - in this case memory blocks 1, 2 and 3 (Mb1, Mb2, Mb3) to match the number of outputs.
- a strided sum 1204 is performed on the estimated absolute differences stored in the respective memory blocks 1, 2, and 3 to produce the MAD output cost volume tensor. Again by virtue of the group convolutions of the strided sum 1204 and by virtue of the memory block locations, this operation requires non-contiguous memory blocks to be accessed for each of the convolutions of the strided sum 1204, as is illustrated in Figure 12c.
- the goal of the cost volume is to construct a spatial comparison operator that encodes some notion of how two patches are related. For a given pixel position p, the cost volume thus has a measure of comparison between X(p) and Y(p+d) for a collection of local offsets d.
- the present inventors have realised that any effective encoding of this information can suffice in AI-compression pipelines because the neural networks that make up the flow encoder and/or decoder and residual encoder and/or decoder, and indeed any other neural networks that make up the AI-compression pipelines of Figures 1-14, are able to learn to accommodate the encoding of this information, regardless of how it is represented.
- this may be the comparison of a pixel in a first image to pixels in a radius around a corresponding pixel coordinate in a second image.
- a radius ⁇ local cost volume one must compare a pixel to (2 ⁇ + 1)2 reference pixels (e.g., for radius 3 there are 49 reference pixels in a 7x7 block).
- a compressive encoding of the cost volume c may look like: c ↦ A(c), where A is an appropriately chosen random matrix with A ∈ R^(m×N) satisfying m ∈ O(log N). From this, the present inventors have realised that it is possible to build a learned mapping that replaces the above-described classical, naive patch-wise comparison approach of estimating local cost volumes, which requires a large number of operations.
- the learned mapping effectively bypasses the direct computation of the local cost volume and instead computes the lower-dimensional compressively encoded cost volume A(c) directly.
- This compressively encoded cost volume, or a representation thereof produced by a post-processing step, contains substantially the same information as a traditional cost volume but this information is provided in a lower-dimensional representation that can still be passed to any subsequent, downstream components of the AI-compression pipeline that rely on cost volumes in the usual way. Examples of downstream components may include layers in the flow modules, the final estimation of flow, the warping in or after the flow module, and so on.
- the compressively encoded cost volume facilitates the estimation of image differences far more efficiently than traditional cost volume calculation methods.
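As a minimal sketch of the compressive encoding idea, assuming a fixed Gaussian random projection and an arbitrary choice of target dimension m (both assumptions, not values prescribed by the description), a dense cost volume could be projected channel-wise as follows:

```python
# Illustrative only: compressively encoding a dense local cost volume with a
# fixed random projection so that downstream layers operate on m << d channels.
import math
import torch

def compress_cost_volume(cost, seed=0):
    """cost: (B, d, H, W) dense cost volume with d = (2r+1)^2 offsets."""
    b, d, h, w = cost.shape
    m = max(1, 2 * math.ceil(math.log2(d)))            # assumed target dimension ~ O(log d)
    g = torch.Generator().manual_seed(seed)
    # Random projection, scaled so norms are preserved in expectation.
    A = torch.randn(m, d, generator=g) / math.sqrt(m)   # (m, d)
    # Applying A along the offset dimension is equivalent to a 1x1 convolution.
    return torch.einsum('md,bdhw->bmhw', A, cost)        # (B, m, H, W)
```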
- This approach is described hereinafter as Regional Kernel Absolute Deviation (RKADe) and allows for any MAD operations in the AI-compression pipeline to be substituted by RKADe operations which mitigate the memory interleaving issue described above in connection with using MAD and naive patch-wise comparisons for cost volume estimation, and generally provide a way to more efficiently estimate differences between images.
- consider a mapping, e.g. an input, an operation applied to that input, and an output, where the mapping can be almost identically represented in fewer dimensions.
- cost volumes are typically sparse in a transform of the spatial domain.
- cost volumes representing the difference between two images can be substantially simplified. For example, if a single pixel in one image is being compared to a pixel patch within a pixel radius of 3 around the corresponding coordinate in a second image, this would entail 49 pixel comparison operations to obtain the cost volume associated with that pixel using traditional methods. It turns out that the vast majority of these pixel operations are related to redundant information when the input and/or output are sparse as there is little or no difference.
- the cost volume can more efficiently be estimated by applying a fixed or learned map to the input to produce an identical or almost identical cost volume.
- Applying a map, e.g. a linear map implemented as a number of convolution operations, is computationally efficient and fast, and provides a lower-dimensional representation of otherwise the same information that would have been provided in a cost volume estimated using traditional methods.
- the compressive cost volumes that are computed using the RKADe approach come with an additional, substantial run-time saving, because any downstream tensor operations (TOPs) that take place are in the lower dimensions of the compressive cost volume compared to the higher dimensions of the traditional cost volume; and none of the steps comprising RKADe require grouped convolutions, slow memory access or array operations like de-interleaving.
- FIG 13 illustratively shows the above steps of RKADe.
- an illustrative RKADe workflow on a toy example comprises four elements: a feature map F, a region map R, a transform map T, and an absolute difference operation.
- the feature map F defined by one or more layers comprising one or more filters defined by a plurality of weights, operates as a linear embedding of the input tensor into a feature space. It is a local embedding in the sense that it may comprise a 1x1 convolution. Efficient implementations of 1x1 convolutions on a wide variety of commercial CPUs, GPUs and NPUs are known to the skilled person.
- the feature map operation filters may comprise random weights, and do not need to be trained (although it is envisaged that they may be trained in some circumstances).
- the feature map operation weights may be randomly distributed weights, and the operation relies on the favourable properties of high-dimensional random embeddings to preserve local geometry.
- the feature map F operation filters are instantiated using random weights with a normalisation that makes it an isometry in expectation. That is, the feature map operation filters comprise weights that apply a transformation that, on average, preserves geometrical distances of the distribution it is applied to. In other words, the feature map operation preserves a norm of the inputs in expectation.
- when a convolution defining a feature map is an isometry in expectation, the norm of the input to which the convolution is applied is preserved in the output (in a probabilistic sense).
- These fixed, random weights of F and the isometry in expectation property of F effectively mean that, during inference, estimating differences between two images comprises applying a convolution with the random weights that correspond to the random weights the convolution was initialised with, rather than weights modified in some way during subsequent training.
- An example of a suitable random distribution of weights of F is any suitably initialised sub-gaussian distribution.
- initialised means initialised based on the shape of the input and/or output tensors.
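A minimal sketch, assuming PyTorch, of instantiating such a fixed, random 1x1 feature map; the Gaussian distribution and the 1/sqrt(c_out) scaling are one standard way to obtain an isometry in expectation and are assumptions rather than the specific normalisation used in the description.

```python
# Hypothetical instantiation of the feature map F as a fixed random 1x1 convolution.
import math
import torch
import torch.nn as nn

def make_feature_map(c_in, c_out):
    f = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
    with torch.no_grad():
        # For a 1x1 kernel, E[||W x||^2] = ||x||^2 when each weight ~ N(0, 1/c_out),
        # i.e. the map preserves the input norm in expectation.
        f.weight.normal_(mean=0.0, std=1.0 / math.sqrt(c_out))
    f.weight.requires_grad_(False)   # kept fixed, per the description above
    return f
```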
- the region map R operation comprises a composition of 3x3 convolutions, optionally with intermediate non-linearities such as one or more ReLU maps or activations.
- the purpose of the region map R is to entangle, in an output pixel, the information present in a local patch about the same pixel in the input (reference) image.
- entangling information means combining information.
- the "radius" of the local patch is determined by the number of convolutions in R (e.g., 3 convolutions gives a patch radius of 7). Because R is comprised of 3x3 convolutions and optionally simple non-linearities, it is possible to use known, efficient 3x3 convolution implementations to run efficiently on a wide variety of commercial CPUs, GPUs and NPUs.
- R is instantiated using a weight normalisation that makes it an isometry in expectation.
- the transform map T operation serves as a post-embedding or post-processing of the feature embedding and the entangled patch information, permitting the effective comparison of the two.
- This transform map T operation can be a simple 1x1 convolution, thereby being local, linear, fast, and efficient to implement across a wide variety of commercially available CPUs GPUs and NPUs.
- T may be instantiated using a weight normalisation that makes it an isometry in expectation.
- the weights of the transform map T may correspond to those of the feature map F. Further, the number of input channels for F can naively be set to any positive integer.
- the number of output channels of F may match that of R and hence the number of input and output channels of R must be the same.
- the number of input channels of T does not need to match its number of output channels. Note for completeness that the number of input channels of F can be set naively because, if the mathematical relationships of RKADe are to hold in practice, it is envisaged that there is a sufficient dimensional relationship between the pixel radius (encoded by the number of layers of R) and the number of output channels of F. This ensures the shapes of the objects being compared match when the absolute difference is subsequently calculated.
- the feature map F is applied to a first input tensor representation 1300a of a first image, for example a current frame x_t of a sequence of images, and a second input tensor representation 1300b of a second image, for example a previous frame x_ref of the sequence, which may contain movement relative to the first image.
- for each pixel coordinate of the first input tensor 1300a, a comparison will be made with pixels at coordinates within a predetermined radius of that coordinate in the second input tensor 1300b.
- An illustrative toy example of a 1 pixel radius around a center pixel coordinate is indicated with the dotted borders in Figure 7.
- the feature map convolution operation F is applied to the pixels of the first input tensor 1300a and the associated patches of the predetermined radius in the second input tensor 1300b.
- the region map R convolution operation is then applied to the output of the feature map convolution operation on the second input tensor 1300b.
- a transform map T convolution operation is then optionally applied to the intermediate outputs, before an absolute difference is estimated, resulting in the RKADe cost volume tensor. It will be appreciated from Figure 13 that none of the feature map F convolution, region map R convolution or transform map T operation require grouped convolutions.
- the intermediate outputs may be easily stored in contiguous memory blocks without requiring a large number of memory read and write operations to interleave or de-interleave the data in memory.
- cost volume estimations in the RKADe approach are substantially sped up compared to traditional, naive patch-wise comparison approaches.
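The following PyTorch sketch illustrates an RKADe-style module built from the F, R and T maps and a final absolute difference, as described above. The channel counts, the depth of R, the presence of ReLU non-linearities and the decision to apply T to both branches are illustrative assumptions, not details prescribed by the description.

```python
# Hedged sketch of a compressive, RKADe-style cost volume module.
import torch
import torch.nn as nn

class RKADe(nn.Module):
    def __init__(self, c_in=3, c_feat=32, radius=3):
        super().__init__()
        # F: local (1x1) linear embedding into the feature space.
        self.F = nn.Conv2d(c_in, c_feat, kernel_size=1, bias=False)
        # R: composition of 3x3 convolutions with intermediate non-linearities;
        # `radius` 3x3 layers give a (2*radius+1) x (2*radius+1) receptive field.
        layers = []
        for _ in range(radius):
            layers += [nn.Conv2d(c_feat, c_feat, kernel_size=3, padding=1, bias=False),
                       nn.ReLU()]
        self.R = nn.Sequential(*layers[:-1])   # drop the final non-linearity
        # T: optional 1x1 post-embedding applied before the comparison.
        self.T = nn.Conv2d(c_feat, c_feat, kernel_size=1, bias=False)

    def forward(self, cur, ref):
        # Embed both frames, entangle the reference patch information with R,
        # post-process with T, then take an absolute difference. No grouped
        # convolutions or de-interleaving are required.
        a = self.T(self.F(cur))
        b = self.T(self.R(self.F(ref)))
        return (a - b).abs()                    # compressive cost volume
```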
- F, R and/or T may be kept fixed (e.g. F’s weights may be fixed and randomly distributed between minimum and maximum values, as described above), or may be trained. Keeping the maps fixed facilitates straightforward deployment by substituting any naive patch-wise comparisons in AI-compression pipelines, (e.g. a large number of MAD-based operations).
- F is a linear map operating pixel-wise (i.e. a 1x1 convolution, as described above).
- RKADe runs on hardware without the requirement of grouped convolutions, and thus solves the slow memory access or array operations like de-interleaving of cost volume approaches such as naive patch-wise MAD.
- F, R and/or T may be individually or separately trainable, for example by setting a "requires_grad_" PyTorch or similar flag in their respective code-level implementations to permit backpropagation through them.
- F is generally kept fixed while R and/or T are trainable.
- they may be trained in an end-to-end manner with the rest of the pipeline, whereby the weight values are simply additional parameters, on top of the other parameters of the pipeline, that may be updated during back-propagation. More specifically, it has been found that end-to-end training requires no special auxiliary loss terms to guarantee stability during training.
- the F, R and T maps are "plug-and-play" onto the rest of the AI compression pipeline during training and subsequent inference.
- student-teacher training of the weights of the R and T maps (and optionally F) is also effective and achieves good training stability out of the box without difficulty.
- Multiple approaches to student-teacher training are envisaged.
- the teacher may be set up to push the training towards a fine-grained level that represents similar features to classical cost volumes, or towards a less granular level where the teacher comprises a flow network with MAD cost volumes and the student comprises a flow network with RKADe cost volumes, with the loss based on the difference between these two.
- the teacher may be set up to push the training towards some other objective and may be set up accordingly.
- the weights of F, R and/or T may be frozen at different times during these stages by setting the "requires_grad_" flag appropriately at different training steps.
- R and T may be frozen with initialisation weights during the initial pre-training phase of the compression pipeline before being unfrozen and trained during the main training phase. This approach ensures that the weights of R and T are being updated based on a stronger training signal from the rest of the neural networks of the compression pipeline in order to decrease overall convergence time, thereby speeding up training.
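As an illustration of such a schedule, assuming PyTorch modules named rkade.F, rkade.R and rkade.T (the names used in the earlier sketch) and an arbitrary step threshold, freezing and unfreezing could be driven by the training step counter:

```python
# Hypothetical freeze/unfreeze switch for the RKADe maps; module names and the
# step threshold are assumptions, not values taken from the description.
def set_rkade_trainable(rkade, step, unfreeze_step=50_000):
    trainable = step >= unfreeze_step               # frozen during pre-training
    for module in (rkade.R, rkade.T):
        for p in module.parameters():
            p.requires_grad_(trainable)
    for p in rkade.F.parameters():                   # F is generally kept fixed
        p.requires_grad_(False)
```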
- the initialisation weights of F, R and/or T may be random, based on some predetermined distribution, or based on prior knowledge obtained from experimentation to provide a warm-started signal from the rest of the model at the point when one or more of F, R and/or T unfreeze and become trainable.
- the initialisation weights may be initialised using any appropriately normalised sub-gaussian distribution producing a map that is an isometry in expectation (for example, a Gaussian distribution, a truncated Weibull distribution, or a uniform distribution).
- the property of being an isometry in expectation is maintained as the weights of F, R and/or T as applicable are adjusted.
- This property may be enforced using Jacobian regularisation, such as that described in concept 5 below.
- the isometry preserving property may only be retained upon initialisation, and training will gradually eliminate that property as the weights converge to some final values.
- some of the weights of F, R, and/or T may be kept fixed, while others may be trained, for example to avoid significant departure from the isometry preserving property during training.
- the weights of F, R and/or T may be initialised by generating randomly distributed values uniformly distributed between a minimum value and a maximum value, whereby the minimum and maximum value are based on a number of input or output channels on which F, R and/or T are applied, based on a kernel size of F, R and/or T and/or on a pixel radius across which the RKADe cost volume is to be calculated (e.g. a 3 pixel radius resulting in a comparison against a 7x7 pixel patch centered on a given pixel coordinate).
- the minimum and maximum values are based on the dimensions of the input tensor.
- a filter comprising weights distributed as above preserves norm in expectation of the input to which it is applied.
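One possible realisation of such an initialisation is sketched below, assuming the bounds are derived from the filter's fan-in (input channels times kernel area); the exact bound formula is not specified by the description, and other radius-dependent normalisations would fit it equally well.

```python
# Hedged sketch: uniform initialisation whose bounds depend on the filter shape,
# chosen so that per-output variance is preserved in expectation (a standard
# fan-in scaling). The specific bound formula is an assumption.
import math
import torch

def init_uniform_(weight):
    """weight: convolution filter of shape (c_out, c_in, k, k)."""
    c_out, c_in, k, _ = weight.shape
    fan_in = c_in * k * k
    bound = math.sqrt(3.0 / fan_in)   # Var(U[-b, b]) = b^2 / 3 = 1 / fan_in
    with torch.no_grad():
        weight.uniform_(-bound, bound)
```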
- Jacobian regularisation may be applied during training to ensure the isometry in expectation is preserved even as the weights are updated. Alternatively, the weights may be free to lose this property if the training results in them doing so.
- the training regime may be implemented with one or more switches that specify exactly when during training to freeze or unfreeze the trainable parameters of F, R and/or T based on some predetermined one or more conditions (such as a number of iterations or a loss threshold, or other conditions).
- naive patch-wise comparisons are slow on typical commercial hardware.
- the patch-wise comparisons introduce a bottleneck to run time. This is particularly problematic for technical domains where low-latency and fast run times are critical to functionality. Accordingly run-time advantages are realised for any use case that is presently implemented with patch-wise comparisons (using any measure of similarity), for example any use cases currently implemented as a MAD-based patch-wise comparison.
- RKADe is faster than naive patch-wise comparison operations, in any computer vision and/or image processing task because it is a compressive approximation of such a patch-wise comparison.
- a first example use case where run-time improvements may be realised with RKADe is in the generation of bounding boxes around image patches where movement is to be detected. Consider a first image at a first time and a second image temporally separated from the first image. The objects in the second image have moved relative to their positions in the first image. In computer vision tasks such as surveillance, satellite image comparisons, drone navigation, and others, detecting such movement is a common task as it facilitates tracking of objects across different views and across time.
- One approach to such detection is to generate a bounding box around objects whose pixels differ between the first and second images.
- One approach to generating such bounding boxes is to divide the first and second images into grids, and to compare the pixel values in the first image to individual pixels or groups of pixels in the second image.
- One way to make this comparison is to use MAD implemented as convolutions. If the MAD for a given pixel or patch of pixels exceeds a threshold, that pixel or patch of pixels may be identified as a movement-containing patch (whereas any that don’t exceed the threshold may be identified as static patches). The boundaries of one or more bounding boxes may then be generated that encompass some or all of these movement-containing patches and used to identify the moving object across frames.
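A minimal sketch of the grid-based movement detection described above, written in Python/NumPy; the patch size, threshold and the single-box output are illustrative assumptions.

```python
# Hedged sketch: patches whose mean absolute difference (MAD) exceeds a
# threshold are marked as movement-containing, and one bounding box that
# encompasses all of them is returned.
import numpy as np

def movement_bounding_box(img_a, img_b, patch=16, threshold=8.0):
    """img_a, img_b: greyscale arrays of shape (H, W). Returns (y0, x0, y1, x1) or None."""
    h, w = img_a.shape
    moving = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            a = img_a[y:y + patch, x:x + patch].astype(np.float32)
            b = img_b[y:y + patch, x:x + patch].astype(np.float32)
            if np.abs(a - b).mean() > threshold:      # MAD over the patch
                moving.append((y, x))
    if not moving:
        return None                                    # all patches are static
    ys, xs = zip(*moving)
    return min(ys), min(xs), max(ys) + patch, max(xs) + patch
```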
- the bounding box generation approach may also be used in image and video compression pipelines including both traditional and AI-based compression pipelines. In this case, it may be used to facilitate partial frame-skipping to further reduce the amount of data that needs to be sent to reconstruct an image or image sequence. For example, if objects in two images have hardly changed except for a small number of pixels or pixel patches, the above-described bounding box generation approach may be used to identify and extract those movement-containing patches. Only these movement-containing patches then need to be compressed and transmitted in order for the full image sequence to be accurately reconstructed. That is, on the decode side, the original image sequence can be constructed efficiently by stitching together previously decoded static image patches with the newly received movement-containing patches.
- applying RKADe to estimate differences between pixel patches instead of a naive patch-wise comparison facilitates a substantial run-time improvement.
- Other non-limiting use cases where applying RKADe instead of naive patch-wise comparison results in run time improvement include: Image Registration: In medical imaging or remote sensing, images taken at different times or from different sensors need to be aligned or "registered" to each other.
- the alignment or registering of images to each other with naive patch-wise comparison (for example where MAD is used as a similarity metric to align these images accurately by finding the transformation that minimizes the average absolute intensity differences between them) can be replaced by the present RKADe approach for run-time improvements.
- Stereo Vision and Depth Estimation When calculating depth from stereo images, naive patch-wise comparisons using MAD are typically used to compare corresponding patches in the left and right images. The disparity (difference in horizontal position) that minimizes the MAD is often chosen as the correct match, which is then used to compute depth information. Here, the replacement of the naive, MAD-based patch-wise comparison results in run-time improvements in calculating depth.
- Template Matching In object detection and computer vision, template matching involves sliding a template image over a target image to find the region that best matches the template. A naive, patch-wise comparison, MAD-based is typically used as a measure to find the location where the template and the target image have the least absolute difference, indicating a potential match.
- Noise Reduction In image denoising, MAD can be used to compare the local neighborhood of pixels. Filters like the median filter or adaptive filters use MAD to determine the level of noise in a local patch and to adjust the filtering strength accordingly to reduce noise while preserving details. This initial determination of noise levels in a local patch can be achieved faster by applying the RKADe approach.
- Quality Assessment For quality control in manufacturing, naive patch-wise comparisons are used to compare images of a product against a standard reference image. Differences beyond a certain threshold can indicate defects or deviations from the desired quality. Again, this may instead be implemented with RKADe to provide run time speed ups and the facilitation of running quality control algorithms on edge devices that do not have significant computing power.
- Change Detection In satellite imagery analysis, naive patch-wise comparison can be used to detect changes over time by comparing pixel intensities of the same location across different dates. This is useful in monitoring urban development, deforestation, or the effects of natural disasters.
- the use of RKADe instead of a naive patch-wise comparison facilitates the running of such change detection algorithms more quickly, thus allowing real-time change detection on resource-constrained devices.
- Photogrammetry In reconstructing 3D models from 2D images, naive patch-wise comparisons can be used to ensure that the matching of pixels across multiple images is accurate, which is crucial for generating a reliable 3D representation.
- Some non-limiting factors that influence training stability include the choice of the optimization algorithm (e.g. stochastic gradient descent, momentum, adagrad, adam, and so on), the learning rate, the initialization of network weights, the network architecture, the quality and pre-processing of the input data and so on.
- a Jacobian regularisation term or penalty in the context of training neural networks is a method used to control or influence the behavior of the model by regularizing its sensitivity to input changes.
- the Jacobian matrix represents the partial derivatives of the model’s outputs with respect to its inputs, effectively capturing how changes in the input affect changes in the output.
- these partial derivatives are the network’s Jacobian matrix.
- a norm of the Jacobian matrix is calculated and added to the loss function.
- Present concept 5 is directed to solving this problem by introducing a special type of Jacobian penalty term into the loss function.
- let n ≥ 1 be an integer, S^(n-1) denote the (n-1)-sphere, and [n] denote the ordered set {1, 2, ..., n}.
- let f = (f^(1), ..., f^(m)) : R^n → R^m be an almost-everywhere continuously differentiable function whose domain X ⊂ R^n is compact.
- x_t may be a frame at time t
- x_{t-1} may be a frame at time t-1
- x̂_{t-1} may be a previously reconstructed frame from time t-1
- f may be the function corresponding to the neural networks of the AI-based compression pipeline that we are training.
- the training objective becomes the task of learning weights that produce a fixed point set for mappings that act on low-motion input frames, effectively producing a network that acts like an identity operator on a previously decoded frame when the current frame is substantially the same as or similar to the previously decoded frame.
- the desired fixed point manifold can be well approximated by a deep-learning based temporal sequence model permitting stable long-term temporal dependence modelling with recurrent implementations.
- an L-Lipschitz function f that is differentiable at a point x_0 satisfies, for a unit direction vector u: |∂_u f^(i)(x_0)| ≤ L.
- it is assumed that the domain X is compact and that f is everywhere continuous and almost everywhere continuously differentiable on X.
- Our goal is to encourage the residual decoder to learn weights that perfectly reconstruct a previously decoded frame when the current frame is substantially identical (i.e. no motion between the frames of the sequence).
- the residual decoder of the actual pipeline receives as input a latent tensor and an optical flow information tensor (e.g. in the form of a warped, previously decoded image), and produces as output the reconstructed frame.
- the input space does not match the output space (i.e. the number of variables, the forms, shapes and dimensions of the inputs and outputs are different because the output is only the reconstructed frame and not the latent tensor).
- if we calculate a Jacobian penalty based on this (i.e. based on or approximated from a matrix of partial derivatives of how the output changes with respect to changes in the input), and add this Jacobian penalty to the loss function, the weights will not generally converge to values that produce our goal of a residual decoder that perfectly reconstructs the previously decoded frame when the current frame is substantially identical.
- instead, an auxiliary function is constructed that: (i) is based on the residual decoder, i.e. a function that operates identically to the residual decoder during a forward pass and accordingly similarly receives as inputs a latent tensor and an optical flow information tensor in the form of a warped input image, but (ii) not only returns the reconstructed image, but also returns as output the original input latent tensor.
- the input space of this auxiliary function comprises the latent tensor and the warped input image, and the output space comprises the latent tensor and the reconstructed image, so the input and output spaces match.
- When we subsequently calculate the Jacobian penalty from this function, the latent tensor will appear as both a variable considered an input and a variable considered an output when the partial derivatives of the Jacobian matrix are being estimated or approximated. In layman's terms, the effect of this is the moderation of significant changes to the latent tensor as sequences progress. In more formal terms, the above-described mathematical relationships apply and the weights converge to values during training that, during inference, exhibit the behaviour of perfectly reconstructing a previously decoded frame when the current frame is substantially identical.
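A minimal sketch, assuming PyTorch, of such an auxiliary mapping: it runs the same forward pass as the residual decoder but additionally returns the input latent, so that the input space (latent, warped frame) matches the output space (latent, reconstructed frame). The function and argument names are assumptions.

```python
# Hypothetical auxiliary mapping around the residual decoder.
import torch

def residual_decoder_aux(residual_decoder, latent, warped_prev):
    """residual_decoder: the pipeline's residual decoder network (assumed interface)."""
    reconstruction = residual_decoder(latent, warped_prev)
    # The latent is passed straight through as an additional output of the mapping.
    return latent, reconstruction
```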
- Algorithm 1: AI-based compression network training with residual decoder auxiliary mapping Jacobian penalty. Inputs: training dataset X, learning rate η, regularization parameter λ, number of epochs E, network architecture f_θ.
- Compute loss: Loss ← D + R + λ·J_aux; backward pass to compute gradients ∇_θ Loss.
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J_aux(y, x̂_{t-1} ↦ y, x̂_t) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss can be calculated and the training loop is repeated until the predetermined number of steps or epochs E have been completed, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- the Jacobian penalty J_aux(y, x̂_{t-1} ↦ y, x̂_t) is based on an auxiliary mapping function that includes the latent tensor y on both sides of the mapping so that the input space matches the output space, thus resulting in convergence to a set of weights that exhibit the desired temporal stability for frame sequences.
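An illustrative training step following the loop described above is sketched below. The model interface, the distortion and rate stand-ins and the lambda value are assumptions; residual_decoder_aux is the auxiliary mapping sketched earlier, and the jacobian_penalty helper is sketched further below in the finite-difference discussion.

```python
# Hedged sketch of one training step with the auxiliary-mapping Jacobian penalty.
import torch
import torch.nn.functional as F

def train_step(model, optimiser, x_t, x_prev_warped, lam=0.1):
    latent, x_hat = model(x_t, x_prev_warped)            # forward pass of the pipeline (assumed interface)
    D = F.mse_loss(x_hat, x_t)                            # distortion term (placeholder metric)
    R = latent.abs().mean()                               # stand-in for the true rate term
    aux = lambda z, w: residual_decoder_aux(model.residual_decoder, z, w)
    J_aux = jacobian_penalty(aux, latent, x_prev_warped)  # auxiliary-mapping Jacobian penalty
    loss = D + R + lam * J_aux
    optimiser.zero_grad()
    loss.backward()                                       # backward pass
    optimiser.step()                                      # e.g. SGD or Adam
    return loss.detach()
```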
- while the Jacobian penalty term is shown to be based on the mapping (y, x̂_{t-1} ↦ y, x̂_t), it is envisaged that each of these variables may be processed in some way without losing the general applicability of the above-described mathematical property of the input space matching the output space.
- for example, where the previously reconstructed image is warped, the mapping may be based on this warped image, e.g. x̂_{t-1,warped}, and so on.
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J(a, b ↦ a, c), where a, b and c are pipeline variables as discussed below, is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty.
- some or all of the terms may be regularised by some constant λ_1 and/or λ_2.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss may then optionally be calculated and the training loop is repeated until the predetermined number of steps or epochs E have been completed, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- a, b and c can be any variables or sets of variables from an AI-based compression pipeline including, but not limited to, one or more of: a current input image x_t, a reference input image x_{t-1} or x_{t+1}, a reference previously reconstructed image (warped or unwarped) x̂_{t-1}, x̂_t or x̂_{t+1}, x̂_{t-1,warped}, x̂_{t,warped} or x̂_{t+1,warped}, a flow f_t, and/or any latent representation thereof, including using any of these variables in any combination in upsampled and downsampled spaces as applicable.
- an auxiliary mapping from which to estimate a Jacobian penalty may be constructed by using, for example: J(ŷ_flow, x̂_{t-1} ↦ ŷ_flow, x̂_t). That is, the reconstructed flow latent is provided on both sides of the auxiliary mapping from which the Jacobian penalty is calculated.
- the network weights converge to a set of values where flow is encouraged to be reconstructed in a way that varies little across a sequence of frames. Effectively the networks are learning the set of weights that have a fixed point for the desired auxiliary mapping behaviour.
- a similar approach can be taken for any and all of the example variables referred to above to encourage the networks of the AI-based compression pipeline to learn how to behave in a temporally consistent way for frame sequences. It is also envisaged that multiple different Jacobian penalties based on such mappings may be introduced into the loss term, for example J_1, J_2, ..., J_k as applicable, to encourage the learning of temporally consistent behaviour by any number of components of the AI-based compression pipeline.
- any weighted norm of any such Jacobian penalty or penalties may be incorporated into the AI-based compression pipeline.
- a weighted norm of the Jacobian multiplies the components of the matrix using weights, computed in some manner relevant to the task at hand, such that the resulting "weight and norm" operation still satisfies the mathematical definition of being a (quasi) norm.
- weights may be computed for example as a function of the amount of motion information that is present in the image; or according to metrics that define the presence of occlusion between two frames.
- the motion information may be captured indirectly in the flow information estimated by the flow module of the pipeline, or by some other measure such as, for example, a direct pixel difference calculation between the pixels of two or more images.
- the above-described approach can increase network training times significantly, for example by 30% or more.
- the presently described Jacobian penalties may be estimated using the following approach. It will be understood that the Frobenius norm of a square matrix A ∈ R^(n×n) also controls its operator norm.
- for a direction v and a small step size ε, define the finite difference (f(x + εv) − f(x)) / ε and observe that this provides a one-sample estimate of the Jacobian-vector product of a function f.
- the Jacobian penalty may be calculated by estimating the partial derivatives of the network’s outputs with respect to its inputs, which together form a Jacobian matrix.
- the penalty may be estimated by calculating the norm of this matrix (or the trace of a related matrix - as is known in the art).
- calculating the norm (or trace, as applicable) of the Jacobian matrix is computationally very expensive. Instead we use an approach based on Hutchinson's trace estimator and finite differences. In the specific case of AI-based compression pipelines, it turns out that we can get a good trace estimate by making a 1-sample approximation.
- the 1-sample approximation using finite differences facilitates the estimation of a Jacobian penalty in a way that is significantly faster than traditional methods and facilitates training with multiple Jacobian penalties without significantly increasing training times. That is, the time attributed to estimating the Jacobian penalty is an insignificant fraction of the overall training time. This in turn enables training with Jacobian penalties applied to multiple components of the pipeline to encourage temporally stable behaviour while keeping overall training times substantially the same thereby reducing overall cost per training run.
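A hedged sketch of this 1-sample finite-difference estimate: a random unit direction is drawn, the Jacobian-vector product of the auxiliary mapping is approximated by a finite difference, and its squared norm is used as a Hutchinson-style estimate of a squared norm of the Jacobian (up to a constant scale factor). The epsilon value and the aux_fn interface are assumptions.

```python
# Hypothetical 1-sample finite-difference Jacobian penalty.
import torch

def jacobian_penalty(aux_fn, latent, warped_prev, eps=1e-2):
    """aux_fn(latent, warped) returns a tuple of output tensors, e.g. (latent, reconstruction)."""
    # Random direction over the flattened input space, normalised to unit length.
    v_lat, v_warp = torch.randn_like(latent), torch.randn_like(warped_prev)
    norm = torch.sqrt(v_lat.pow(2).sum() + v_warp.pow(2).sum())
    v_lat, v_warp = v_lat / norm, v_warp / norm
    # Finite-difference approximation of the Jacobian-vector product J v.
    out0 = torch.cat([t.flatten() for t in aux_fn(latent, warped_prev)])
    out1 = torch.cat([t.flatten() for t in aux_fn(latent + eps * v_lat,
                                                  warped_prev + eps * v_warp)])
    jvp = (out1 - out0) / eps
    return jvp.pow(2).sum()     # single-sample estimate added to the loss
```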
- the Jacobian penalty described above can itself be regularised based on a property of the frames of the sequence being trained on. That is, the Jacobian penalty may be made "motion-aware" by weighting it according to a property of the frames of the sequence being trained on, such as how much movement there is between frames. This movement may be captured indirectly in the Jacobian matrix from which the Jacobian penalty is calculated, and accordingly the penalty may comprise a weighted norm w(J_aux), where the norm may comprise e.g. a Frobenius norm, a spectral norm, or any other norm.
- the weighting may scale the Jacobian penalty based on the combined strength of all the partial derivatives it contains, which will be higher in high motion frame sequences.
- the movement may be captured directly and the mapping that encodes the weighting w of the Jacobian may be based on e.g. pixel differences, MSE, or some other measure of differences between the frames at time t and some other time t-1.
- the idea is that if the motion between two frames x_t and x_{t-1} is large, then we want the Jacobian penalty described above to be dampened to a lower value.
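A minimal sketch of this motion-aware damping: the penalty is scaled down when the direct pixel difference (here MSE) between frames is large. The 1/(1 + alpha·mse) form and the alpha value are assumptions; any monotonically decreasing weighting would fit the description.

```python
# Hypothetical motion-aware weighting for the Jacobian penalty.
import torch

def motion_weight(x_t, x_prev, alpha=10.0):
    mse = (x_t - x_prev).float().pow(2).mean()
    return 1.0 / (1.0 + alpha * mse)          # large motion -> small weight

# Usage sketch: loss = D + R + lam * motion_weight(x_t, x_prev) * J_aux
```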
- a learning rate η, regularisation parameter λ, and a number of training steps or epochs E is selected.
- the network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14.
- the network parameters θ are randomly initialised and then the training loop is started.
- the forward pass is computed for the network being trained f_θ, as well as the auxiliary function.
- a Jacobian penalty J(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty that in this case is weighted by w.
- the loss term may additionally be regularised using e.g. some constant λ_1 and/or λ_2 as described above.
- the backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD), or some other known optimiser.
- a validation loss may be calculated and the training loop is repeated until the predetermined number of steps or epochs ⁇ have been calculated, or some other criteria have been reached.
- the learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
- the Jacobian penalty here is weighted by a function w, where w is defined to act element-wise on the object returned by a Jacobian-vector product between the Jacobian matrix and some random vector (in practice a tensor) such that the weighted norm induced by w satisfies the mathematical definitions of a quasi-norm or pseudo-norm.
- a further difficulty that arises when training using the above-described Jacobian penalty is its interaction with different phases of training and different training schedules, for example the specific number of frames in a given sequence or "group of pictures" (GOP) used during training.
- a GOP is typically considered to be an I-frame and a number of P- or B-frames.
- a training schedule that starts training on short GOPs of 5-6 frames, and then after a predetermined number of steps switches to training on longer GOPs of 7 or more frames e.g. 8, 9, 10, 20, 30, 40, 50 frames or more.
- the Jacobian penalty described herein is ideally suited for encouraging the networks to be performant on said longer GOPs but can in some cases act as noise when performing the initial training on the shorter GOPs. This may be, for example, because the temporal consistency is less of a problem for shorter GOPs and so the Jacobian penalty actually weakens the strength of the training signal.
- Jacobian regularisation serves as an inductive bias to improve the network’s generalisation performance to sequences with GOP sizes that are significantly greater than that seen in training. Indeed, if the network is only evaluated on sequences with the same GOP seen in training, for example short GOP sequences, then temporal stability can be less problematic. However, it follows that, for temporal stability arising solely from training on different GOP sizes to be present in the wild, the training data necessary to achieve this would have to contain equal numbers of samples of all GOP sizes distributed equally across all video sequences and so on - something which is burdensome to obtain in real world settings. The presently described approach accordingly facilitates obtaining the same effect but without needing as complete a training data set.
- the present disclosure facilitates the generalisation of performance in the sense of temporal stability for large GOP sizes, specifically those that are significantly larger than what is seen during training, irrespective of what GOP sizes are in the training data.
- the presently described Jacobian penalty term may be introduced only at a predetermined time or times (e.g. in terms of number of training steps or consequential to changing training frame sequence length) during a training schedule.
- FIG 14 illustrates a further example of a flow-residual compression pipeline, such as that of Figure 3, whereby the representation of flow information that the residual decoder receives as input comprises a warped previously decoded image.
- this architecture corresponds generally to the flow-residual compression pipeline shown in Figure 3 and accordingly uses the same reference numbers for corresponding features.
- the flow module 1400 is shown to comprise a flow encoder 1401 that produces a latent representation of optical flow information y_flow which is decoded by a flow decoder 1402 into a reconstructed flow f̂_t, which in turn is used to warp a previously reconstructed image x̂_{t-1} to produce x̂_{t-1,warped}, which in turn is fed into the residual decoder 1413 as the representation of optical flow information.
- the residual decoder neural network 1413 uses a latent representation y of the current frame x_t, and the warped previously reconstructed image x̂_{t-1,warped}, to produce the reconstructed current frame x̂_t.
- the Jacobian penalty term described above may be implemented by constructing an auxiliary function that copies the operations of the residual encoder 1411 and/or residual decoder 1413 of the residual part 1410, including using the same inputs as the residual decoder 1413, but the auxiliary function also returns as an output the latent representation y that was one of its inputs, effectively simply passing the latent representation y 1412 directly through the function.
- the Jacobian penalty based on this auxiliary function with the latent being both an input and an output has the desired mathematical properties to encourage convergence to a set of weights that produce a network that behaves in a temporally stable manner for sequences of frames.
- the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
- the computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
- the computer storage medium is not, however, a propagated signal.
- data processing apparatus encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- the apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program may, but need not, correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code.
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- Computers suitable for the execution of a computer program include, by way of example, general purpose or special purpose microprocessors or both, or any other kind of central processing unit.
- a central processing unit will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices, that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable subcombination.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image.
Description
Method and data processing system for lossy image or video encoding, transmission and decoding BACKGROUND This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding. There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted. To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by the removal of information that results in changes to the decompressed image or video that
are not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files. In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue. There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels which take up fewer bits and infer the pixel values of the other pixels using information derived from spatial correlations. A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated, information is inefficient. Instead, only the blue sky pixel data for a subset of the frames is stored and the rest are inferred from information derived from temporal correlations. In the realm of lossy video compression in particular, the redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning their encoding and decoding process, playing different roles in achieving high compression ratios while maintaining acceptable visual quality. I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video. P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality. B-frames, or bidirectionally predictive frames, represent the highest level of compression. Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames
encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, it does not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted. An example of an AI based image compression process comprising a hyper-network is described in Ballé, Johannes, et al. “Variational image compression with a scale hyperprior.” arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference. An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference.
A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. Figure 3 of which shows an architecture that calculates optical flow with a flow model, UFlow, and encodes the calculated optical flow with a flow encoder, Eflow. SUMMARY According to an aspect of the present disclosure, there is provided a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image; updating the parameters of the first neural network and the second neural network based on the evaluated function; and
repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the method described above may comprise a third neural network for upsampling, and wherein the method may include updating the parameters of said third neural network based on the evaluated function. Optionally, the method described above may comprise a downsampler configured for either bilinear or bicubic downsampling. Optionally, the method described above may comprise a Gaussian blur filter in the downsampler. Optionally, the method according to any one described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. Optionally, the method described above may comprise a fourth neural network in the downsampler, and may further include updating the parameters of said fourth neural network based on the evaluated function.
Optionally, the method described above may comprise an upsampler configured for either bilinear or bicubic upsampling. Optionally, the method as described above may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, and (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. Optionally, the method described above may comprise entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length. Optionally, the method described above may comprise determining the difference between one or more of the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image based on the output of a fifth neural network acting as a discriminator. Optionally, the method of any one described above may comprise calculating the difference between one or more of: the output image and the
downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image. The difference may be expressed in terms of a mean squared error (MSE) and/or a structural similarity index measure (SSIM). Optionally, the method described above may comprise a term defining a visual perceptual metric. Optionally, the method described above may comprise a visual perceptual metric, wherein the term defining the metric comprises an MS-SSIM metric. According to an aspect of the present disclosure, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image, upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network, evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image, updating the parameters of the third neural network based on the evaluated function, and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
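By way of illustration only, the following is a minimal sketch of how training of a learned upsampler of the kind described in the preceding aspect might look, assuming a PyTorch implementation and placeholder architectures for the first, second and third neural networks; the placeholder modules, tensor sizes and the 2x upsampling factor are assumptions made for the sketch and are not features of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder first (encoder), second (decoder) and third (upsampler) networks.
encoder = nn.Sequential(nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.ReLU(),
                        nn.Conv2d(64, 8, 5, stride=2, padding=2))
decoder = nn.Sequential(nn.ConvTranspose2d(8, 64, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))
upsampler = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear"),
                          nn.Conv2d(3, 3, 3, padding=1))

# Freeze the first and second networks so that only the upsampler is trained.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad = False

optimiser = torch.optim.Adam(upsampler.parameters(), lr=1e-4)
training_images = [torch.rand(1, 3, 128, 128) for _ in range(4)]  # stand-in training set

for x in training_images:                 # first set of input images
    y = encoder(x)                        # latent representation
    x_hat = decoder(y)                    # output image, an approximation of x
    x_up = upsampler(x_hat)               # upsampled output image
    # Evaluate a function based on a difference involving the upsampled output;
    # here the target is simply a 2x-interpolated copy of the input image.
    target = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    loss = F.mse_loss(x_up, target)
    optimiser.zero_grad()
    loss.backward()                       # gradients reach only the third (upsampler) network
    optimiser.step()
```

In this sketch only the parameters of the upsampler receive gradient updates, which corresponds to the optional staged training in which the first and second neural networks are frozen after an initial number of steps.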
According to an aspect of the present disclosure, there is provided a method of training one or more neural networks for use in lossy image or video encoding, transmission and decoding. The method comprises receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image approximating the input image. Further steps involve evaluating a function based on differences between one or more of the images described above and updating the parameters of the fourth neural network based on the evaluated function. This process is repeated using a first set of input images to produce a first trained neural network and a second trained neural network. Optionally, the method described above may comprise producing the downsampled input image by performing bilinear or bicubic downsampling on the input image. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding. The method comprises the steps of receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image.
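For orientation, the encode/decode method of the preceding aspect might be sketched as follows, with quantisation, entropy coding and transmission of the bitstream omitted; the bilinear resampling, the 2x factor and the passed-in encoder and decoder modules are assumptions made for the sketch, not requirements of the disclosure.

```python
import torch.nn.functional as F

def compress_decompress(x, encoder, decoder, factor=2):
    """Sketch of: downsample -> encode -> (transmit) -> decode -> upsample."""
    x_ds = F.interpolate(x, scale_factor=1.0 / factor, mode="bilinear")  # downsampler
    y = encoder(x_ds)          # latent representation (first trained neural network)
    # ... quantisation, entropy encoding, transmission and entropy decoding omitted ...
    x_hat = decoder(y)         # output image (second trained neural network)
    return F.interpolate(x_hat, scale_factor=factor, mode="bilinear")    # upsampler
```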
According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation. Optionally, the method described above may comprise performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. The method as described above may optionally comprise performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. The method as described above may optionally comprise a downsampling operation that includes applying one or more convolutional layers with a kernel size based on a downsampling factor. These convolutional layers are configured to sequentially reduce the spatial dimensions of an input while increasing the depth or channel dimension of the input.
Optionally, in the method described above the input may comprise the input image. Optionally, the input may comprise a tensor representation of the input image. Optionally, the method described above may comprise a downsampling operation performed by applying one or more convolutional layers configured with a stride equal to the downsampling factor. The number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor. Optionally, the method may include performing a first convolution and a second deconvolution. This may further involve performing additional upsampling steps and utilizing additional layers such as max-pooling (Maxpool) and ReLU layers. Optionally, the method described above may comprise the input including a latent representation. Optionally, the method may comprise a tensor representation of the latent representation or the output image as part of its input. Optionally, the method as described may include upsampling layers having strides determined by an upsampling factor. Optionally, the method described above may further comprise applying an activation function after each convolutional layer in the upsampling operation.
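As an illustrative sketch only (in PyTorch, with the particular relationship between the downsampling factor, kernel size and filter count chosen for the example rather than mandated by the disclosure), convolution-only resampling of this kind, with no space-to-depth or depth-to-space rearrangement, might be built as follows.

```python
import torch
import torch.nn as nn

def conv_downsampler(in_channels: int, factor: int) -> nn.Module:
    # Strided convolution: spatial size is divided by the factor while the
    # channel (depth) dimension grows; kernel size and stride equal the factor.
    out_channels = in_channels * factor * factor
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),  # activation after the convolutional layer
    )

def conv_upsampler(in_channels: int, factor: int) -> nn.Module:
    # Transposed convolution with stride equal to the upsampling factor.
    out_channels = max(in_channels // (factor * factor), 1)
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, out_channels, kernel_size=factor, stride=factor),
        nn.ReLU(),
    )

x = torch.rand(1, 3, 64, 64)
y = conv_downsampler(3, 2)(x)   # -> shape (1, 12, 32, 32)
z = conv_upsampler(12, 2)(y)    # -> shape (1, 3, 64, 64)
```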
Optionally, the method described above may include the upsampling layers being selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, alternated with the convolutional layers. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding, transmission and decoding, comprising the steps of receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image. The estimating of optical flow information further comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. Optionally, the method comprises estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation.
Optionally, the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image. Optionally, a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image. Optionally, a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image comprise sparse distributions in a spatial domain. Optionally, the method described above may comprise assigning weights with values distributed according to a sub-Gaussian distribution. Optionally, the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and/or pixel radius across which said differences are estimated. Optionally, the method described above may comprise performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation.
Optionally, the method described above may comprise estimating a difference between an output of the second convolution operation and an output of the first convolution operation. Optionally, the difference may comprise an absolute difference. Optionally, the difference defines a cost volume. Optionally, the method described above may comprise using the optical flow information to warp a representation of the second image. Optionally, the method may involve estimating a difference between the warped second image and the input image in order to create a residual representation of the input image relative to the warped second image. Optionally, the method described above may comprise: (i) using a third neural network to encode the residual representation of the input image; (ii) transmitting the encoded residual representation of the input image to the second computer system; (iii) using a fourth neural network to decode the residual representation of the input image; and (iv) using the decoded residual representation of the input image to produce said output image. Optionally, the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation.
Optionally, a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. Optionally, the first convolution operation is defined by a 1x1 kernel. Optionally, the second convolution operation is defined by a 3x3 kernel. Optionally, the third convolution operation is defined by a 1x1 kernel. Optionally, the method described above may comprise performing the second convolution operation to entangle information associated with respective pixels of the representation of the input image with information associated with pixels adjacent to corresponding pixels in the representation of the second image. Optionally, the method described above may comprise a first, a second and, where present, a third convolution operation, wherein these operations are performed without group convolutions. Optionally, one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. Optionally, a distribution of pixel values of the input image and of the second image are sparse and incoherent in a spatial domain and/or a transform of a spatial domain.
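Purely as an illustration of how such a sequence of convolutions could be wired together (the exact arrangement, channel counts and weight-initialisation bound below are assumptions made for the sketch, not the arrangement required by the disclosure), a compressively encoded cost volume might be formed as follows.

```python
import torch
import torch.nn as nn

def random_projection_conv(in_ch: int, out_ch: int, radius: int) -> nn.Conv2d:
    # First convolution: 1x1 kernel, weights drawn uniformly between a minimum
    # and a maximum value. Deriving the bound from the channel count and the
    # pixel radius is one plausible choice, not necessarily the scaling used
    # in the method described above.
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
    bound = 1.0 / (in_ch * (2 * radius + 1)) ** 0.5
    nn.init.uniform_(conv.weight, -bound, bound)
    return conv

# No group convolutions are used in any of the three operations.
proj = random_projection_conv(in_ch=3, out_ch=16, radius=3)         # first convolution (1x1)
entangle = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False)  # second convolution (3x3)
mix = nn.Conv2d(16, 16, kernel_size=1, bias=False)                  # optional third convolution (1x1)

x_in = torch.rand(1, 3, 64, 64)    # representation of the input image
x_ref = torch.rand(1, 3, 64, 64)   # representation of the second image

f_in = proj(x_in)                         # compressive projection of the input image
f_ref = entangle(proj(x_ref))             # neighbouring pixels of the second image entangled
cost_volume = mix((f_in - f_ref).abs())   # absolute difference forms the cost volume
```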
According to an aspect of the present disclosure, there is provided a system configured to perform any of the above methods. According to an aspect of the present disclosure, there is provided a method for lossy image or video encoding and transmission. The method includes receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system. In this method, estimating the optical flow information comprises estimating differences between the input image and the second image by applying a first convolution operation on one or more pixels of a representation of the input image and/or on one or more pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. According to an aspect of the present disclosure, there is provided a method for lossy image or video decoding, comprising receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, wherein the optical flow information is indicative of a difference between a representation of the second image and a representation of the input image based on a compressively encoded cost volume; receiving optical flow information at a second computer system, wherein the optical flow information is indicative of a difference between a
representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, which approximates the input image. According to an aspect of the present disclosure, there is provided an apparatus configured to perform any of the above methods. According to an aspect of the present disclosure, there is provided a method for estimating a difference between a first image and a second image. The method comprises performing a first convolution operation on respective pixels of a representation of the first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by estimating a compressively encoded cost volume indicative of said differences. Optionally, the method may include performing a second convolution operation on an output of the first convolution operation, and estimating a difference between an output of the second convolution operation and an output of the first convolution operation. Optionally, the method described above may comprise performing a second convolution operation that entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent to corresponding pixels in the representation of the second image.
Optionally, the method described above may include a first convolution operation where one or more filters are applied with weights having values randomly distributed between a minimum value and a maximum value. Optionally, the method described above may comprise determining the minimum value and/or maximum value based on the number of channels of the input image and/or second image, kernel size of the first convolution operation, and pixel radius across which said differences are estimated. Optionally, the method described above may comprise a difference comprising an absolute difference. Optionally, the method described above may comprise defining a cost volume based on the difference. Optionally, the method described above may comprise applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. Optionally, the method described above may involve adjusting a kernel size of the second convolution operation to be larger than that of the first. Optionally, the method described above may comprise a first convolution operation defined by a 1x1 kernel.
Optionally, the method described above may include the step whereby the second convolution operation is defined by a 3x3 kernel. Optionally, the method described above may comprise the third convolution operation defined by a 1x1 kernel. Optionally, the method described above may comprise storing a plurality of respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. Optionally, the method described above may comprise using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches. According to an aspect of the present disclosure, there is provided a data processing apparatus configured to perform any of the above described methods. According to an aspect of the present disclosure, there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the
second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and/or third trained neural networks. Optionally, the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks. Optionally, at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable. Optionally, the method comprises producing the second variables from the first variables by mapping the first variables to the second variables. Optionally, the mapping is defined by an auxiliary function.
Optionally, the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function. Optionally, at least one input of said inputs to the auxiliary function is also an output of the auxiliary function. Optionally, the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space. Optionally, the input space matches the output space. Optionally, the auxiliary function is based on the third neural network. Optionally, the third neural network comprises a residual decoder neural network. Optionally, the at least one input to the auxiliary function that is also an output of the auxiliary function comprises said latent representation of the first image. Optionally, the method comprises weighting the Jacobian penalty term. Optionally, said weighting is based on a difference between the first image and the second image.
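As context for the norm-approximation options discussed below, a Jacobian penalty of this kind is often estimated stochastically rather than by forming the full Jacobian. The following is a minimal sketch (in PyTorch) of a single-sample estimate of the squared Frobenius norm of the Jacobian of an auxiliary function; the function name, the Gaussian probe distribution and the choice of the squared Frobenius norm are assumptions made for the sketch, not details taken from the disclosure.

```python
import torch

def jacobian_penalty(aux_fn, x):
    # Single-sample (Hutchinson-style) estimate of the squared Frobenius norm of
    # the Jacobian of aux_fn at x: E[||J^T v||^2] = ||J||_F^2 for v ~ N(0, I).
    x = x.detach().requires_grad_(True)
    y = aux_fn(x)
    v = torch.randn_like(y)                       # single random probe vector
    (jtv,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True)
    return (jtv ** 2).sum()

# Illustrative use: torch.tanh stands in for the auxiliary function, and the
# penalty would typically be weighted into an existing training loss.
x = torch.rand(2, 8)
penalty = jacobian_penalty(torch.tanh, x)
```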
Optionally, said weighting is defined by a weighted norm based on a matrix associated with said rate of change. Optionally, the method comprises estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change. Optionally, approximating the norm of the matrix comprises making a single sample approximation. Optionally, the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps. Optionally, said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system;
with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second
image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a method of performing lossy image or video decoding, the method comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to any of the methods described above. According to an aspect of the present disclosure, there is provided a data processing apparatus configured to perform any of the above methods.
According to an aspect of the present disclosure, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of the above methods. According to an aspect of the present disclosure, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of the above methods. BRIEF DESCRIPTION OF THE DRAWINGS Aspects of the invention will now be described by way of examples, with reference to the following figures in which: Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network. Figure 3 illustrates an example of a video compression, transmission and decompression pipeline. Figure 4 illustrates an example of a video compression, transmission and decompression system.
Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 6 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 7 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 8 illustrates an example of an image or video compression, transmission and decompression pipeline. Figure 9 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline. Figure 10 illustratively shows an example sequence of layers of an image or video compression, transmission and decompression pipeline. Figure 11 illustrates an example of how optical flow information may be calculated between two images. Figure 12a illustrates steps of a MAD cost volume calculation. Figure 12b illustrates steps of a MAD cost volume calculation.
Figure 12c illustrates steps of a MAD cost volume calculation. Figure 13 illustrates steps of an RKADe cost volume calculation. Figure 14 illustrates an example of an image or video compression, transmission and decompression pipeline. DETAILED DESCRIPTION OF THE DRAWINGS Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H × W × C, where H represents the height of the image, W represents the width of the image and C represents the
number of channels of the image. Each H × W data point of the image represents a pixel value of the image at the corresponding location. Each channel of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue component of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video. The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network. Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an
increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation. AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network. Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer. Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix
defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network. Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network. To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of
the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation of the gradient of the loss function may be used to update the parameters of the neural network using the gradients ∂L/∂θ of the loss function, where L is the loss and θ denotes the parameters of the network. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network. In the context of image or video compression, this type of system, where simultaneous training with back-propagation is performed through each element of the whole network architecture, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns for itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has a greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features. It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model’s predictions and the actual data, represented by the rate-distortion loss function described herein. The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the
model’s parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data. Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model’s parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution. The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge. The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model’s performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
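Purely as a schematic illustration of the epoch, batch-size and learning-rate concepts above (the model, data and numerical values are placeholders, not components of the disclosure), a training loop typically has the following shape.

```python
import torch

# Placeholder model, data and loss; only the loop structure is of interest here.
model = torch.nn.Linear(8, 8)
dataset = torch.utils.data.TensorDataset(torch.rand(256, 8))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)   # batch size
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)                    # learning rate

for epoch in range(10):                  # one epoch = one full pass over the training data
    for (batch,) in loader:              # one parameter update per batch
        loss = ((model(batch) - batch) ** 2).mean()
        optimiser.zero_grad()
        loss.backward()                  # back-propagate gradients of the loss
        optimiser.step()                 # adjust learnable parameters by the learning rate
```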
It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details, such as hyperparameters and so on, of the training process may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an “untrained” neural network is considered to be “trained” is envisaged to be case specific and to depend, for example, on a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein. More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately. In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ·R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network. In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example
at www.cs.albany.edu/ xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the ImageNet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/). An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function. In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be, for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network. In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120
produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder. Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close as possible to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions; in other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
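To make the pipeline of Figure 1 and the rate-distortion trade-off concrete, the following is a heavily simplified sketch of one training step, assuming PyTorch, single-layer placeholder networks in place of the encoder 110 and decoder 120, a uniform-noise stand-in for quantisation (discussed further below), and a crude surrogate for the rate term in place of a learned entropy model; none of these placeholders reflect the actual architectures of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the encoder 110 and decoder 120; real architectures, entropy
# model and range/arithmetic coder are not reproduced here.
encoder = nn.Conv2d(3, 8, 5, stride=4, padding=2)
decoder = nn.ConvTranspose2d(8, 3, 4, stride=4)
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
lam = 0.01                                    # Lagrange multiplier weighting rate against distortion

x = torch.rand(4, 3, 64, 64)                  # batch of training images
y = encoder(x)                                # latent representation
y_q = y + torch.empty_like(y).uniform_(-0.5, 0.5)  # additive-noise proxy for rounding (training only)
x_hat = decoder(y_q)                          # output image
rate = y_q.abs().mean()                       # crude surrogate for bits estimated by an entropy model
distortion = F.mse_loss(x_hat, x)             # distortion D, here the MSE
loss = distortion + lam * rate                # Loss = D + lambda * R
optimiser.zero_grad()
loss.backward()
optimiser.step()
```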
Something which is closely linked to the entropy encoding of the latent distribution and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term. In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned
which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes. Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving an optimal rate and distortion trade-off. Returning to the compression pipeline more generally, the systems described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline. As described above in the context of quantisation, the AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder fh and a trained neural network 125 acting as a hyper-decoder gh. An example of such a
hyper-network is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised
by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above. In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to the trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 is shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150. Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150 or 155
is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised. To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indicator may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between a ground truth input and a fake input. When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the
compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6. Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by the decoder 120. Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video. The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be
referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and/or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time. The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map. An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image x0 is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in x0 indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and/or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image x̂0 which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image x̂0 to be available for later use by components of both the encoding and decoding sides of the pipeline. In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image xt that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images x̂t−1 from an earlier frame t−1. As described above, the previously reconstructed image x̂t−1 is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
At the encode side, previously reconstructed images may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed x_t and the previously reconstructed image from an earlier frame x̂_t−1 are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs x̂_t−1 and x_t, which is indicative of inter-frame movement of pixels or pixel groups between x̂_t−1 and x_t. The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map f̂. The reconstructed flow map f̂ is applied to the previously reconstructed image x̂_t−1 to generate a warped image x̂_t−1,t. It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image x̂_t−1,t is a prediction of how the previously reconstructed image x̂_t−1 might have changed between frame positions t−1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs x_t and x̂_t−1.
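By way of non-limiting illustration, the warping step described above may be implemented with, for example, bilinear sampling. The sketch below assumes a PyTorch environment and a flow map expressed as per-pixel displacements in pixel units; the function name warp and the tensor layouts are illustrative assumptions rather than a prescribed implementation.

import torch
import torch.nn.functional as F

def warp(prev_recon, flow):
    # prev_recon: [B, C, H, W], the previously reconstructed frame (e.g. x̂_t−1).
    # flow: [B, 2, H, W], per-pixel displacements (dx, dy) in pixel units.
    b, _, h, w = prev_recon.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_recon.device)  # [2, H, W]
    coords = base.unsqueeze(0) + flow                                  # [B, 2, H, W]
    # Normalise coordinates to [-1, 1] as expected by grid_sample.
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                       # [B, H, W, 2]
    # Bilinear warp of the previous reconstruction towards frame t.
    return F.grid_sample(prev_recon, grid, mode="bilinear", align_corners=True)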
As with the I-frame, the reconstructed flow map f̂ and corresponding warped image x̂_t−1,t may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides. In the example of Figure 3, both the image being compressed x_t and the warped image x̂_t−1,t are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2, but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image x_t and the warped image x̂_t−1,t. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map r̂ from the decoded latent representation. Alternatively, a residual map may first be pre-calculated between x_t and the warped image x̂_t−1,t, and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function. Finally, on the decode side, the residual map r̂ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image x̂_t which is a reconstruction of image x_t and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream. It will be appreciated that the reconstructed image x̂_t can then be
used to process the next frame. That is, it can be used to compress, transmit and decompress x_t+1, and so on until an entire video stream or chunk of a video stream has been processed. Thus, for a block of video frames comprising an I-frame and N subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain hyper and hyper-hyper parameters, their quantised, entropy encoded latent representations and so on, of those networks as applicable. Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 6680-6689). The above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame x̂_t−1) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
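For illustration only, the decode-side processing of a single P-frame described above might be sketched as follows, assuming additive residuals. The helper names (entropy_decode, flow_decoder, residual_decoder, warp) are placeholders standing in for the entropy coder and trained decoder networks of the pipeline, not a prescribed API.

def decode_p_frame(prev_recon, flow_bitstream, residual_bitstream,
                   entropy_decode, flow_decoder, residual_decoder, warp):
    # Reconstruct the flow map from its entropy coded latent and warp the
    # previous reconstruction to obtain a prediction of frame t.
    flow = flow_decoder(entropy_decode(flow_bitstream))
    warped = warp(prev_recon, flow)
    # Reconstruct the residual map and combine it with the warped prediction.
    residual = residual_decoder(entropy_decode(residual_bitstream))
    return warped + residual   # reconstructed P-frame, e.g. x̂_t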
Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR/VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU (or NPU or TPU as applicable) of the first and second devices respectively. That is, compute for performing both encoding and decoding is provided by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise, for example, a Qualcomm Snapdragon CPU. The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images of a scene, referred to hereafter as a video stream 404. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and/or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and/or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, in preparation for streaming 412 to and rendering on a display device 413 of the second device 402. It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30fps of a 1080p video stream, which requires the cumulative latency of the encode and decode sides to be below substantially 50ms, for example substantially 30ms or less. Achieving this level of runtime performance with only CPU (or NPU or TPU) compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature. For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations. A number of concepts related to the AI compression processes and/or their implementation in a hardware system discussed above will now be described. Although each concept is
described separately, one or more of the concepts described below may be applied in an AI based compression process as described above. Concept 1: Training super resolution with learned image and video compression Figure 5 illustrates an example of an image or video compression, transmission and decompression pipeline 500. The pipeline illustrates a method of the present disclosure that corresponds to that described in relation to Figures 1 to 4. Like-numbered features correspond to those in Figures 1 to 4. However, in Figure 5 the pipeline is further wrapped in a super-resolution wrapper. That is, the encoder E is preceded by a downsampler 501, and the decoder D is followed by an upsampler 502. We first introduce the super-resolution wrapper 501, 502 around the pipeline 500 during training, which comprises making the evaluated loss function be based on one or more terms from the list of: a difference between the output image x̂ and the input image x, a difference between the output image x̂ and the downsampled input image x_d, a difference between the upsampled output image x̂_up and the input image x, and/or a difference between the upsampled output image x̂_up and the downsampled input image x_d. Modifying the loss function in this way allows the pipeline to be "super-resolution" aware. That is, the loss function comprises a term that is not just based on differences between the input to the encoder and the output of the decoder, but also or alternatively on the input x and the output x_d of the downsampler 501 and/or the input x̂ and output x̂_up of the upsampler 502. In this way, the weights of the neural networks of the pipeline 500 (e.g. the encoder E, the decoder D, and/or the corresponding hyper encoder and hyper decoders) will be optimised to allow the encoder E to produce a
latent representation that has a distribution that is optimally entropy encodable to hit low target bit rates while at the same time allowing the decoder D to output images that have distributions that can be optimally upsampled by the upsampler 502 to produce upsampled output images x̂_up that are as close to the original input images x as possible. More generally, making the neural compression pipeline 500 super-resolution aware in this way during training results in trained networks of the pipeline 500 that, when wrapped in the super-resolution down- and/or up-samplers 501, 502 during inference, produce output images x̂_up that are closer to the input images x than a network or networks that were not trained in a super-resolution aware manner. Accordingly, the method comprises receiving an input image x at a first computer system, downsampling the input image with a downsampler 501 to produce a downsampled input image x_d, encoding the downsampled input image x_d using a first neural network E to produce a latent representation, decoding the latent representation using a second neural network D to produce an output image x̂, wherein the output image x̂ is an approximation of the input image (e.g. of the downsampled input image x_d which in turn is an approximation of the original input image x), upsampling the output image x̂ with an upsampler 502 to produce an upsampled output image x̂_up, evaluating a function (i.e. a loss function) based on a difference between one or more of: the output image x̂ and the input image x, the output image x̂ and the downsampled input image x_d, the upsampled output image x̂_up and the input image x, and/or the upsampled output image x̂_up and the downsampled input image x_d, updating the parameters of the first neural network E and the second neural network D based on the evaluated function,
and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. In the above-described method, it is envisaged that the upsampler 502 may comprise a third neural network, and the method further comprises updating the parameters of the third neural network based on the evaluated function. This is illustrated in Figure 6, showing a pipeline 600 corresponding to that of Figure 5 but where the upsampler 602 is shown as a neural network U. The upsampler neural network U is responsible for upscaling the output image obtained from the second neural network back to its original size or resolution. The upsampler neural network U may comprise a convolutional neural network architecture. For example, the upsampler may be implemented using transposed convolutions or deconvolution layers. These layers perform the inverse operation of regular convolutions and can be used to increase the spatial resolution of an image. The upsampling process can be further enhanced using various techniques such as skip connections or residual connections. Skip connections allow for direct transmission of information from earlier layers in the network to later layers, bypassing some of the intermediate layers and thereby allowing the model to leverage detailed information present in the initial stages of processing. Residual connections add the output of a layer directly to the input of another layer, effectively performing addition or subtraction operations within the network.
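By way of a non-limiting sketch, an upsampler neural network of the kind described above might combine a transposed convolution with a residual connection as follows; the class name, channel counts and kernel sizes are illustrative assumptions rather than prescribed values.

import torch.nn as nn

class Upsampler(nn.Module):
    # Illustrative 2x upsampler: a transposed convolution doubles the spatial
    # resolution and a small residual branch refines the result.
    def __init__(self, channels=3, features=32):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, features, kernel_size=4, stride=2, padding=1)
        self.refine = nn.Sequential(
            nn.Conv2d(features, features, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(features, features, kernel_size=3, padding=1),
        )
        self.out = nn.Conv2d(features, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = self.up(x)           # [B, features, 2H, 2W]
        h = h + self.refine(h)   # residual connection
        return self.out(h)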
These techniques can improve the accuracy and stability of the neural network-based upsampler by allowing it to better capture fine details in the image. More specifically, a neural network based upsampler such as U can be trained together, e.g. in an end-to-end manner, with the trainable neural networks of the neural compression pipeline 600, making the entire pipeline super-resolution aware. This is an important distinction compared to simply applying a super resolution wrapper to a neural or traditional compression pipeline. This is because it affords the networks of the pipeline the freedom to learn to produce latent representations that compress optimally and that get decoded into output images x̂ that may not be visually pleasing or indeed look to the human visual system like an accurate reconstruction of the original input images x or their downsampled versions x_d, but which nonetheless have pixel value distributions that the upsampler neural network U can take better advantage of to produce upsampled output images x̂_up that are more accurate approximations of the original input images x than an upsampler that is acting on the output of a compression pipeline that is not super-resolution aware (in this case upsampler aware). In the example of Figure 6, the downsampler may be a traditional downsampler, for example it is envisaged that the downsampler may comprise a bilinear or bicubic downsampler. Bilinear and bicubic downsampling are exemplary methods used for image resizing. They involve reducing the resolution of an input image, for example by a factor of 2x2 (e.g., from 100x100 to 50x50). Further exemplary details of bilinear and bicubic downsampling are provided below.
Bilinear Downsampling: In bilinear downsampling, the algorithm approximates the original pixel values based on the average intensity values of the surrounding pixels in the scaled image. This method assumes that the pixel intensities are uniformly distributed across the image. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Set the output pixel value to be an average of the input pixels. 3. Replace the output pixel positions with their respective calculated values from step 2. Bicubic Downsampling: In bicubic downsampling, the algorithm uses a third-order polynomial function to approximate the original pixel values based on the intensities of a neighborhood surrounding the current pixel. This method takes into account more details than bilinear interpolation but requires more computational resources. The basic implementation for a 2x2 downsampling operation is as follows: 1. Read the four input pixels (A, B, C, and D). 2. Calculate the coefficients of the third-order polynomial function. 3. Calculate the output pixel values: x and y, using the coefficients obtained in step 2. In the context of the compression pipeline 600 shown in Figure 6, either bilinear or bicubic downsampling can be used. The choice between these two methods depends on the desired tradeoff between computation complexity and visual quality of the output images
whereby bilinear is envisaged to be quicker and accordingly contributes to faster run times on the encode side of the pipeline, while the neural network based upsampler U is retained to emphasise reconstruction accuracy. Optionally, the evaluated function (i.e. the loss function) may further be based on said differences comprising a structural similarity index measure (SSIM). The SSIM is a quality metric that compares two images in terms of their structure and contrast. For example, when used here it evaluates the similarity between an output image and its corresponding input image or downsampled input image, upsampled output image, and/or any combination thereof. By using the SSIM as the evaluation function, the method aims to optimize the neural networks for preserving the structural information in the images during the encoding and decoding process, thus improving the overall quality of the generated output images. As the human visual system is often able to perceive this higher level structural information, using the SSIM allows the network to learn to optimise for this type of difference rather than for a simpler mean square error (MSE) loss. Alternatively, MSE may be used as it is quicker and simpler to calculate and can accordingly speed up training times. In the above-described method, the downsampling can be performed using a Gaussian blur filter. That is, it is envisaged that downsampling can be achieved through an implementation wherein the input image x is filtered with a Gaussian blur. The Gaussian blur filter is a type of low-pass filter that smoothes out the image by reducing high-frequency details while preserving lower frequency information. This helps to reduce the complexity of the input image and makes it easier for the first, second and third neural networks to learn the underlying
patterns in the data and may help the loss to converge during training. In this implementation, the downsampled input image x_d produced is a smoother representation of the original input image, which can be used as an input for encoding using the first neural network E. It is envisaged that end-to-end training is preferable as it makes the pipeline fully super-resolution aware. However, it can be challenging to get the loss to converge during training and/or for training to be stable when all neural networks are being optimised simultaneously. This training instability and slow convergence can be mitigated by splitting the training into multiple phases. For example, the method may comprise (i) updating the parameters of the first neural network E and the second neural network D based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network U, (ii) freezing the parameters of the first and second neural networks E, D after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network U, for a second number of said steps. That is, the underlying compression pipeline 600 is trained first. After this initial training phase, the parameters of the first and second neural networks E, D are frozen, and then the system proceeds with a secondary training phase in which it updates the parameters of the third neural network U. As described above, this split training approach can mitigate training instability.
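A minimal sketch of this two-phase schedule, assuming a PyTorch training loop, is given below. The names E, D and U denote the first (encoder), second (decoder) and third (upsampler) neural networks, and loss_fn, downsample, loader and the learning rates are illustrative placeholders rather than prescribed components.

import torch

def train_two_phase(E, D, U, downsample, loader, loss_fn, n1, n2):
    # Phase 1: train the compression networks without the wrapper.
    opt1 = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-4)
    for _, x in zip(range(n1), loader):
        loss = loss_fn(D(E(x)), x)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Phase 2: freeze E and D, activate the wrapper and train only U.
    for p in list(E.parameters()) + list(D.parameters()):
        p.requires_grad = False
    opt2 = torch.optim.Adam(U.parameters(), lr=1e-4)
    for _, x in zip(range(n2), loader):
        x_up = U(D(E(downsample(x))))
        loss = loss_fn(x_up, x)
        opt2.zero_grad(); loss.backward(); opt2.step()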
Moving now to Figure 7, which shows a compression pipeline 700 similar to that of Figure 5, except that the downsampler now comprises a neural network 701. The features that correspond to those of Figure 5 are not repeated here for brevity. More specifically, it is envisaged that the downsampler 701 may consist of a fourth neural network G. Further, the training method also comprises updating the parameters of this fourth neural network G based on the evaluated function. The fourth neural network G, referred to as the downsampler 701, is responsible for reducing the spatial resolution of the input image while preserving important features and details in order to process them efficiently during encoding by the first neural network E. In a similar manner as described in connection with the upsampler in Figure 6, the downsampler neural network G may be trained in an end-to-end manner with the first and second neural networks E, D of the compression pipeline 700. This approach provides the end-to-end system with an extra degree of freedom to produce downsampled input images x_d that may not be in any way visually pleasing or accurate representations of x, but which can be optimally encoded by E into a latent representation that is distributed in a way that is efficiently entropy encodable and which can be decoded and subsequently upsampled into more accurate reconstructions of x than would otherwise be possible with a pipeline that is not super-resolution aware (in this case down-sampler aware). That is, unlike in traditional super resolution approaches where the downsampling attempts to produce representations of the input image that are as accurate or visually pleasing as possible, the neural network downsampler G can produce whatever output is needed by the neural
compression pipeline to help achieve a target bitrate while maintaining final output image accuracy. One exemplary downsampler having a neural network architecture is a network comprising a plurality of convolutional layers with decreasing filter sizes and increasing strides, as this approach can effectively reduce the spatial dimensions of the input image while maintaining its overall structure. In more detail, let's consider a simple example of a downsampling process using a convolutional neural network (CNN). The CNN architecture typically consists of multiple layers, each layer being composed of several filters applied across the spatial dimensions of the input image. These filters are learnable parameters that enable the network to extract various features from the image and recognize complex patterns or shapes within it. In the case of a downsampler, we can start with an initial convolutional layer having a large filter size (e.g., 7x7) and small strides (e.g., 2x2). This combination results in a significant reduction in spatial dimensions while still allowing the network to capture essential information from the input image. Subsequent layers can then use smaller filters (e.g., 3x3, 5x5) with larger strides (e.g., 2x2, 4x4), further reducing the size of the feature maps while also encouraging more localized receptive fields within the network. One drawback of large kernel sizes is that they are more computationally expensive than smaller kernels, even if they are strictly speaking more expressive.
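A non-limiting sketch of such a downsampler, following the large-kernel-first pattern described above, could be expressed as follows; the channel counts and the number of layers are illustrative assumptions.

import torch.nn as nn

# Illustrative downsampler: a large first kernel with stride 2, followed by a
# smaller kernel with stride 2, reducing the input to a quarter of its original
# height and width while returning to three output channels.
downsampler = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3),   # H/2 x W/2
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),  # H/4 x W/4
    nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1),   # 3-channel output
)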
The choice of downsampler, and of the filter sizes and strides within it, controls the balance between preserving important image details and efficiently processing the data. In practice, the downsampler may comprise multiple convolutional layers with decreasing filter sizes and increasing strides, followed by one or more max-pooling layers to further reduce the spatial dimensions of the input image. Reducing the number of layers and using small filters or kernels helps to speed up run time. A further illustrative downsampler may comprise a network architecture with a number of layers with a stride greater than 1. Every such layer will downsample by a factor of the stride. While the filter or kernel sizes can affect the spatial dimensions of the input (i.e. a bigger kernel resulting in greater downsampling), zero padding may be applied in such a way that the output remains the same size as the input if the stride equals 1. This type of downsampler structure is typically fast and accordingly works well in the context of real time or near real time compression. As alluded to above, in Figure 7 it is envisaged that the upsampler may comprise a bilinear or bicubic upsampler. Bilinear upsampling (or interpolation) is a method for upsampling an image. It involves estimating pixel values by linearly interpolating between the neighboring pixels in the original and downsampled images. An example implementation algorithm may be as follows:
1. For each pixel in the output image, find its corresponding location or pixel-coordinate in the input image. 2. Find four nearby pixel coordinates around this central coordinate of the input image. These are typically referred to as northeast (NE), southeast (SE), northwest (NW), and southwest (SW). 3. Compute a weighted average of these four pixels, where the weights depend on their distances from the desired output pixel location. The weights are usually determined by a bilinear function. 4. Repeat steps 1-3 for every pixel in the output image. Bicubic upsampling (or interpolation) is a technique to upscale an image. It is similar to bilinear interpolation but uses a bicubic function instead of a linear one. The algorithm is slightly more complex, and the resulting images tend to have smoother edges than those produced by bilinear interpolation. An example implementation algorithm may be as follows: 1. For each pixel in the output image, find its corresponding pixel in the input image. 2. Find 16 nearby pixel coordinates around this central coordinate of the input image. These are typically referred to as northeast (NE), north-northeast (NNE), northwest (NW), southwest (SW), southeast (SE), south-southeast (SSE), south (SO), and south-southwest (SSW). 3. Fit a bicubic function with coefficients that comprise weighted sums of the values of the input pixels.
4. Repeat steps 1-3 for every pixel in the output image. Both bilinear and bicubic interpolation can produce reasonably good results when upsampling images, but the choice between them will depend on the specific use case and desired level of detail preservation. In the present case, bilinear is envisaged to be preferred as it is faster and can reduce runtime while working in a pipeline 700 with a (typically slower) neural network based downsampler such as G on the encode side. As was the case with the neural network upsampler U, training in a fully end-to-end manner may introduce training instability and slow convergence. To address this, the training may be split into phases. For example, the method may comprise (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. More generally, training a neural network based downsampler G in an end-to-end manner is further complicated by it not being straightforward what the downsampler's training objective should be, i.e. what its loss ought to be based on. For example, should the loss terms that include the output of the downsampler G compare the input image x with the immediate output of the downsampler G, i.e. x_d, or with a
previously downsampled image (e.g. one created by a traditional downsampling method) so as to teach the downsampler to mimic a traditional downsampler, or should they be based on some other difference. The present inventors have found that as long as the loss function includes a term based on comparing the output of G with something that is not just the original input image x but also some other output of the pipeline and/or a downsampled image produced by a traditional method of downsampling, the loss converges more quickly, indicating that the neural network downsampler is learning to become super resolution aware. It is also envisaged that both the upsampler and downsampler may be neural networks. This is shown in Figure 8, which corresponds to Figure 5 but where the upsampler and downsampler comprise neural networks. Like-numbered references indicate like features which are not repeated here for brevity. Specifically, Figure 8 illustratively shows a pipeline 800 comprising a neural network downsampler G 801 and a neural network upsampler U 802 wrapped around a neural compression pipeline. It is envisaged that these may be trained in an end-to-end fashion. More specifically, by making the loss function be based on comparisons between not just the input image x and the final upsampled output image x̂_up, but also between the output of the downsampler x_d and the various other outputs of the pipeline, as well as optionally a previously downsampled image, the network learns to become super-resolution aware and outperforms networks where the training of the neural compression pipeline is not connected to the up- and down-samplers, either through training or through the comparisons calculated in the terms of the loss function. Considering all of the above approaches more generally now, it is envisaged that the methods described above may comprise entropy encoding the latent representation into a bitstream
with a specified length, wherein the function used in the method is also dependent on said bitstream length. That is, the loss function further comprises a rate term. Including the rate term in the loss function allows the networks to learn to optimise for bit rates (e.g. in bits per pixel) simultaneously with image reconstruction accuracy. Some or all of the loss terms may be scaled or weighted with respect to each other to focus the learning on any of the objectives as defined by the different loss terms. Additionally and/or alternatively, the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image may be determined based on the output of a fifth neural network acting as a discriminator. It will be understood that the output of the discriminator may be the differentiation between a ground truth image and a "fake" (i.e. compressed) reconstructed image during training. The discriminator loss term used in the training of the encoder/decoders of the AI compression pipeline is only a function of the compressed image. In essence, the training tries to encourage the neural networks to change in such a way that their output will be more realistic and like the ground truth image. Faithfulness to the ground truth image is taken care of by the distortion loss term (e.g. mean squared error) or other loss. The evaluation of the function based on these differences guides the process of updating the parameters of the first neural network (encoder) and the second neural network (decoder), leading to improved performance and
better results over time. This approach can help to improve the overall quality of the generated output images. It is accordingly also envisaged that the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM). These correspond to the distortion term of the loss function. MSE measures the average squared difference between two images, while SSIM computes structural similarity based on luminance, contrast, and structure. By using these metrics, the method is able to optimize the parameters of the neural networks to better approximate the input image in subsequent iterations, resulting in improved image reconstruction performance over multiple runs with different sets of input images. In the method as described above, the function may further comprise a term defining a visual perceptual metric that models how a human visual system may perceive differences. In the above-described method, it is envisaged that the term defining a visual perceptual metric may comprise an MS-SSIM loss. This loss function serves to gauge how effectively the network is approximating the input image with the output image. By iteratively minimizing this loss function through parameter updates in the neural networks, the trained neural network improves its ability to generate an output image that closely resembles the input image.
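Purely by way of example, a loss function combining a rate term with MSE- and SSIM-based distortion terms of the kind discussed above might be sketched as follows; the weighting factors, the ssim callable and the argument names are illustrative assumptions rather than the specific loss used by the pipeline.

import torch.nn.functional as F

def loss_fn(x, x_d, x_hat, x_up, rate, ssim, w_rate=1.0, w_mse=1.0, w_ssim=0.1):
    # x: input image, x_d: downsampled input, x_hat: decoder output,
    # x_up: upsampled output, rate: estimated bits per pixel from the entropy model.
    distortion = F.mse_loss(x_hat, x_d) + F.mse_loss(x_up, x)
    structural = 1.0 - ssim(x_up, x)   # any differentiable SSIM/MS-SSIM implementation
    return w_rate * rate + w_mse * distortion + w_ssim * structural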
It is further noted that the above described methods may be used in the context of any pre-trained neural compression network, and accordingly the present disclosure envisages a method where only the weights of the upsampler and/or downsampler are updated during training. Accordingly, such a method comprises receiving an input image at a first computer system, encoding the input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image. Next, the output image is upsampled with an upsampler comprising a third neural network. Thereafter, the difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image is evaluated using a function, and the parameters of the third neural network are updated based on the evaluated function. These steps are then repeated using a first set of input images to produce a first trained neural network and a second trained neural network. Or, in the case where only the downsampler weights are trained, the present disclosure envisages a method comprising receiving an input image at a first computer system, downsampling the input image with a downsampler comprising a fourth neural network to produce a downsampled input image, encoding the downsampled input image using a first neural network to produce a latent representation, decoding the latent representation using a second neural network to produce an output image that approximates the input image, evaluating a function based on differences between the output image and other images, updating the parameters of the fourth neural network based on the evaluated function, and repeating these steps with a first set of input images to create trained versions of the first and second neural networks.
As described above, it is envisaged that producing the previously downsampled input image may be performed by either bilinear or bicubic downsampling techniques. Finally, the present disclosure also proposes using a network trained in accordance with the above-described methods. That is: receiving an input image at a first computer system, downsampling the input image with a downsampler, encoding the downsampled input image using a first trained neural network to produce a latent representation, transmitting the latent representation to a second computer system, decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image, and upsampling the output image with an upsampler to produce an upsampled output image. A specific example use case of the above-described super resolution approaches will now be described: more flexible bitrate ladders. A bitrate ladder refers to a set of predefined bitrates applied within an encoded file. In the context of video compression, it refers to a series of different bitrates that can be chosen to achieve the desired trade-off between video quality and file size. In general, when encoding video files, a balance is struck between two conflicting goals: achieving high visual quality while minimizing the file size. The process involves converting the raw data into a compressed format that requires less storage space.
To accomplish this task in traditional compression, a number of known algorithms are used, often dictated by compression codec standards. One such standard is H.264/AVC (Advanced Video Coding), which is widely adopted due to its balance between encoding complexity and image quality. The H.264 standard includes various profiles and levels that define the maximum bitrate and other parameters for a given video stream. Implementers of the standard typically target these profiles and levels to ensure their implementation is standard compliant. More specifically, when using these standards, implementers can choose from different preset bitrates. These predefined bitrates are often referred to as a "ladder" because they represent a series of steps or options available when choosing the optimal encoding settings for a given video file. Typically, bitrate ladders make use of the idea that the encode resolution can be varied a priori, whereby the streaming provider has already pre-encoded its content at a plurality of different resolutions, which in turn facilitates giving an end user a choice of what quality setting to apply given some particular bitrate budget. The bitrate ladder approach works by progressively decreasing the bitrate from one level to another, allowing a given balance between quality and file size to be found. For example, starting at a higher-than-desired bitrate, you can gradually reduce it until you reach an acceptable level of visual degradation without sacrificing too much detail or clarity. In a neural compression pipeline, the neural networks are typically trained to perform optimally at a given bitrate. Accordingly, covering all the predefined bitrates of a given bitrate ladder may require training separate neural networks for each predefined bitrate, which can be burdensome and may result in the final codec library memory footprint being potentially very large.
To overcome this issue, the above-described approaches may be used to make the base neural compression networks super-resolution aware, which allows a given base neural compression pipeline to be used not only for compression to its targeted bitrate, but also for other bitrates in the bitrate ladder by applying the super-resolution wrapper around the base models when desired. This in turn makes it more viable to use neural compression pipelines in the implementation of a bitrate ladder to comply with a given codec standard, or to aim for an entirely different bitrate ladder that provides even better rate-quality trade-offs than traditional codec bitrate ladders, given how significantly better neural compression pipelines can compress images and video compared to known codecs. Concept 2: Lightweight convolutional downsampling and upsampling As has been explained above in the general section, there is a wide gap between the ideation of high level ideas, the research-stage implementations of super resolution architectures, and the production level implementation of such architectures. This is particularly the case where the implementation is intended to be performed in real time or near real time on resource-constrained hardware, such as edge devices. For example, a research-stage approach that works well and fast on a GPU such as an NVIDIA 4090, A10 or A100 card is very unlikely to achieve the same performance on resource-constrained mobile device platforms such as laptops, tablets and smartphones. One area of AI-based video compression where this is particularly problematic is in the implementation of downsampling and upsampling algorithms.
More particularly, one common component of such upsampling and downsampling algorithms is a process known as PixelShuffle and PixelUnshuffle. Both operations manipulate the arrangement of data in tensors (multi-dimensional arrays) that represent images. PixelShuffle is often used in super-resolution models. That is, in general terms, PixelShuffle increases the resolution of an input image by rearranging the elements of a tensor. The following outlines a PixelShuffle operation: Input Tensor: shape [batch_size, C * r^2, H, W], where C is the number of channels (e.g., 3 for an RGB image), r is the upscale factor, and H and W are the height and width of the tensor. Rearrangement of Data: PixelShuffle rearranges elements in this tensor to form a new tensor of shape [batch_size, C, H * r, W * r]. Essentially, it "shuffles" the data from the channel dimension into the spatial dimensions (height and width). Upscaling: This operation effectively upscales the image by a factor of r, increasing the resolution. For example, if r = 2, each pixel in the original image is rearranged to form a 2x2 block in the output image. Application in Super-Resolution: In super-resolution models like EDSR or SRGAN, PixelShuffle is used in the latter stages to upscale the low-resolution input to a high-resolution output. It is a part of the sub-pixel convolution technique where the model first increases the number of channels with additional convolutions and then uses PixelShuffle to upscale the image spatially.
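For illustration, the shape behaviour described above can be checked directly with the PixelShuffle module available in PyTorch; the tensor sizes below are arbitrary example values.

import torch
import torch.nn as nn

x = torch.randn(1, 3 * 2 ** 2, 16, 16)    # [batch_size, C * r^2, H, W] with C = 3, r = 2
y = nn.PixelShuffle(upscale_factor=2)(x)
print(y.shape)                             # torch.Size([1, 3, 32, 32])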
PixelUnshuffle is the reverse operation of PixelShuffle. It is used to decrease the spatial resolution of an image while increasing the number of channels. The following outlines a PixelUnshuffle operation: Input Tensor: shape [batch_size, C, H, W]. Rearrangement of Data: PixelUnshuffle rearranges elements to form a new tensor of shape [batch_size, C * r^2, H/r, W/r]. It does so by taking spatial blocks of size r x r and stacking them depth-wise in the channel dimension. Downscaling: This process effectively downscales the image by a factor of r, reducing its spatial dimensions. For example, if r = 2, a 2x2 block of pixels in the input image is rearranged into a single pixel in the output, with the depth (channels) increased by a factor of 4. Application: PixelUnshuffle can be used in tasks like feature extraction, where reducing spatial resolution while retaining information in the channel dimension might be beneficial. It is also useful in certain generative models or autoencoders where manipulating spatial resolution at different stages of the network is required. More generally, PixelShuffle is used for upscaling an image by rearranging the channel data into the spatial dimensions, whereas PixelUnshuffle does the opposite, downscaling an image by rearranging spatial data into the channel dimension.
PixelShuffle and PixelUnshuffle are specific implementations of "depth-to-space" and "space-to-depth" operations. These can be explained in their generalised form as follows. The following outlines a depth-to-space operation: Input Tensor: The operation takes an input tensor of shape [batch_size, C * r^2, H, W], where C is the number of channels, r is the upscale factor, and H and W are the height and width. Rearrangement of Data: It rearranges the elements of this tensor to form a new tensor of shape [batch_size, C, H * r, W * r]. This rearrangement involves redistributing the elements from the depth (channels) into the spatial dimensions (height and width). Upscaling Effect: The result is an upscaling of the image or feature map by a factor of r, with each pixel in the original tensor contributing to a block of pixels in the output tensor. The following outlines a space-to-depth operation: Input Tensor: the input of a space-to-depth operation is a tensor of shape [batch_size, C, H, W]. Rearrangement of Data: Space-to-Depth rearranges elements to produce a new tensor of shape [batch_size, C * r^2, H/r, W/r]. It does this by taking blocks of pixels from the spatial dimensions and stacking them in the channel dimension. Downscaling Effect: This leads to a reduction in the spatial resolution by a factor of r, while increasing the depth (channels) of the tensor.
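By way of illustration, the generalised rearrangements above can be written purely as reshape and permute operations; the exact channel ordering below is one possible convention and may differ from that of PixelShuffle/PixelUnshuffle.

import torch

def space_to_depth(x, r):
    # [B, C, H, W] -> [B, C*r*r, H/r, W/r]
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // r, r, w // r, r)
    x = x.permute(0, 1, 3, 5, 2, 4)
    return x.reshape(b, c * r * r, h // r, w // r)

def depth_to_space(x, r):
    # [B, C*r*r, H, W] -> [B, C, H*r, W*r]
    b, c, h, w = x.shape
    x = x.reshape(b, c // (r * r), r, r, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3)
    return x.reshape(b, c // (r * r), h * r, w * r)

Written this way, the operations consist of many small memory rearrangements, which is precisely the behaviour discussed next.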
A problem with all of the above approaches is that they involve a very large number of small parallelisable operations. These can be performed very efficiently in parallel on GPUs but cause bottlenecks and dramatic decreases in performance on CPUs or other more resource-constrained hardware. The present inventors have realised that both depth-to-space and space-to-depth operations used in upsampling and downsampling can be replaced wholly with convolutional operations. This is made possible by virtue of the realisation that the change in dimensions from the key "rearrangement of data" steps of depth-to-space and space-to-depth can be achieved by one or more convolutional operations, for example performed by corresponding one or more convolutional layers in a neural network. Convolutional operations are typically already optimised at the chip level in commonly available commercial hardware chips on laptops, tablets and smartphones (such as the M2 and M3 Apple chips, the Qualcomm Snapdragon chips, the Meteor Lake Intel chips, and others). Accordingly, replacing depth-to-space and space-to-depth with convolutions allows upsampling and downsampling to be performed far more efficiently than with traditional depth-to-space and space-to-depth operations. Whilst the above improvement is envisaged to be used in the context of, for example, super-resolution such as that described in concept 1, for instance in the upsamplers and/or downsamplers of Figures 5 to 8, it is generally applicable to any instance where upsampling or downsampling might be performed in a neural compression pipeline. For example, one or more layers in the first or second neural networks of Figures 1, 2, 11 or 14 may be configured to downsample or upsample an input. Or there may be an intermediate layer within these networks, or
modules (not shown in the Figures) positioned throughout the pipeline that may perform downsampling or upsampling. In each of these cases it is envisaged that these downsampling or upsampling operations may be performed without depth-to-space or space-to-depth approaches, but with convolutional operations instead. Alternatively, there may be some combination of depth-to-space or space-to-depth in some instances but convolutional operations in others. For example, the flow module 206 in Figure 3 may comprise one or more layers or modules configured to downsample the input. This is one way to speed up runtime, as the flow often need not be estimated at as high a resolution as the input image: the quality of reconstructed images created with a high resolution flow can be similar to that of reconstructed images created with a low resolution flow. This downsampling, if performed using traditional space-to-depth, would be a bottleneck. However, replacing space-to-depth with one or more convolutional operations removes this bottleneck due to the convolutional operations being faster and better optimised for hardware-constrained platforms. The corresponding inverse upsampling may then be applied at the output of the flow module 206, again using convolutional operations rather than depth-to-space. A corresponding set of operations may be performed with the residual module of Figure 3, in any of the modules of the hypernetwork in Figure 2, and/or in the compression pipeline of Figure 1. An exemplary implementation of mimicking space-to-depth (i.e. downsampling) may be as follows.
CONVOLUTIONAL LAYER SETUP: Kernel Size: The kernel size is envisaged to match the block size that is to be mimicked. For example, if the block size for space-to-depth is r (say, 2 for a 2x2 block), the corresponding convolution kernel size would be r x r (2x2 in the above example). Stride: The stride size is envisaged to equal the block size (r). This ensures that the convolutional filters move across the image in steps equal to the block size, effectively capturing the spatial blocks. Number of Filters: It is envisaged that the number of filters is set to C * r^2, where C is the original number of channels. This ensures that each filter produces an output that corresponds to one depth level in the space-to-depth transformation. SEQUENTIAL CONVOLUTION LAYERS: To fully replicate space-to-depth, it is envisaged that a number of convolutions may be used sequentially, e.g. by using a series of convolutional layers. This allows for the handling of cases where the channel increase (to C * r^2) is significant. Each layer progressively accumulates more spatial information into the depth dimension. For completeness, it is also possible to replicate space-to-depth with a single strided convolution. Splitting it into multiple convolutions with activations between them provides additional expressive power, but is not needed to merely replicate the functionality of space-to-depth. ACTIVATION FUNCTIONS:
It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially. An exemplary activation function may be ReLU, which introduces a non-linearity and helps in learning spatial patterns. CHANNEL REARRANGEMENT: Finally, after these convolutional operations, the output channels may optionally be rearranged to match the order that a space-to-depth operation produces. Alternatively, this can be done at any point in the process, and need not be done "on the fly". An exemplary implementation of mimicking depth-to-space (i.e. upsampling) may be as follows: CONVOLUTIONAL LAYER SETUP: Kernel Size: A kernel size is envisaged that aligns with the desired spatial expansion. For example, if the upscale factor is r, a larger kernel size (like 3x3 or larger) can be more effective in spreading out the information across a larger spatial area. Alternatively, the depth-to-space operation can be implemented with a single transposed convolution with stride equal to the upsampling factor. It is also possible to make a more expressive process by using larger kernel sizes and/or splitting a more extensive upsample into multiple stages. Stride: It is envisaged that the stride may be set to 1, ensuring a uniform spread of information. Or the stride may be equal to the upsampling factor, as described above.
Number of Filters: This may be less than the original number of channels to reduce the channel dimension gradually. The exact number can vary depending on the architecture and desired output. SEQUENTIAL CONVOLUTION LAYERS: Multiple convolutional layers may be advantageous, especially if the change from depth to spatial dimensions is significant. Each layer can gradually increase the spatial dimensions and reduce the depth. ACTIVATION FUNCTIONS: It is also envisaged that one or more activation functions may be applied between sequential convolution layers. These help in spreading out the information spatially. An exemplary activation function may be ReLU, which can introduce non-linearity and help in learning spatial patterns. UPSAMPLING LAYERS: Alongside the convolutional layers, upsampling layers (like nearest neighbor or bilinear upsampling) can be used to increase the spatial dimensions. These can be alternated with convolutional layers to progressively achieve the desired spatial expansion.
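As a minimal sketch of the replacement described above, a strided convolution may stand in for space-to-depth and a strided transposed convolution for depth-to-space; the channel counts follow the C to C * r^2 pattern for r = 2 and are otherwise illustrative assumptions.

import torch.nn as nn

r, C = 2, 3
# Space-to-depth replacement: stride r reduces H and W by a factor of r while
# the filter count expands the channels to C * r^2.
space_to_depth_conv = nn.Conv2d(C, C * r * r, kernel_size=r, stride=r)
# Depth-to-space replacement: a transposed convolution with stride r restores
# the spatial dimensions while reducing the channels back to C.
depth_to_space_conv = nn.ConvTranspose2d(C * r * r, C, kernel_size=r, stride=r)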
The above exemplary implementation is illustrative only and is not intended to be limiting. For example, any suitable kernel size, stride and filter numbers are envisaged, as are the number of optional sequential convolution layers, activation functions and other steps. By way of illustration, Figure 9 shows an example sequence of layers of a neural network which takes an input image and downsamples it. The sequence of layers comprises a 3x3 conv layer, a ReLU activation function, a space-to-depth (2x) operation, a 1x1 conv layer, another ReLU activation layer and finally a depth-to-space (3x) operation. This approach uses space-to-depth and depth-to-space. Implementing this sequence of layers in resource constrained environments, e.g. on a CPU, results in bottlenecks. In contrast, Figure 10 illustratively shows an example sequence of layers of a neural network or compression pipeline, for example one or more of the neural networks or compression pipelines shown in any of Figures 1 to 8, but where the space-to-depth and depth-to-space operations have been replaced by convolution operations. For example, the depth-to-space replacement comprises a strided transposed convolution and the space-to-depth replacement comprises a strided convolution. This implementation substantially reduces the bottlenecks when running in resource-constrained environments such as on a CPU. Concept 3: Regional Kernel Absolute Deviation (RKADe) for Flow In image processing, Mean Absolute Difference (MAD) is a technique for detecting and numerically estimating differences between pixels and/or pixel patches, that is, differences
between the values of one or more pixels in one image and the values of one or more pixels in another image. In the context of AI-based compression pipelines, MAD may be used in the estimation of cost volumes in order to estimate flow as part of a flow-residual compression pipeline such as that shown in Figures 3, 11 and 14. Figure 3 has already been described above. Figure 11 illustrates an example of a flow module, in this case a network 1100, configured to estimate information indicative of a difference between an image x_t−1 and an image x_t, e.g. flow information. Figure 11 is provided as a non-limiting example of how such flow information may be calculated between two images. The flow module may be used in or together with the flow module part 206 (Figure 3) of the compression pipeline. An alternative approach to estimating flow is described in Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using gans for detail synthesis and propagation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference. The network 1100 in Figure 11 comprises a set of layers 1101a, 1101b respectively for an image x_t−1 and an image x_t from respective times or positions t−1 and t of a sequence of frames. The set of layers 1101a, 1101b may define one or more convolution operations and/or nonlinear activations (for example as described in concept 2 above) to sequentially downsample the input images to produce a pyramid of feature maps for different levels of coarseness or spatial resolution. This may comprise performing h/2 x w/2 downsampling
in a first layer, h/4 x w/4 downsampling in a second layer, h/8 x w/8 downsampling in a third layer, h/16 x w/16 downsampling in a fourth layer, h/32 x w/32 downsampling in a fifth layer, h/64 x w/64 downsampling in a sixth layer, and so on. It will of course be appreciated that these downsampling operations and levels of coarseness or spatial resolution of a pyramid feature map are exemplary only and other layers, operations and levels are also envisaged. For example, other operations, not only those from concept 2 above, may be used. With the downsampling operations performed and the corresponding pyramid of feature maps generated, a first cost volume 1102 is calculated at the most coarse level between the feature map pixels of the first image x_t−1 and the corresponding feature map pixels of the second image x_t. Cost volumes define the matching cost of matching the pixels in one image with the pixels in a second image (which may be later in time, or earlier in time, for example due to the order in which B-frame processing typically occurs, which is not necessarily the chronological order of the frames). That is, the closeness of each pixel, or a subset of all pixels, in the initial image to one or more pixels in the later image is determined with a measure of similarity such as a vector or dot product, a cosine similarity, a mean absolute difference, or some other measure of similarity. This metric may be calculated against all pixels in the later image, or only for pixels in a predetermined search radius such as a 1-10 pixel radius (preferably a 1, 2, 3, or 4 pixel radius), or some other radius as described in connection with concept 4 below, around the pixel coordinate corresponding to the pixel against which the closeness metric is being calculated. This process is computationally expensive in floating point space but can be implemented efficiently in integer or fixed point space.
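A non-limiting sketch of such a local cost volume, using a mean absolute difference over a fixed search radius, is given below; the feature map layout and the radius value are illustrative assumptions.

import torch
import torch.nn.functional as F

def cost_volume(feat_a, feat_b, radius=3):
    # feat_a, feat_b: [B, C, H, W] feature maps for frames t-1 and t.
    # Returns [B, (2*radius+1)^2, H, W]: one MAD value per candidate displacement.
    b, c, h, w = feat_a.shape
    padded = F.pad(feat_b, (radius, radius, radius, radius))
    costs = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat_a - shifted).abs().mean(dim=1))  # MAD over channels
    return torch.stack(costs, dim=1)

Computed in this form over small radii and coarse feature maps, the cost volume remains inexpensive; larger radii at fine resolutions quickly become costly, which is one reason the pyramid structure described above is used.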
Once the first cost volume 1102 at the coarsest level is estimated, a first flow 1103 can be estimated from the first cost volume 1102. This may be achieved using, for example, a flow extractor network, which may comprise a convolutional neural network comprising a plurality of layers trained to output a tensor defining a flow map from the input cost volumes. Other methods of calculating flow information from cost volumes will also be known to the skilled person. The same process is then repeated for the other levels of feature map coarseness to calculate a second cost volume 1104 and second flow 1105, and so on for the cost volumes and flows associated with each of the levels of coarseness until they have all been calculated, up to the final cost volume 1106 and flow 1107. The weights and/or biases of any activation layers in network 1100 (e.g. optionally in the downsampling convolution layers and/or in a flow extractor network that produces flow maps from the cost volumes) are trainable parameters and can accordingly be updated during training either alone, or in an end-to-end manner with the rest of the compression pipeline. The trainable nature of these parameters provides the network 1100 with flexibility to produce feature maps at each level of spatial resolution (i.e. pyramid feature maps) and/or flow outputs that are forced into a distribution that best allows the network to meet its training objective (e.g. better compression, better reconstruction accuracy, more accurate reconstruction of flow, and so on). For example, it allows the network 1100 to produce feature maps that, when cost volumes and/or flows are calculated therefrom, produce cost volumes or flows whose distribution roughly matches the latent representation distribution that would previously have
been expected to be output by a dedicated flow encoder module. This effectively allows a dedicated flow encoder to be omitted entirely from the flow compression part of the pipeline. Optionally, for each level of coarseness or resolution, the flow of the previous level or levels of coarseness or resolution may be used to warp 1108, 1109 the feature maps before the cost volume is calculated. This has the effect of artificially reducing the amount of relative movement between the pixels of the t and t-1 images or feature maps when calculating the cost volumes, reducing flow errors for high movement details. Optionally, removing warping entirely, or in some levels of coarseness or resolution, can substantially reduce the run-time of flow calculation while maintaining good levels of flow accuracy. As the warping process uses inputs from different levels of coarseness or spatial resolution, the flow estimation output may be upsampled 1110, 1111 (for example using the methods of concept 2, or using any other upsampling method) first to match the coarseness or spatial resolution of the feature map to which the flow is being applied in the warping process. The outputs of the flow module may accordingly be one or more cost volumes or some representation thereof, and/or one or more flows or some representation thereof. The flow or representation thereof may then be transmitted in a bitstream and decoded by a flow decoder, the output of which may in turn be used to warp a previously decoded image for use in the residual encoder/decoder arrangement as shown in e.g. Figure 3.
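By way of a non-limiting, illustrative sketch only, the coarse-to-fine flow estimation described above might be realised as follows, assuming PyTorch. The class and helper names (PyramidFlowSketch, correlation_cost_volume, warp), the channel counts and the number of levels are assumptions for illustration and are not the specific architecture of network 1100.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def correlation_cost_volume(a, b, radius=3):
    # Channel-wise correlation between a and every shift of b within the radius.
    h, w = a.shape[-2:]
    b_pad = F.pad(b, (radius, radius, radius, radius))
    vols = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = b_pad[..., dy:dy + h, dx:dx + w]
            vols.append((a * shifted).mean(dim=1, keepdim=True))
    return torch.cat(vols, dim=1)  # (N, (2r+1)^2, H, W)


def warp(x, flow):
    # Bilinearly sample x at locations displaced by a per-pixel flow (in pixels).
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)    # (1, 2, H, W)
    coords = base + flow
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)


class PyramidFlowSketch(nn.Module):
    def __init__(self, channels=32, levels=4, radius=3):
        super().__init__()
        self.radius = radius
        # Strided convolutions build the h/2, h/4, ... feature pyramid.
        self.downs = nn.ModuleList(
            [nn.Conv2d(3 if i == 0 else channels, channels, 3, stride=2, padding=1)
             for i in range(levels)])
        # A small per-level "flow extractor" maps a cost volume to a 2-channel flow.
        self.extractors = nn.ModuleList(
            [nn.Conv2d((2 * radius + 1) ** 2, 2, 3, padding=1) for _ in range(levels)])

    def pyramid(self, x):
        feats = []
        for down in self.downs:
            x = F.relu(down(x))
            feats.append(x)
        return feats  # fine -> coarse

    def forward(self, x_prev, x_cur):
        f_prev, f_cur = self.pyramid(x_prev), self.pyramid(x_cur)
        flow = None
        for lvl in reversed(range(len(f_prev))):   # start at the coarsest level
            a, b = f_prev[lvl], f_cur[lvl]
            if flow is not None:
                # Upsample the coarser flow and warp the reference features with it.
                flow = 2.0 * F.interpolate(flow, size=a.shape[-2:], mode="bilinear",
                                           align_corners=False)
                a = warp(a, flow)
            cv = correlation_cost_volume(a, b, self.radius)
            delta = self.extractors[lvl](cv)
            flow = delta if flow is None else flow + delta
        return flow
```

As with the network 1100 described above, the per-level extractor weights in such a sketch are trainable and could be updated end to end with the rest of a compression pipeline.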
Returning now to the estimation of cost volumes, the cost volumes may be used to compute local (translational) alignment through patch-wise comparisons. For example, let x, y ∈ R^{C×H×W} be two tensors each with C ∈ N channels and spatial dimensions H × W ∈ N². A standard cost volume to compute is

CV(x, y)_{i,o} := ⟨x_{(:,i)}, y_{(:,i+o)}⟩,

where the subscripts denote i ∈ [H] × [W] and o = (o_1, o_2) ∈ {−r, ..., r}², with r ∈ N being the radius of the cost volume. The function ⟨·, ·⟩ above is the inner product, meaning that the cost volume is constructed by computing the channel-wise correlation. MAD may be used in the calculation of cost volumes as follows. Let P : R^{C×H×W} → R^{C×(2s+1)²×H×W} be a patch operator that associates to each pixel i ∈ [H] × [W] the (2s+1) × (2s+1) patch centered at pixel i, for an integer s ≥ 0. Then the MAD cost volume is defined by

cv_mad(x, y)_{i,o} := ∥P(x)_i − P(y)_{i+o}∥_1,   i ∈ [H] × [W],   o ∈ {−r, ..., r}².
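A minimal, illustrative sketch of a direct (naive) computation of such a MAD cost volume is given below, assuming PyTorch and, for brevity, the s = 0 case in which each pixel is represented by its channel vector rather than a larger patch. The function name and shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def mad_cost_volume(x, y, radius=3):
    """x, y: tensors of shape (C, H, W). Returns a ((2r+1)^2, H, W) MAD cost volume."""
    c, h, w = x.shape
    # Pad the reference so every offset o in {-r, ..., r}^2 is defined.
    y_pad = F.pad(y, (radius, radius, radius, radius))
    costs = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = y_pad[:, radius + dy: radius + dy + h,
                               radius + dx: radius + dx + w]
            # L1 norm of the channel-wise difference at each pixel.
            costs.append((x - shifted).abs().sum(dim=0))
    return torch.stack(costs, dim=0)
```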
Exemplary r values that may be used include r = 1 (giving comparisons of 3x3 patches), r = 2 (giving comparisons of 5x5 patches), r = 3 (giving comparisons of 7x7 patches), and so on. The MAD-based cost volume estimation is more computationally efficient than other known cost volume estimation methods and accordingly synergistically helps to reduce run times of the flow estimation of an AI-based compression pipeline. However, the present inventors have
found that implementing the operations used to perform MAD calculations at the machine level has a number of downsides, particularly when the MAD calculations are implemented using convolutions. Specifically, an input tensor of arbitrary C × H × W dimensions is typically stored in non-contiguous blocks of memory. For example, in one example, the values of the elements in the spatial dimensions h and w for one channel c of an input tensor may be stored in a first block of memory, while the values of the elements in the spatial dimensions h and w for another channel c′ may be stored in a second block of memory, and so on. When a MAD value is estimated that involves this multi-channel input tensor, the values of the elements of h and w are accessed from one of the memory blocks, the values of the elements of h and w are accessed from another of the memory blocks, and so on. This means the number of memory access operations can be very high even for relatively simple operations. Consider for example the following pseudo code for implementing the above-described MAD approach:
1) Receive input tensor input_1 representing a first image, and input tensor input_2 representing a second image.
2) Apply repeated interleaving of input_1 to obtain tensor x of desired shape: x = repeat_interleave(input_1).
3) Apply flattening and unfolding of input_2 to obtain tensor y of desired shape: y = flatten_unfold(input_2).
4) Calculate absolute difference between tensors x and y: absolute_differences = absolute_difference_calculation(x, y).
5) Sum the absolute differences with a strided sum operation: mad_output = strided_sum(absolute_differences).
In the above example, the strided sum operation comprises depth or channel-wise grouped convolutions (i.e. convolutions applied in depth-wise groups, each group taking as inputs the values of the spatial dimensions h and w stored in non-contiguous memory blocks). That is, the stride is in the depth (i.e. channel) dimension, necessitating the access and retrieval of the values of the elements of h and w stored in separate memory blocks. Alternatively, if the grouping of the convolutions is in the spatial dimensions rather than the depth dimension, the non-contiguous memory block problem still arises but now in the depth dimension. In other words, the use of grouped convolutions results in a convolution-based MAD approach that has a memory access bottleneck during run time. Accordingly, implementing MAD-based, naive, patch-wise comparisons using convolutional operations on most commercial hardware CPUs, GPUs and NPUs (i.e. neural accelerators) is slow due to the interleaving of data in memory that results from the order in which the operations of the convolutions are performed. A schematic of this strided sum based implementation of a MAD cost volume calculation is shown in Figures 12a, 12b and 12c. Illustrated in Figure 12a is a toy representation of input tensors 1200a, 1200b respectively associated with a first image and a second image. Each
input tensor has 3 channels: channel 1 (Ch1), channel 2 (Ch2) and channel 3 (Ch3). In a first step, repeat interleaving 1201 is performed on the first image input tensor 1200a to produce a first intermediate output. In a second step, an unfold convolution operation 1202 is performed on the second image input tensor 1200b to produce a second intermediate output. The unfold convolution operations are performed as group convolutions and accordingly each group is assigned its own memory block. In a third step 1203, an absolute difference is estimated between the intermediate outputs of the first step and the second step to produce a third intermediate output. As is shown in Figure 12b, the elements of the third intermediate output are stored in said respective, different memory blocks - in this case memory blocks 1, 2 and 3 (Mb1, Mb2, Mb3) to match the number of outputs. Finally, a strided sum 1204 is performed on the estimated absolute differences stored in the respective memory blocks 1, 2, and 3 to produce the MAD output cost volume tensor. Again, by virtue of the group convolutions of the strided sum 1204 and by virtue of the memory block locations, this operation requires non-contiguous memory blocks to be accessed for each of the convolutions of the strided sum 1204, as is illustrated in Figure 12c. Whilst it would in principle be possible to introduce a shuffling of the layers of the input tensor earlier in the flow, for example before the unfold convolutions and/or before the strided sum convolutions, this introduces an additional shuffling operation which is also slow given the number of memory read and write operations required to complete a full interleaving or de-interleaving process, and can also further complicate any other parts of the AI compression pipeline that rely on this information in unshuffled form. The additional shuffling operation to resolve the downstream issue is accordingly undesirable and does not result in appreciable run-time improvements.
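An illustrative sketch of the unfold-based pseudocode above (steps 1 to 5) is given below, assuming PyTorch and the s = 0 per-pixel case; the function name is an assumption. It reproduces the channel layout that forces the strided (grouped) reduction discussed above.

```python
import torch
import torch.nn.functional as F


def mad_cost_volume_unfold(input_1, input_2, radius=3):
    """input_1, input_2: (N, C, H, W). Returns (N, (2r+1)^2, H, W)."""
    n, c, h, w = input_1.shape
    k = 2 * radius + 1

    # Step 2: repeat-interleave the first input so each channel appears once per
    # offset position: channel layout [c0, c0, ..., c1, c1, ...].
    x = input_1.repeat_interleave(k * k, dim=1)

    # Step 3: unfold the second input into the same (C * k^2) channel layout, one
    # shifted copy of each channel per offset position.
    y = F.unfold(input_2, kernel_size=k, padding=radius).view(n, c * k * k, h, w)

    # Step 4: element-wise absolute differences.
    absolute_differences = (x - y).abs()

    # Step 5: "strided sum" over channels for each offset. In a purely convolutional
    # deployment this reduction is a grouped 1x1 convolution, which is the
    # non-contiguous memory access bottleneck discussed above.
    mad_output = absolute_differences.view(n, c, k * k, h, w).sum(dim=1)
    return mad_output
```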
To address these and other problems, it is helpful to first consider in more detail the purpose of the cost volume in the presently described AI-compression pipeline. Specifically, the goal of the cost volume is to construct a spatial comparison operator that encodes some notion of how two patches are related. For a given pixel i, the cost volume thus has a measure of comparison between x_i and y_{i+o} for a collection of local offsets o. The present inventors have realised that any effective encoding of this information can suffice in AI-compression pipelines because the neural networks that make up the flow encoder and/or decoder and residual encoder and/or decoder, and indeed any other neural networks that make up the AI-compression pipelines of Figures 1-14, are able to learn to accommodate the encoding of this information, regardless of how it is represented. It cannot be overstated how significant an advantage this is for end-to-end AI-based compression pipelines over traditional compression methods. In particular, this facilitates the use of approaches to estimating cost volumes that simply are not viable in traditional compression pipelines. Presently described concept 3 is directed to such approaches, which are described herein as compressive approaches. That is, the use of a compressive encoding of the measure of comparison between x_i and y_{i+o} for a collection of local offsets o is made possible, and in particular a compressive cost volume estimation (i.e. a compressively encoded cost volume) is made possible. In traditional, local cost volume approaches, a pixel is compared to pixels in a given reference image that lie within a given radius. In the context of flow estimation, this may be the comparison of a pixel in a first image to pixels in a radius around a corresponding pixel coordinate in a second image. For example, in a radius r local cost volume, one must compare
a pixel to (2r + 1)² reference pixels (e.g., for radius 3 there are 49 reference pixels in a 7x7 block). In classical, local cost volume approaches a version of this mapping might be:

Φ_i(o) = |x_i − y_{i+o}|.

This approach effectively comprises a naive patch-wise comparison of the two images. An inductive bias for structured data suggests that the above map is approximately sparse in a basis, meaning that the data can be equivalently represented in a low-dimensional subspace approximately logarithmic in the dimension of the ambient space. Thus, a compressive encoding of the cost volume may look like:

A · Φ_i(o),

where A ∈ R^{m×n} is an appropriately chosen random matrix satisfying m ∼ O(log n).
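A toy, illustrative sketch of such a compressive encoding is given below, assuming PyTorch; the choice of compressed dimension m and the normalisation of the random matrix are assumptions for illustration only.

```python
import torch

radius = 3
n = (2 * radius + 1) ** 2              # 49 local offsets per pixel
m = 8                                   # compressed dimension, m << n

phi = torch.rand(n, 64, 64)             # toy per-pixel local cost volume Phi_i(o)
A = torch.randn(m, n) / n ** 0.5        # fixed random projection, roughly norm-preserving

# Apply A pixel-wise: (m, n) x (n, H*W) -> (m, H*W), reshaped back to (m, H, W).
compressed = (A @ phi.view(n, -1)).view(m, 64, 64)
```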
From this, the present inventors have realised that it is possible to build a learned mapping that replaces the above-described classical, naive patch-wise comparison approach of estimating local cost volumes, which requires a large number of operations. The learned mapping effectively bypasses the direct computation of the local cost volume and instead computes the lower dimensional compressively encoded cost volume A · Φ_i(o) directly. This compressively encoded cost volume, or a representation thereof produced by a post-processing step, contains substantially the same information as a traditional cost volume, but this information is provided in a lower dimensional representation that can still be passed to any subsequent, downstream components of the AI-compression pipeline that rely on cost volumes in the usual way. Examples of downstream components may include layers in the flow modules, the final
estimation of flow, the warping in or after the flow module, and so on. Given that these downstream components are agnostic as to how they receive the cost volume information, since end-to-end training allows them to adapt to whatever form the information is received in, the compressively encoded cost volume facilitates the estimation of image differences far more efficiently than traditional cost volume calculation methods. This approach is described hereinafter as Regional Kernel Absolute Deviation (RKADe) and allows for any MAD operations in the AI-compression pipeline to be substituted by RKADe operations, which mitigate the memory interleaving issue described above in connection with using MAD and naive patch-wise comparisons for cost volume estimation, and generally provide a way to more efficiently estimate differences between images. The inventors have found that run-time speed ups with very little additional optimisation were observed when RKADe was tested across a number of widely available commercial CPUs, such as the Mac M2 chip and the Qualcomm Snapdragon chip. For example, in some instances, RKADe results in a greater than 50% runtime reduction on widely-used standard Intel processors, as compared to a purpose-built custom MAD cost volume implementation. At a general level, a realisation of the present inventors is that, where a mapping (e.g. an input, an operation applied to that input, and an output) is sparse in a basis (e.g. in the case of a representation of an image such as a flow or residual between two images represented by mostly zeros), the mapping can be almost identically represented in fewer dimensions. This facilitates the implementation of that mapping in a simplified manner. In the case of flow estimation in image sequences, there is often very little movement so cost volumes are
typically sparse in a transform of the spatial domain. This means that the calculation of cost volumes representing the difference between two images can be substantially simplified. For example, if a single pixel in one image is being compared to a pixel patch within a pixel radius of 3 around the corresponding coordinate in a second image, this would entail 49 pixel comparison operations to obtain the cost volume associated with that pixel using traditional methods. It turns out that the vast majority of these pixel operations relate to redundant information when the input and/or output are sparse, as there is little or no difference. In such circumstances, the cost volume can more efficiently be estimated by applying a fixed or learned map to the input to produce an identical or almost identical cost volume. Applying a map, e.g. a linear map through a number of convolution operations, is computationally efficient and fast and provides a lower-dimensional representation of otherwise the same information that would have been provided in a cost volume estimated using traditional methods. Further, the compressive cost volumes that are computed using the RKADe approach come with an additional, substantial run-time saving, because any downstream tensor operations (TOPs) that take place are in the lower dimensions of the compressive cost volume compared to the higher dimensions of the traditional cost volume; and none of the steps comprising RKADe requires grouped convolutions, slow memory access or array operations like de-interleaving. An exemplary pseudocode implementation of RKADe is provided below:
1) Receive input tensor x_t representing a first image, and input tensor x_ref representing a second image.
2) Generate a feature map of x_t: feature_map_x_t = feature_map(x_t)
3) Generate a feature map of x_ref: feature_map_x_ref = feature_map(x_ref)
4) Generate a region map of x_ref from the feature map of x_ref: region_map_x_ref = region_map(feature_map_x_ref)
5) Generate transform maps of x_t and region_map_x_ref to ensure the same shape to facilitate direct comparison:
transform_map_x_t = transform_map(x_t)
transform_map_region = transform_map(region_map_x_ref)
6) Calculate absolute difference: RKADe_output = absolute_difference(transform_map_x_t, transform_map_region).
Figure 13 illustratively shows the above steps of RKADe. As shown in Figure 13, an illustrative RKADe workflow on a toy example comprises four elements: a feature map F, a region map R, a transform map T, and an absolute difference Δ. The feature map F, defined by one or more layers comprising one or more filters defined by a plurality of weights, operates as a linear embedding of the input tensor into a feature
space. It is a local embedding in the sense that it may comprise a 1x1 convolution. Efficient implementations of 1x1 convolutions on a wide variety of commercial CPUs, GPUs and NPUs are known to the skilled person. However, unique in the case of RKADe is that the feature map operation filters may comprise random weights and do not need to be trained (although it is envisaged that they may be trained in some circumstances). The feature map operation weights may be randomly distributed weights, and the operation relies on the favourable properties of high-dimensional random embeddings to preserve local geometry. Accordingly, the feature map F operation filters are instantiated using random weights with a normalisation that makes the map an isometry in expectation. That is, the feature map operation filters comprise weights that apply a transformation that, on average, preserves geometrical distances of the distribution it is applied to. In other words, the feature map operation preserves a norm of the inputs in expectation. Thus, if a convolution defining a feature map is an isometry in expectation, the norm of the input to which the convolution is applied is preserved in the output (in a probabilistic sense). These fixed, random weights of F and the isometry-in-expectation property of F effectively mean that, during inference, estimating differences between two images comprises applying a convolution with the random weights that the convolution was initialised with, rather than weights modified in some way during subsequent training. An example of a suitable random distribution of weights of F is any suitably initialised sub-gaussian distribution. Here, suitably initialised means initialised based on the shape of the input and/or output tensors. The region map R operation comprises a composition of 3x3 convolutions, optionally with intermediate non-linearities such as one or more ReLU maps or activations. The purpose of the
region map R is to entangle, in an output pixel, the information present in a local patch about the same pixel in the input (reference) image. Here, entangling information means combining information. The "radius" of the local patch is determined by the number of convolutions in R (e.g., three 3x3 convolutions give a 7x7 patch). Because R comprises 3x3 convolutions and optionally simple non-linearities, it is possible to use known, efficient 3x3 convolution implementations to run efficiently on a wide variety of commercial CPUs, GPUs and NPUs. As above with the 1x1 convolutions, R is instantiated using a weight normalisation that makes it an isometry in expectation. The transform map T operation serves as a post-embedding or post-processing of the feature embedding and the entangled patch information, permitting the effective comparison of the two. This transform map T operation can be a simple 1x1 convolution, thereby being local, linear, fast, and efficient to implement across a wide variety of commercially available CPUs, GPUs and NPUs. As above, T may be instantiated using a weight normalisation that makes it an isometry in expectation. In some implementations, the weights of the transform map T may correspond to those of the feature map F. Further, the number of input channels for F can naïvely be set to any positive integer. The number of output channels of F may match that of R, and hence the number of input and output channels of R must be the same. The number of input channels of T does not need to match its number of output channels. Note for completeness that the number of input channels of F can be set naïvely because, if the mathematical relationships of RKADe are to hold in practice, it is envisaged that there is a sufficient dimensional relationship between the pixel radius (encoded by the number of layers
of R) and the number of output channels of F. This ensures that the shapes of the objects being compared match when the absolute difference is subsequently calculated. As shown in Figure 13, the feature map F is applied to a first input tensor representation 1300a of a first image, for example a current frame x_t of a sequence of images, and a second input tensor representation 1300b of a second image, for example a previous frame x_ref of the sequence which may contain movement relative to the first image. For each input pixel of the first input tensor representation 1300a, a comparison is made with pixels at coordinates within a predetermined radius around the corresponding coordinate in the second input tensor 1300b. An illustrative toy example of a 1 pixel radius around a center pixel coordinate is indicated with the dotted borders in Figure 13. The feature map convolution operation F is applied to the pixels of the first input tensor 1300a and the associated patches of the predetermined radius in the second input tensor 1300b. The region map R convolution operation is then applied to the output of the feature map convolution operation on the second input tensor 1300b. A transform map T convolution operation is then optionally applied to the intermediate outputs, before an absolute difference is estimated, resulting in the RKADe cost volume tensor. It will be appreciated from Figure 13 that none of the feature map F convolution, the region map R convolution or the transform map T operation requires grouped convolutions. Accordingly, the intermediate outputs may be easily stored in contiguous memory blocks without requiring a large number of memory read and write operations to interleave or de-interleave the data in memory. As a result, cost volume estimations in the RKADe approach are substantially sped up compared to traditional, naive patch-wise comparison approaches.
It is also noted that F, R and/or T may be kept fixed (e.g. F's weights may be fixed and randomly distributed between minimum and maximum values, as described herein), or may be trained. Keeping the maps fixed facilitates straightforward deployment by substituting any naive patch-wise comparisons in AI-compression pipelines (e.g. a large number of MAD-based operations). However, when fixed, the expressivity of the maps may be reduced and thus the overall accuracy of this approach may be reduced. Conversely, training F, R and/or T increases the expressivity of RKADe but introduces greater complexity in training and inference pipelines and can adversely affect training stability of an end-to-end trained AI compression pipeline by virtue of the introduction of a further trainable element. The choice of using fixed or trained F, R and/or T may accordingly depend on a complexity-accuracy-run-time trade off for a given application. In exemplary embodiments, it is envisaged that the weights of R and T are trained or learned, whereas the weights of F are random and fixed. Combining these components together to get the absolute difference Δ that defines RKADe, let x, y ∈ R^{c_in×H×W}, let F and T be convolutions with kernel sizes [c_out, c_in, 1, 1] and [d_out, c_out, 1, 1] respectively, and let the convolutions
comprising R have kernels with size [c_out, c_out, 3, 3]. Mathematically, one may write RKADe as

RKADe(x, y) := |T (Fx − R(Fy))|.

Above, note that F is a linear map operating pixel-wise (i.e. performing the same operation on each c_in-dimensional pixel); that R also acts pixel-wise, but insofar as each pixel is represented by a patch of a given radius that has been induced by the number of convolutions used to construct R; and that T also acts linearly and pixel-wise on each c_out-dimensional pixel. By virtue of its construction using elementary convolutional components, RKADe runs on hardware
without the requirement of grouped convolutions, and thus avoids the slow memory access and array operations (such as de-interleaving) of cost volume approaches such as naive patch-wise MAD. As alluded to above, F, R and/or T may be individually or separately trainable, for example by setting a "requires_grad_" PyTorch or similar flag in their respective code-level implementations to permit backpropagation through them. However, it is envisaged that F is generally kept fixed while R and/or T are trainable. In this case, they may be trained in an end-to-end manner with the rest of the pipeline, whereby the weight values are simply additional parameters, on top of the other parameters of the pipeline, that may be updated during back-propagation. More specifically, it has been found that end-to-end training requires no special auxiliary loss terms to guarantee stability during training. Indeed, advantageously, the F, R and T maps are "plug-and-play" onto the rest of the AI compression pipeline during training and subsequent inference. Optionally, student-teacher training of the weights of the R and T maps (and optionally F) is also effective and achieves good training stability out of the box without difficulty. Multiple approaches to student-teacher training are envisaged. For example, the teacher may be set up to push the training towards a fine-grained level that represents similar features to classical cost volumes, or at a less granular level where the teacher comprises a flow network with MAD cost volumes and the student comprises a flow network with RKADe cost volumes, with the loss based on the difference between these two. Alternatively, the teacher may be set up to push the training towards some other objective and may be configured accordingly.
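A minimal sketch of the RKADe operator defined above is given below, assuming PyTorch. The channel counts, the use of three 3x3 convolutions in R, and the particular uniform initialisation bound are illustrative assumptions rather than the specific configuration of the pipeline; the bound used here is simply one normalisation under which uniformly distributed weights preserve the input norm in expectation.

```python
import math
import torch
import torch.nn as nn


def init_expected_isometry_(conv):
    # Uniform weights in [-a, a]; the bound is an illustrative choice intended to
    # make the convolution roughly norm-preserving in expectation.
    c_out, c_in, k1, k2 = conv.weight.shape
    a = math.sqrt(3.0 / (c_out * k1 * k2))
    nn.init.uniform_(conv.weight, -a, a)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)


class RKADe(nn.Module):
    def __init__(self, c_in=3, c_feat=32, c_out=16, n_region_convs=3):
        super().__init__()
        # F: local 1x1 embedding with fixed random weights.
        self.F = nn.Conv2d(c_in, c_feat, 1, bias=False)
        # R: composition of 3x3 convolutions (three here, giving a 7x7 patch).
        self.R = nn.Sequential(*[nn.Conv2d(c_feat, c_feat, 3, padding=1, bias=False)
                                 for _ in range(n_region_convs)])
        # T: post-embedding 1x1 convolution.
        self.T = nn.Conv2d(c_feat, c_out, 1, bias=False)
        for m in [self.F, *self.R, self.T]:
            init_expected_isometry_(m)
        # F may be kept fixed; R and T may be trained end to end with the pipeline.
        self.F.weight.requires_grad_(False)

    def forward(self, x_t, x_ref):
        # RKADe(x, y) = |T(Fx - R(Fy))|
        return (self.T(self.F(x_t) - self.R(self.F(x_ref)))).abs()
```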
More generally, if there are multiple training stages of the compression pipeline (e.g. pre-training and main training), the weights of F, R and/or T may be frozen at different times during these stages by setting the "requires_grad_" flag appropriately at different training steps. By way of illustrative example, consider a scenario where F is fixed and R and T are learned: R and T may be frozen with initialisation weights during the initial pre-training phase of the compression pipeline before being unfrozen and trained during the main training phase. This approach ensures that the weights of R and T are updated based on a stronger training signal from the rest of the neural networks of the compression pipeline in order to decrease overall convergence time, thereby speeding up training. As described above, the distribution (i.e. values) of the initialisation weights of F, R and/or T may be random, based on some predetermined distribution, or based on prior knowledge obtained from experimentation to provide a warm-started signal from the rest of the model at the point when one or more of F, R and/or T unfreeze and become trainable. In one illustrative example, the initialisation weights may be initialised using any appropriately normalised sub-gaussian distribution producing a map that is an isometry in expectation (for example, a Gaussian distribution, a truncated Weibull distribution, or a uniform distribution). In some embodiments, it is envisaged that, during training, the property of being an isometry in expectation is maintained as the weights of F, R and/or T, as applicable, are adjusted. This property may be enforced using Jacobian regularisation, such as that described in concept 5 below. Alternatively, the isometry preserving property may only be retained upon initialisation,
and training will gradually eliminate that property as the weights converge to some final values. Finally, it is also envisaged that optionally some of the weights of F, R, and/or T may be kept fixed, while others may be trained, for example to avoid significant departure from the isometry preserving property during training. In an illustrative toy example, the weights of F, R and/or T may be initialised by generating randomly distributed values uniformly distributed between a minimum value and a maximum value, whereby the minimum and maximum value are based on a number of input or output channels on which F, R and/or T are applied, based on a kernel size of F, R and/or T, and/or on a pixel radius across which the RKADe cost volume is to be calculated (e.g. a 3 pixel radius resulting in a comparison against a 7x7 pixel patch centered on a given pixel coordinate). In other words, the minimum and maximum values are based on the dimensions of the input tensor. For example, an initialisation weight distribution can be defined as:

weight_distribution = uniform_noise(−a, a),

where −a is the minimum value and a > 0 is the maximum value of the uniform random weight distribution defined by:

a = sqrt( input_channels / (output_channels · kernel_dimension_one · kernel_dimension_two) ).

A filter comprising weights distributed as above preserves the norm of the input to which it is applied, in expectation. As described above, optionally, Jacobian regularisation may be applied during training to ensure the isometry in expectation is preserved even as the weights are updated. Alternatively, the weights may be free to lose this property if the training results in them doing so. Of course, if any of F, R, and/or T are fixed, then the property will be preserved
naturally, as the initialised weights do not change. It is particularly envisaged that F may be fixed in this way, while the weights of R and/or T may be trained. Optionally, the training regime may be implemented with one or more switches that specify exactly when during training to freeze or unfreeze the trainable parameters of F, R and/or T based on some predetermined one or more conditions (such as a number of iterations, a loss threshold, or other conditions). Where one or more of the F, R and/or T maps are fixed and, for example, there are no non-linearities such as ReLUs in any of F, R and/or T, the weights are initialised as described above, e.g. with random weights for R and/or experimentally determined weights for F and T, and there are no special loss or training considerations that need to be taken into account as the maps are simply linear mappings. Whilst concept 3 has been described in the context of using spatial comparisons to estimate flow between two images in an image sequence as part of a flow-residual-based AI compression pipeline, it is envisaged that it may be used anywhere where spatial comparisons are calculated, both in and outside of the neural image and video compression domains, as is described in more detail below in concept 4.
Concept 4: Regional Kernel Absolute Deviation (RKADe) for General Computer Vision and Image Processing Tasks
Spatially comparing two images is widely performed across various technical domains. However, naive patch-wise comparisons (using any measure of similarity) are slow on typical commercial hardware. For example, if naive patch-wise comparisons using MAD operations are used and implemented as convolutions on existing commercial CPUs, GPUs and NPUs, the patch-wise comparisons introduce a bottleneck to run time. This is particularly problematic for technical domains where low latency and fast run times are critical to functionality. Accordingly, run-time advantages are realised for any use case that is presently implemented with patch-wise comparisons (using any measure of similarity), for example any use case currently implemented as a MAD-based patch-wise comparison. In contrast, RKADe is faster than naive patch-wise comparison operations in any computer vision and/or image processing task because it is a compressive approximation of such a patch-wise comparison. A first example use case where run-time improvements may be realised with RKADe is in the generation of bounding boxes around image patches where movement is to be detected. Consider a first image at a first time and a second image temporally separated from the first image. The objects in the second image have moved relative to their positions in the first image. In computer vision tasks such as surveillance, satellite image comparisons, drone navigation, and others, detecting such movement is a common task as it facilitates tracking of objects across different views and across time. One approach to such detection is to generate a bounding box around objects whose pixels differ between the first and second images.
One approach to generating such bounding boxes is to divide the first and second images into grids, and to compare the pixel values in the first image to individual pixels or groups of pixels in the second image. One way to make this comparison is to use MAD implemented as convolutions. If the MAD for a given pixel or patch of pixels exceeds a threshold, that pixel or patch of pixels may be identified as a movement-containing patch (whereas any that do not exceed the threshold may be identified as static patches). The boundaries of one or more bounding boxes may then be generated that encompass some or all of these movement-containing patches and used to identify the moving object across frames. As described above, naive patch-wise comparisons implemented as convolutions are a run-time bottleneck. Accordingly, the RKADe approach from concept 3 may be applied to the bounding box task:

RKADe(x, y) := |T (Fx − R(Fy))|,

where x is the first image, y is the second image, and T, F and R are as described in connection with concept 3. This approach to detecting moving objects facilitates the running of object detection locally and in real time on end devices such as on board a camera-based surveillance system, a drone such as a small quadcopter, and other hardware platforms that typically do not have access to powerful processor resources, or where power draw is at a premium due to hardware constraints such as battery life.
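A toy, illustrative sketch of the bounding box step is given below, assuming PyTorch: a per-patch difference map (computed, for example, with an RKADe operator such as that sketched in concept 3, or any other patch-wise comparison) is thresholded and a single box is drawn around the moving pixels. The function name and threshold are assumptions for illustration.

```python
import torch


def movement_bounding_box(diff_map, threshold=0.5):
    """diff_map: (H, W) per-pixel or per-patch difference scores."""
    moving = diff_map > threshold
    if not moving.any():
        return None  # static frame pair: nothing to track or transmit
    ys, xs = torch.nonzero(moving, as_tuple=True)
    # Box corners (y_min, x_min, y_max, x_max) enclosing all movement-containing patches.
    return ys.min().item(), xs.min().item(), ys.max().item(), xs.max().item()
```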
More generally, the bounding box generation approach may also be used in image and video compression pipelines, including both traditional and AI-based compression pipelines. In this case, it may be used to facilitate partial frame-skipping to further reduce the amount of data that needs to be sent to reconstruct an image or image sequence. For example, if objects in two images have hardly changed except for a small number of pixels or pixel patches, the above-described bounding box generation approach may be used to identify and extract those movement-containing patches. Only these movement-containing patches then need to be compressed and transmitted in order for the full image sequence to be accurately reconstructed. That is, on the decode side, the original image sequence can be reconstructed efficiently by stitching together previously decoded static image patches with the newly received movement-containing patches. As described above, applying RKADe to estimating differences between pixel patches instead of a naive patch-wise comparison facilitates a substantial run-time improvement. Other non-limiting use cases where applying RKADe instead of a naive patch-wise comparison results in run-time improvements include:
Image Registration: In medical imaging or remote sensing, images taken at different times or from different sensors need to be aligned or "registered" to each other. Here, the alignment or registering of images to each other with a naive patch-wise comparison (for example where MAD is used as a similarity metric to align these images accurately by finding the transformation
that minimizes the average absolute intensity differences between them) can be replaced by the present RKADe approach for run-time improvements.
Stereo Vision and Depth Estimation: When calculating depth from stereo images, naive patch-wise comparisons using MAD are typically used to compare corresponding patches in the left and right images. The disparity (difference in horizontal position) that minimizes the MAD is often chosen as the correct match, which is then used to compute depth information. Here, the replacement of the naive, MAD-based patch-wise comparison results in run-time improvements in calculating depth; see also the sketch following this list of use cases.
Template Matching: In object detection and computer vision, template matching involves sliding a template image over a target image to find the region that best matches the template. A naive, MAD-based patch-wise comparison is typically used as a measure to find the location where the template and the target image have the least absolute difference, indicating a potential match. Again, replacement of this approach with RKADe results in faster matching times.
Noise Reduction: In image denoising, MAD can be used to compare the local neighborhood of pixels. Filters like the median filter or adaptive filters use MAD to determine the level of noise in a local patch and to adjust the filtering strength accordingly to reduce noise while preserving details. This initial determination of noise levels in a local patch can be achieved faster by applying the RKADe approach.
Quality Assessment: For quality control in manufacturing, naive patch-wise comparisons are used to compare images of a product against a standard reference image. Differences beyond
a certain threshold can indicate defects or deviations from the desired quality. Again, this may instead be implemented with RKADe to provide run-time speed ups and to facilitate running quality control algorithms on edge devices that do not have significant computing power.
Change Detection: In satellite imagery analysis, naive patch-wise comparison can be used to detect changes over time by comparing pixel intensities of the same location across different dates. This is useful in monitoring urban development, deforestation, or the effects of natural disasters. The use of RKADe instead of a naive patch-wise comparison facilitates the running of such change detection algorithms more quickly, thus allowing real-time change detection on resource-constrained devices.
Photogrammetry: In reconstructing 3D models from 2D images, naive patch-wise comparisons can be used to ensure that the matching of pixels across multiple images is accurate, which is crucial for generating a reliable 3D representation. Again, the use of RKADe to replace the naive patch-wise comparisons results in faster run times, allowing photogrammetry systems to reconstruct 3D models in real time on smaller, resource-constrained devices.
In each of these cases, the granular detail that naive patch-wise comparisons provide about the differences between pixel intensities can be obtained more efficiently, faster, and on resource-constrained devices by using an RKADe-based approach instead.
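By way of a toy, illustrative example only, the stereo disparity search mentioned in the use cases above (the naive baseline that RKADe would replace) might be sketched as follows, assuming PyTorch; the function name, patch size and disparity range are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def disparity_mad(left, right, max_disp=16, patch=3):
    """left, right: (1, 1, H, W) grayscale tensors. Returns an (H, W) disparity map."""
    pad = patch // 2
    costs = []
    for d in range(max_disp):
        shifted = torch.roll(right, shifts=d, dims=-1)       # shift right image by d pixels
        ad = (left - shifted).abs()
        # Aggregate absolute differences over a local patch (box filter).
        costs.append(F.avg_pool2d(ad, patch, stride=1, padding=pad))
    cost_volume = torch.cat(costs, dim=1)                     # (1, max_disp, H, W)
    return cost_volume.argmin(dim=1)[0]                       # disparity minimising the MAD
```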
Concept 5: Jacobian penalty for temporal sequence modelling
Neural network training stability refers to how reliably a neural network's learning process converges during training. Training instability manifests as large fluctuations in learning performance, where, during training, the model's loss and/or validation curves vary significantly, or even fail to improve at all, despite using the same data and training parameters. Some non-limiting factors that influence training stability include the choice of the optimization algorithm (e.g. stochastic gradient descent, momentum, Adagrad, Adam, and so on), the learning rate, the initialization of network weights, the network architecture, the quality and pre-processing of the input data, and so on. Traditionally, to improve training stability, techniques such as batch normalization, gradient clipping, regularisation, and hyper-parameter tuning are applied. Finding an optimum approach using these techniques requires burdensome experimentation, hyper-parameter sweeps and ablation studies because, in many cases, an optimum approach to achieving training stability for one type of model, architecture, and data set may only work on that model, architecture and data set. A known regularisation technique is to introduce a Jacobian regularisation or penalty term to the loss function. A Jacobian regularisation term or penalty in the context of training neural networks is a method used to control or influence the behavior of the model by regularizing its sensitivity to input changes. The Jacobian matrix represents the partial derivatives of the
model's outputs with respect to its inputs, effectively capturing how changes in the input affect changes in the output. In matrix form, these partial derivatives are the network's Jacobian matrix. In order to control the behaviour of the model, a norm of the Jacobian matrix is calculated and added to the loss function. Thus, if a change in the input causes the network output to change significantly, the partial derivatives will have high values, the norm of the Jacobian matrix will be high, and thus the penalty added to the overall loss will be high. As training progresses and the network learns to minimise the overall loss, it learns weights which avoid producing a Jacobian matrix which, when the norm is calculated, produces a high value. In machine learning generally, temporal sequence modelling is a challenging problem. In the context of AI-based image and video compression, this challenge manifests itself in the compression of image or frame sequences of a video. In low-motion sequences where there is significant information redundancy across frames, the networks of an AI-based compression pipeline such as that of Figure 3 and Figure 14 would ideally learn to produce latent representations of inputs that contain only minimal information and are highly compressible, and then re-use information from previously decoded frames when reconstructing the input. Conversely, in high-motion sequences the latent representations contain more information. It turns out that teaching these behaviours is extremely challenging. For example, the networks of a pipeline that performs well on high-motion frame sequences will not perform well on low-motion frame sequences, and vice versa. In general terms, this can be understood as the networks not having learned that, when two images x_{t-1} and x_t of a sequence are the same or similar, they can produce a substantially empty latent representation
to encode into the bit stream when compressing x_t, because all the information from x_{t-1} can be re-used. Indeed, it turns out that flow-residual architectures such as that shown in Figure 3 and Figure 14 struggle to perform well on low-motion frame sequences, and this can be attributed to the challenge of modelling the temporal sequence of frames. Present concept 5 is directed to solving this problem by introducing a special type of Jacobian penalty term into the loss function. In more detail, let n ≥ 1 be an integer. Denote by S^{n−1} the (n − 1)-sphere. Denote by [n] the ordered set {1, 2, ..., n}. Let f := (f^{(1)}, ..., f^{(m)}) : R^n → R^m be an almost-everywhere continuously differentiable function whose domain
X ⊆ R^n is compact. The directional derivative of any component function f^{(i)} in the (unit) direction v ∈ S^{n−1} at a point x ∈ R^n is given by

∂_v[f^{(i)}](x) := lim_{ε↘0} ( f^{(i)}(x + εv) − f^{(i)}(x) ) / ε.

When v is a standard basis vector e^{(j)}, we call ∂_{e^{(j)}}[f^{(i)}](x) the partial derivative of f^{(i)} (in the direction e^{(j)}) (at the point x):

∂_{e^{(j)}}[f^{(i)}](x) = ⟨∇f^{(i)}(x), e^{(j)}⟩ = ∂f^{(i)}(x) / ∂x_j.

The Jacobian matrix of f at a point x is the matrix J[f](x) whose (i, j)-th element is

J[f](x)_{i,j} := ∂_{e^{(j)}}[f^{(i)}](x),   (i, j) ∈ [m] × [n].
Next, define the fixed point set of a mapping f restricted to a (sub-)space X by

Fix f := {x ∈ X : f(x) = x}.
Define a (discrete, Markovian) temporal sequence model as a sequence of tuples (x_t, f_t, y_t)_t, where t ≥ 0 is an integer and where y_t := f_t(x_t, x_{t−1}, y_{t−1}). We accommodate the case t = 0 as y_0 := f_0(x_0). Applying this approach to image and video compression, x_t may be a frame at time t, x_{t−1} may be a frame at time t − 1, y_{t−1} may be a previously reconstructed frame from time t − 1, and f_t may be the function corresponding to the neural networks of the AI-based compression pipeline that we are training. More specifically, the special t = 0 case may correspond to encoding/decoding an I-frame with no dependency on a frame at a different time, whereas all other times correspond to encoding/decoding a P-frame or B-frame at time t that has some dependency on a frame from a different time t − 1. Now, assume that f_1 = f_t for all t ≥ 2, and define I := f_0 and P := f_1. Take for example the temporal sequence (x_0, f_t, y_t)_t = (x_0, I, y_0), (x_0, P, y_1), (x_0, P, y_2), ... Observe that if P were a function of y_{t−1} only, then P would (favourably) exhibit perfect recovery if we had y_{t−1} = x_0 and it held that P(y_{t−1}) = P(x_0) = y_t = x_0. In other words, if P were a function of y_{t−1} only, then we would desire that the ground truth data x_0 be a fixed point of P. Applying this specifically to image and video compression, if P were a function of y_{t−1} only (i.e. there is no new information in the current frame x_t compared to the previously decoded frame y_{t−1}), then we would want the networks of the pipeline to perfectly reproduce the previously decoded frame y_{t−1} as their output when encoding/decoding the input x_t.
Our wish accordingly is to invoke known results on fixed point existence theorems that rely on smoothness characterisations of the relevant mapping. However, in the present context the map P is generally non-smooth and is possibly a composite representation of mappings, some of which have discrete domain, destroying applicability of such smoothness characterisations. Concept 5 is directed to overcoming this problem by introducing the assumption that P is a composition of two maps, D ∘ E, such that D is almost-everywhere continuously differentiable and whose domain contains a compact Borel set of positive measure. One could think of E as an "encoder" that maps a signal to its compressed or latent representation; and D as a "decoder" that maps a compressed or latent representation to a corresponding signal in the output domain. With this in mind, define the auxiliary map

Q : (ŷ_t, y_{t−1}) ↦ (ŷ_{t+1}, y_t) := (E(ŷ_t, x_t), D(ŷ_{t+1}, y_{t−1})).

In the setting of the exemplary temporal sequence, one has (x_t, x_{t−1}, y_{t−1}) = (x_0, x_0, y_{t−1}) ∈ Fix Q if y_{t−1} = y_t. Moreover, f exhibits favourable recovery if y_{t−1} = y_t ≈ x_0. In other words, if we apply the defined auxiliary map to a given input, the following behaviour is observed: if a current frame x_t has no motion or new information relative to a previous frame x_{t−1}, the reconstructed current frame y_t correctly corresponds exactly to the reconstructed previous frame y_{t−1}, satisfying our requirement that the networks should simply re-use previous information where there is no motion between frames, to perfectly reconstruct the previous information.
Thus we desire a method of promoting that Q admit such a fixed point, and in particular to have (x, x, x) ∈ Fix Q for all x ∼ D, where D is some data distribution of interest that is supported on X. Under suitable conditions it is guaranteed that if the Lipschitz constant of a function is sufficiently small, that function admits a fixed point on a given domain (e.g., the Banach fixed point theorem and Browder fixed point theorem are both admitted by settings relevant to the current continuous-space framing of the question under present consideration). We thereby make the following Ansatz: an appropriately regularised and sufficiently over-parametrised function P_θ := D_θ ∘ E_θ with weights θ admits a configuration θ_0, auxiliary mapping Q (defined as above) and sufficiently small parameter ε ≥ 0 (that may depend on parameter space complexity or other complexity factors) such that Lip Q ≤ 1 and, thereby,

E_{x∼D} dist((x, x, x), Fix Q) < ε,

where dist(x, A) is the distance (e.g., Euclidean distance) of the point x to the set A. Observe that ε encodes how well the "discovered" fixed point manifold Fix Q is adapted to the (unknown or incompletely characterised) data distribution D. In other words, P exists and can be learned such that its auxiliary mapping has a small Lipschitz constant and the fixed point set of the auxiliary mapping, induced thereby, significantly coincides with the points from the data distribution of interest. Thus, the training objective becomes the task of learning weights that produce a fixed point set for mappings that act on low-motion input frames, effectively
producing a network that acts like an identity operator on a previously decoded frame when the current frame is substantially the same as or similar to the previously decoded frame. Applying all of the above to temporal sequence modelling, such as that which occurs in (for example) AI-based image or video compression pipelines, our subsequent Ansatz is that this parameter θ_0 can be approximated with a suitable training regime. That is, the desired fixed point manifold can be well approximated by a deep-learning based temporal sequence model, permitting stable long-term temporal dependence modelling with recurrent implementations. In general terms, there exists a set of weights for which the network acts as an identity operator when two input frames x_t and x_{t−1} are identical. Recall that an L-Lipschitz function f that is differentiable at a point x_0 satisfies, for a unit direction vector v:

∂_v[f](x_0) ≤ L.

In particular, for simplicity assume that the domain X is compact and that f is everywhere continuous and almost everywhere continuously differentiable on X. Then, there exists a point x ∈ X such that

L = sup_{v ∈ S^{n−1}} ∂_v[f](x) = ∥J[f](x)∥

(where ∥ · ∥ denotes the operator norm). Accordingly, when f is sufficiently well behaved, controlling the operator norm of the Jacobian of f controls its Lipschitz constant, in turn permitting the opportunity for obtaining a function f with a fixed point manifold, and allowing us to train the network of the AI-based compression pipeline corresponding to this function f by including a Jacobian penalty in the loss function based on the auxiliary map defined above.
That is, a Jacobian penalty is calculated from an auxiliary function that is based on one or more of the components of the pipeline but modified so that (1) the input space matches the output space (that is, the input variables to the auxiliary function have the same form, shape and dimensions as the outputs) and (2) the input latent is passed through and returned by the function without amendment. To illustrate this at a practical level, consider the residual decoder of the compression pipeline of Figure 14. Our goal is to encourage the residual decoder to learn weights that perfectly reconstruct a previously decoded frame when the current frame is substantially identical (i.e. no motion between the frames of the sequence). The residual decoder of the actual pipeline receives as input a latent tensor and an optical flow information tensor (e.g. in the form of a warped current frame), and outputs a reconstructed image. In this case, the input space does not match the output space (i.e. the number of variables, the forms, shapes and dimensions of the inputs and outputs are different because the output is only the reconstructed frame and not the latent tensor). Thus, if we calculate a Jacobian penalty based on this (i.e. based on or approximated from a matrix of partial derivatives of how the output changes with respect to changes in the input), and add this Jacobian penalty to the loss function, the weights will not generally converge to values that produce our goal of a residual decoder that perfectly reconstructs the previously decoded frame when the current frame is substantially identical. Instead, in this illustrative example, we construct an auxiliary function that: (i) is based on the residual decoder, i.e. a function that operates identically to the residual decoder during a forward pass and accordingly similarly receives as inputs a latent tensor and an optical flow
information tensor in the form of a warped input image, but (ii) now not only returns the reconstructed image, but also returns as output the original input latent tensor. Thus we have a function where the input space (latent tensor and warped input image) matches the output space (latent tensor and reconstructed image). When we subsequently calculate the Jacobian penalty from this function, the latent tensor will appear as both a variable considered an input and a variable considered an output when the partial derivatives of the Jacobian matrix are being estimated or approximated. In layman’s terms, the effect of this is the moderation of significant changes to the latent tensor as sequences progress. In more formal terms, the above-described mathematical relationships apply and the weights converge to values during training that, during inference, exhibit the behaviour of perfectly reconstructing a previously decoded frame when the current frame is substantially identical. Accordingly, training an AI compression pipeline by using a Jacobian regularisation term based on an auxiliary function such as this, in which the input space matches the output space, encourages the learning of weights in which the networks are better able to reconstruct sequences of frames in a temporally consistent way. This illustrative example is given in the pseudocode below:
Algorithm 1 AI-based compression network training with residual decoder auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate γ, regularization parameter λ, number of epochs N, network architecture f_θ
Initialize network parameters θ
for epoch = 1 to N do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through auxiliary function to compute Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t)
        Compute loss Loss = D + λR + J_aux
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimizer O: θ ← O(θ, ∇_θ Loss, γ)
    end for
    Optionally evaluate on validation set
end for
That is, a training data set X is provided. A learning rate γ, a regularisation parameter λ, and a number of training steps or epochs N are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty. The backward pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. Optionally, a validation loss can be calculated, and the training loop is repeated until the predetermined number of steps or epochs N have been completed, or
some other criteria have been reached. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. As described above, the Jacobian penalty J_aux(ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t) is based on an auxiliary mapping function that includes the latent tensor ŷ_t on both sides of the mapping so that the input space matches the output space, thus resulting in convergence to a set of weights that exhibits the desired temporal stability for frame sequences. Note that, whilst the Jacobian penalty term is shown to be based on the mapping (ŷ_t, x̂_{t−1} ↦ ŷ_t, x̂_t), it is envisaged that each of these variables may be processed in some way without affecting the general applicability of the above-described mathematical properties of the input space matching the output space. For example, where the architecture of f_θ includes a flow module, and a flow estimation and/or a warped previously decoded image is used in place of x̂_{t−1}, then the mapping may be based on this warped image, e.g. x̂_{t−1,warped}, and so on. Finally, it will be appreciated that, whilst the above example has been described in the context of the residual decoder, in that the auxiliary function from which the Jacobian penalty is calculated is based on the residual decoder, the above-described principles can be extended to be more generally applicable to any of the networks in an AI-based compression pipeline in which encouraging temporal consistency is challenging.
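A hedged, illustrative sketch of such an auxiliary function and its Jacobian penalty is given below, assuming PyTorch and a hypothetical residual_decoder callable. The penalty is approximated here with a single random-direction vector-Jacobian product, which is merely one possible estimator of a Jacobian norm and is not necessarily the estimator used in the pipeline.

```python
import torch


def auxiliary_map(residual_decoder, latent, warped_prev):
    # Same forward pass as the residual decoder, but the latent is also passed
    # through unchanged so that the input space matches the output space.
    x_hat = residual_decoder(latent, warped_prev)
    return latent, x_hat


def auxiliary_jacobian_penalty(residual_decoder, latent, warped_prev):
    latent = latent.detach().requires_grad_(True)
    warped_prev = warped_prev.detach().requires_grad_(True)
    out_latent, x_hat = auxiliary_map(residual_decoder, latent, warped_prev)
    # Flatten the matched input/output spaces and pick a random unit direction.
    outputs = torch.cat([out_latent.flatten(), x_hat.flatten()])
    v = torch.randn_like(outputs)
    v = v / v.norm()
    # Vector-Jacobian product of the auxiliary map with respect to its inputs;
    # its squared norm is used as a differentiable proxy for the Jacobian norm.
    grads = torch.autograd.grad(outputs, [latent, warped_prev],
                                grad_outputs=v, create_graph=True)
    return sum(g.pow(2).sum() for g in grads)

# Illustrative usage inside a training step (names are assumptions):
# loss = distortion + rate + lam * auxiliary_jacobian_penalty(dec, y_hat, x_warped)
```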
For example, the algorithm described above can be extended to introduce a Jacobian penalty term based on any input-to-output mapping where one of the terms is included as both an input variable and an output variable. This generalisation is illustrated in the pseudocode provided below:
Algorithm 2 AI-based compression network training with general auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate α, regularisation parameter λ, number of epochs E, network architecture f_θ
Initialise network parameters θ
for epoch = 1 to E do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through the auxiliary function to compute the Jacobian penalty J_aux(a, b ↦ a, c)
        Compute loss: Loss = D + λ_1 R + λ_2 J_aux
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimiser O: θ ← O(θ, ∇_θ Loss, α)
    end for
    Optionally evaluate on a validation set
end for
Similarly to the specific residual decoder example, a training data set X is provided. A learning rate α, a regularisation parameter λ, and a number of training steps or epochs E are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty. Some or all of the terms may be regularised by some constant λ_1 and/or λ_2. The backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. A validation loss may then optionally be
calculated, and the training loop is repeated until the predetermined number of steps or epochs E has been reached, or some other stopping criterion has been met. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. In this general case, a, b and c can be any variables or sets of variables from an AI-based compression pipeline including, but not limited to, one or more of: a current input image x_t; a reference input image x_{t−1} or x_{t+1}; a reference previously reconstructed image (warped or unwarped) x̂_{t−1}, x̂_t or x̂_{t+1}, or x̂_{t−1,warped}, x̂_{t,warped} or x̂_{t+1,warped}; a flow f_t; and/or any latent representation thereof, and so
on, including the use of any of these variables in any combination in upsampled and downsampled spaces as applicable. For example, for low-motion frame sequences, such as static nature scenes or CCTV feeds, where it is typically expected that flow will not vary significantly between frames and where the networks are failing to converge to a set of weights with the desired behaviour of producing flow information that varies little between frames, we can construct an auxiliary mapping from which to estimate a Jacobian penalty by using, for example: J_aux(ŷ_{f,t}, x̂_{t−1} ↦ ŷ_{f,t}, x̂_t)
That is, the reconstructed flow latent is provided on both sides of the auxiliary mapping from which the Jacobian penalty is calculated. When the loss function then incorporates this Jacobian penalty term, the network weights converge to a set of values where flow is
encouraged to be reconstructed in a way that varies little across a sequence of frames. Effectively, the networks are learning the set of weights that have a fixed point for the desired auxiliary mapping behaviour. A similar approach can be taken for any and all of the example variables referred to above to encourage the networks of the AI-based compression pipeline to learn how to behave in a temporally consistent way for frame sequences. It is also envisaged that multiple different Jacobian penalties based on such mappings may be introduced into the loss term, for example J_aux,1, J_aux,2, ..., J_aux,n, as applicable to encourage the learning of temporally consistent behaviour by any number of components of the AI-based compression pipeline. Further, any weighted norm of any such Jacobian penalty or penalties may be incorporated into the AI-based compression pipeline. Specifically, a weighted norm of the Jacobian multiplies the components of the matrix using weights, computed in some manner relevant to the task at hand, such that the resulting "weight and norm" operation still satisfies the mathematical definition of being a (quasi-)norm. Such weights may be computed, for example, as a function of the amount of motion information that is present in the image, or according to metrics that define the presence of occlusion between two frames. The motion information may be captured indirectly in the flow information estimated by the flow module of the pipeline, or by some other measure such as, for example, a direct pixel difference calculation between the pixels of two or more images. Moving on to practical considerations, given the many different variables and mappings from which the Jacobian penalty described above may be estimated, the above-described approach
can increase network training times significantly, for example by 30% or more. To help address this, it is envisaged that the presently described Jacobian penalties may be estimated using the following approach. It will be understood that the Frobenius norm of a square matrix A ∈ R^{n×n} also controls its operator norm. Further, recall that Hutchinson's trace estimator (where z ∈ R^n with z_i ∼ iid N(0, 1)) is given by

tr(A) = E[z^⊤ A z] ≈ (1/ℓ) Σ_{k=1}^{ℓ} (ẑ^{(k)})^⊤ A ẑ^{(k)},

where ẑ^{(k)}, k = 1, ..., ℓ, are ℓ independent realisations of z. Now suppose that A := [∂f](x)^⊤ [∂f](x) for some function f, and observe that z^⊤ A z = ∥[∂f](x) z∥², so that the above estimator requires only Jacobian-vector products of f. Thus, in the standard set-up for minimising a loss function using stochastic gradient descent, suppose that on iteration i one has a batch x^{(i)} and a function f(·; θ^{(i)}) with parameters θ^{(i)}. Set ℓ = 1 to take a 1-sample approximation for the trace estimate, i.e.,

tr(A) = tr([∂f]^⊤ [∂f]) = ∥[∂f]∥²_F ≈ z^⊤ [∂f(·; θ^{(i)})](x^{(i)})^⊤ [∂f(·; θ^{(i)})](x^{(i)}) z.

Now the Jacobian-vector product may itself be approximated by a finite difference,

[∂f(·; θ^{(i)})](x^{(i)}) z ≈ ( f(x^{(i)} + h z; θ^{(i)}) − f(x^{(i)}; θ^{(i)}) ) / h,

for some small step size h > 0. Further details of the
step size estimation strategy are here omitted for brevity. Applying the above formal explanation to the specific example of calculating Jacobian penalties for an AI-based compression pipeline loss function, recall that the Jacobian penalty may be
calculated by estimating the partial derivatives of the network's outputs with respect to its inputs, which together form a Jacobian matrix. The penalty may be estimated by calculating the norm of this matrix (or the trace of a related matrix, as is known in the art). However, calculating the norm (or trace, as applicable) of the Jacobian matrix exactly is computationally very expensive. Instead, we use an approach based on Hutchinson's trace estimator and finite differences. In the specific case of AI-based compression pipelines, it turns out that we can obtain a good trace estimate by making a 1-sample approximation. Even though the sample size is only 1, the present inventors have found that the estimate is nevertheless accurate enough to estimate a Jacobian penalty that encourages the learning of weights with the desired behaviours described above. Accordingly, the 1-sample approximation using finite differences facilitates the estimation of a Jacobian penalty in a way that is significantly faster than traditional methods and facilitates training with multiple Jacobian penalties without significantly increasing training times. That is, the time attributed to estimating the Jacobian penalty is an insignificant fraction of the overall training time. This in turn enables training with Jacobian penalties applied to multiple components of the pipeline to encourage temporally stable behaviour while keeping overall training times substantially the same, thereby reducing overall cost per training run.
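A minimal sketch, in PyTorch, of how such a one-sample estimate with a finite-difference Jacobian-vector product could be computed is given below. Here f is assumed to be a callable wrapping the auxiliary mapping (taking and returning a single flattened tensor), and the fixed step size h is a simplification; the step size estimation strategy referred to above is not reproduced.

import torch

def jacobian_penalty_1sample(f, x, h=1e-2):
    # One-sample Hutchinson estimate of the squared Frobenius norm of the
    # Jacobian of f at x. The Jacobian-vector product is approximated by a
    # forward finite difference, so only two forward passes are required.
    z = torch.randn_like(x)              # z_i ~ N(0, 1)
    jvp = (f(x + h * z) - f(x)) / h      # approximates [df](x) z
    return (jvp ** 2).sum()              # one-sample estimate of ||J||_F^2

For example, the training loss could then be formed as Loss = D + λ_1 R + λ_2 jacobian_penalty_1sample(f, x), where f wraps the auxiliary mapping described above.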
Another practical consideration that is now discussed is how to ensure that the estimated Jacobian penalty or penalties are not over-penalising the loss when training on frame sequences where their introduction is not conducive to learning a suitable set of weights. For example, consider again the Jacobian penalty described above that is based on an auxiliary function derived from the residual decoder. As described above, the Jacobian penalty in this case encourages the learning of a set of weights that allow the near-perfect reconstruction of a previously decoded image when there is little or no movement between frames. However, when this approach is applied to very high-motion frame sequences, the behaviour we are encouraging with the Jacobian penalty results in a drop in accuracy. To overcome this issue, the present inventors have realised that the Jacobian penalty described above can itself be regularised based on a property of the frames of the sequence being trained on. That is, the Jacobian penalty may be made "motion-aware" by weighting it according to a property of the frames of the sequence being trained on, such as how much movement there is between frames. This movement may be captured indirectly in the Jacobian matrix from which the Jacobian penalty is calculated, and accordingly the penalty may comprise a weighted norm W(J_aux), where the mapping W may comprise e.g. a Frobenius norm, a spectral norm, or any other norm. Whereby, for example, the weighting may scale the Jacobian penalty based on the combined strength of all the partial derivatives it contains, which will be higher in high-motion frame sequences. Alternatively, the movement may be captured directly, and the mapping that encodes the weighting W of the Jacobian may be based on e.g. pixel differences, MSE, or some other measure of the differences between the frames at time t and some other time t − 1.
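As an illustration only, one simple way in which such a motion-aware weight could be computed directly from pixel differences is sketched below. The exponential form and the scale factor alpha are assumptions made for this sketch rather than features prescribed above.

import torch

def motion_weight(x_t, x_prev, alpha=10.0):
    # Scalar weight close to 1 when the two frames are nearly identical
    # (low motion), decaying towards 0 as the mean pixel difference grows,
    # so that the Jacobian penalty is dampened for high-motion content.
    mse = torch.mean((x_t - x_prev) ** 2)
    return torch.exp(-alpha * mse)

The weighted penalty would then enter the loss as, for example, Loss = D + λ_1 R + λ_2 motion_weight(x_t, x_prev) · J_aux.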
At a general level, the idea is that if the motion between two frames x_t and x_{t−1} is large, then we want the Jacobian penalty described above to be dampened to a lower value. Conversely, if the motion between two frames, two corresponding pixel regions within a frame, or any other information between which relative motion may exist, is small, then we do not want to dampen the Jacobian penalty associated with those frames, regions or pixels at all, and instead allow it to have the effects described above so as to learn the desired fixed-point weights for low-motion frames. This regularisation of the Jacobian penalty term may comprise a weighted norm or some other weighting and is illustrated in the pseudocode below:
Algorithm 3 AI-based compression network training with motion-aware auxiliary mapping Jacobian penalty
Inputs: Training dataset X, learning rate α, regularisation parameter λ, number of epochs E, network architecture f_θ
Initialise network parameters θ
for epoch = 1 to E do
    for each batch (x_{t−1}, x_t) in X do
        Forward pass to compute predictions x̂_t = f_θ(x_{t−1}, x_t), including the latent ŷ_t
        Forward pass through the auxiliary function to compute the Jacobian penalty J_aux(a, b ↦ a, c)
        Compute loss: Loss = D + λ_1 R + λ_2 W(J_aux)
        Backward pass to compute gradients ∇_θ Loss
        Update parameters with optimiser O: θ ← O(θ, ∇_θ Loss, α)
    end for
    Optionally evaluate on a validation set
end for
Similarly to the specific residual decoder example, a training data set X is provided. A learning rate α, a regularisation parameter λ, and a number of training steps or epochs E are selected. The network architecture of f_θ is defined, for example as shown in Figure 3 and Figure 14. The network parameters θ are randomly initialised and then the training loop is started. For each batch (x_{t−1}, x_t) in the training data X, the forward pass is computed for the network being trained f_θ, as well as for the auxiliary function. A Jacobian penalty J_aux(a, b ↦ a, c) is then estimated as described above, and the total Loss is calculated by combining a distortion term D, a rate term R, and the estimated Jacobian penalty, which in this case is weighted by W. As above, the loss terms may be regularised using e.g. λ_1 and/or λ_2. The backwards pass is then performed to compute gradients based on the loss, and the parameters θ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. Optionally,
a validation loss may be calculated, and the training loop is repeated until the predetermined number of steps or epochs E has been reached, or some other stopping criterion has been met. The learning rate, batch size, and/or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. As explained above, the Jacobian penalty here is weighted by a function W, where W is defined to act element-wise on the object returned by a Jacobian-vector product between the Jacobian matrix and some random vector (in practice a tensor), such that the weighted norm induced by W satisfies the mathematical definition of a quasi-norm or pseudo-norm. A further difficulty that arises when training using the above-described Jacobian penalty is its interaction with different phases of training and different training schedules. For example, while the present description has been framed in the context of training on sequences of frames, the specific number of frames in a given sequence or "group of pictures" (GOP) used during training can vary. Note that a GOP is typically considered to be an I-frame and N P- or B-frames. Consider, for example, a training schedule that starts training on short GOPs of 5-6 frames, and then after a predetermined number of steps switches to training on longer GOPs of 7 or more frames, e.g. 8, 9, 10, 20, 30, 40, 50 frames or more. The Jacobian penalty described herein is ideally suited for encouraging the networks to be performant on said longer GOPs, but can in some cases act as noise when performing the initial training on the shorter GOPs. This may be, for example, because temporal consistency is less of a problem for shorter GOPs, and so the Jacobian penalty actually weakens the strength of the training signal.
More generally, Jacobian regularisation serves as an inductive bias to improve the network's generalisation performance on sequences with GOP sizes that are significantly greater than those seen in training. Indeed, if the network is only evaluated on sequences with the same GOP size seen in training, for example short GOP sequences, then temporal stability can be less problematic. However, it follows that, for temporal stability arising solely from training on different GOP sizes to be present in the wild, the training data necessary to achieve this would have to contain equal numbers of samples of all GOP sizes distributed equally across all video sequences, and so on - something which is burdensome to obtain in real-world settings. The presently described approach accordingly facilitates obtaining the same effect without needing as complete a training data set. Indeed, it becomes computationally prohibitive to train on very large GOP sizes, so even if a complete training data set comprising large samples of all GOP sizes encountered in the wild were available, it may not be practical to train on all of the data. Accordingly, the present disclosure facilitates the generalisation of performance, in the sense of temporal stability, to large GOP sizes, specifically those that are significantly larger than what is seen during training, irrespective of what GOP sizes are in the training data. To overcome this problem, it is envisaged that the presently described Jacobian penalty term may be introduced only at a predetermined time or times (e.g. in terms of number of training steps, or consequential to changing the training frame sequence length) during a training schedule. Further, the Jacobian penalty term may also be removed at a predetermined time or times for a similar reason or reasons, as may be applicable for a given training schedule. This adaptive approach thus enables the advantages of the above-described Jacobian penalty term to be realised in a flexible way that fits in with any existing training schedules.
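A minimal sketch of such a schedule, in which the Jacobian penalty is only switched on after a predetermined number of training steps (and optionally switched off again later), is given below; the particular step thresholds are placeholders chosen for illustration.

def jacobian_penalty_weight(step, start_step=50000, end_step=None, lam=1.0):
    # Weight applied to the Jacobian penalty at a given training step: zero
    # before start_step (e.g. while training on short GOPs), lam once longer
    # GOPs are introduced, and optionally zero again after end_step.
    if step < start_step:
        return 0.0
    if end_step is not None and step >= end_step:
        return 0.0
    return lam

Within the training loop, the loss would then be formed as, for example, Loss = D + λ_1 R + jacobian_penalty_weight(step) · J_aux.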
Figure 14 illustrates a further example of a flow-residual compression pipeline, such as that of Figure 3, whereby the representation of flow information that the residual decoder receives as input comprises a warped previously decoded image. As this architecture corresponds generally to the flow-residual compression pipeline shown in Figure 3, it uses the same reference numbers for corresponding features. The flow module 1400 is shown to comprise a flow encoder 1401 that produces a latent representation of optical flow information ŷ_{f,t}, which is decoded by a flow decoder 1402 into a reconstructed flow f̂_t, which in turn is used to warp a previously reconstructed image x̂_{t−1} to produce x̂_{t−1,w}, which in turn is fed into the residual decoder 1413 as the representation of optical flow information. Thus the residual decoder neural network 1413 uses a latent representation ŷ_t of the current frame x_t and the warped previously reconstructed image x̂_{t−1,w} to produce the reconstructed current frame x̂_t. Thus, based on this architecture, the Jacobian penalty term described above may be implemented by constructing an auxiliary function that copies the operations of the residual encoder 1411 and/or residual decoder 1413 of the residual part 1410, including using the same inputs as the residual decoder 1413, but where the auxiliary function also returns as an output the latent representation ŷ_t that was one of its inputs, effectively simply passing the latent representation ŷ_t 1412 directly through the function. The Jacobian penalty based on this auxiliary function, with the latent being both an input and an output, has the desired mathematical properties to encourage convergence to a set of weights that produce a network that behaves in a temporally stable manner for sequences of frames. While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular
inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable sub-combination. Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that
is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal. The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in
multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Computers suitable for the execution of a computer program can be based on, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a mobile phone, a tablet computer, a notebook computer, a music player, an e-book
reader, a laptop or desktop computer, a PDA, a smart phone, or other stationary or portable devices that include one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet. The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of
client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Claims
2. The method of claim 1, wherein the upsampler comprises a third neural network, and wherein the method comprises updating the parameters of the third neural network based on the evaluated function. 3. The method of claim 2, wherein the downsampler comprises a bilinear or bicubic downsampler. 4. The method of claim 2 or 3, wherein the downsampler comprises a Gaussian blur filter. 5. The method of any of claims 2 to 4, comprising (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the third neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said upsampling and downsampling, and said updating of the parameters of the third neural network for a second number of said steps. 6. The method of claim 1, wherein the downsampler comprises a fourth neural network, and wherein the method comprises updating the parameters of the fourth neural network based on the evaluated function. 7. The method of claim 5, wherein the upsampler comprises a bilinear or bicubic upsampler. 8. The method of claim 6 or 7, comprising (i) updating the parameters of the first neural network and the second neural network based on the evaluated function for a first number of
said steps to produce the first and second trained neural networks without performing said upsampling and downsampling and without updating the parameters of the fourth neural network, (ii) freezing the parameters of the first and second neural networks after said first number of steps, and performing said downsampling and said updating of the parameters of the fourth neural network for a second number of said steps. 9. The method of any of claims 2 to 8, wherein the method comprises entropy encoding the latent representation into a bitstream having a length, wherein the function is further based on said bitstream length, and wherein said updating the parameters of the third or fourth neural network is based on the evaluated function based on the bitstream length. 10. The method of any of claims 1 to 9, wherein the difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image is determined based on the output of a fifth neural network acting as a discriminator. 11. The method of any of claims 1 to 10, wherein the said difference between one or more of: the output image and the input image, the output image and the downsampled input image, the upsampled output image and the input image, and/or the upsampled output image and the downsampled input image, comprises a mean squared error (MSE) and/or a structural similarity index measure (SSIM).
12. The method of any of claims 1 to 11, wherein the function further comprises a term defining a visual perceptual metric. 13. The method of claim 12, wherein the term defining a visual perceptual metric comprises a MS-SIM metric. 14. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; upsampling the output image with an upsampler to produce an upsampled output image, the upsampler comprising a third neural network; evaluating a function based on a difference between one or both of: the output image and the input image, and/or the upsampled output image and the input image; updating the parameters of the third neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. 15. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; downsampling the input image with a downsampler to produce a downsampled input image, the downsampler comprising a fourth neural network; encoding the downsampled input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; evaluating a function based on a difference between one or more of: the output image and the input image, the output image and the downsampled input image and/or the input image and a previously downsampled input image; updating the parameters of the fourth neural network based on the evaluated function; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network. 16. The method of claim 15, comprising producing the previously downsampled input image by performing bilinear or bicubic downsampling on the input image. 17. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; downsampling the input image with a downsampler; encoding the downsampled input image using a first trained neural network to produce a
latent representation; transmitting the latent representation to a second computer system; decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; and upsampling the output image with an upsampler to produce an upsampled output image. 18. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; transmitting the latent representation to a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein one or more of the above steps comprises performing a downsampling or upsampling operation, and wherein the downsampling or upsampling operation comprises performing one or more convolution operations without performing a space-to-depth or depth-to-space operation. 19. The method of claim 18, comprising performing the downsampling or upsampling operation on a CPU without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time.
20. The method of claim 18, comprising performing the downsampling or upsampling operation on a neural accelerator without performing a space-to-depth or depth-to-space operation, and wherein said downsampling or upsampling is configured to be performed in real-time or near real-time. 21. The method of claim 19 or 20, wherein the downsampling operation comprises applying one or more convolutional layers with a kernel size based on a downsampling factor, and wherein the convolutional layers are configured to sequentially reduce the spatial dimensions of an input to the series of convolutional layers while increasing the depth or channel dimension of the input. 22. The method of claim 21, wherein the input comprises said input image. 23. The method of claim 21, wherein the input comprises a tensor representation of the input image. 24. The method of claim 19 or 20, wherein the downsampling operation comprises applying one or more convolutional layers configured with a stride equal to the downsampling factor, and wherein the number of filters in each convolutional layer is based on the original number of channels in the input tensor and on the downsampling factor. 25. The method of claim 19 or 20, wherein the upsampling operation comprises applying sequential convolutional layers and upsampling layers, wherein the convolutional layers
are configured to progressively increase the spatial dimensions and decrease the channel dimensions of an input to said layers. 26. The method of claim 25, wherein the input comprises the latent representation. 27. The method of claim 25, wherein the input comprises a tensor representation of the latent representation or the output image. 28. The method of claim 25, wherein the convolutional layers for upsampling are configured with a stride based on an upsampling factor. 29. The method of claim 25, further comprising applying an activation function after each convolutional layer in the upsampling operation. 30. The method of claim 25, wherein the upsampling layers are selected from a group consisting of nearest neighbor upsampling, bilinear upsampling, and bicubic upsampling, and are alternated with the convolutional layers. 31. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system;
decoding the optical flow information using a second neural network;and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image; wherein estimating the optical flow information comprises estimating differences between the input image and second image by: applying a first convolution operation on respective pixels of a representation of the input image and/or on respective pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. 32. The method of claim 31, comprising estimating a compressively encoded cost volume indicative of said differences by applying said first convolution operation. 33. The method of claim 31, wherein the first convolution substantially preserves a norm of a distribution of the respective pixels of the representation of the input image and/or respective pixels of the representation of the second image. 34. The method of any of claims 31 to 33, wherein a distribution of values of pixels of the representation of the input image and/or the distribution of values of pixels of the representation of the second image are sparse distributions in a spatial domain of the representation of the input image and/or the second image. 35. The method of claim 34, wherein the weights have values distributed according to a sub-Gaussian distribution.
36. The method of claim 34 or 35, wherein the minimum value and/or the maximum value are based on a number of channels of the input image and/or second image, on a kernel size of the first convolution operation, and/or on a pixel radius across which said differences are estimated. 37. The method of any of claims 31 to 36, comprising performing a second convolution operation on an output of the first convolution operation, wherein the second convolution operation substantially preserves a norm of a distribution of said output of the first convolution operation. 38. The method of any of claims 31 to 37, comprising estimating a difference between an output of the second convolution operation and an output of the first convolution operation. 39. The method of claim 38, wherein said difference comprises an absolute difference. 40. The method of claim 39, wherein said difference defines a cost volume. 41. The method of any of claims 31 to 40, comprising using the optical flow information to warp a representation of the second image. 42. The method of claim 41, comprising estimating a difference between the warped second image and the input image to generate a residual representation of the input image with respect to the warped second image.
43. The method of claim 42, comprising: (i) using a third neural network to encode the residual representation of the input image, (ii) transmitting the encoded residual representation of the input image to the second computer system, (iii) using a fourth neural network to decode the residual representation of the input image, and (iv) using the decoded residual representation of the input image to produce said output image. 44. The method of any of claims 37 to 43, comprising applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. 45. The method of any of claims 31 to 44, wherein a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. 46. The method of any of claims 31 to 45, wherein the first convolution operation is defined by a 1x1 kernel. 47. The method of any of claims 31 to 46, wherein the second convolution operation is defined by a 3x3 kernel. 48. The method of any of claims 44 to 47, wherein the third convolution operation is defined by a 1x1 kernel. 49. The method of any of claims 31 to 48, wherein performing the second convolution operation entangles information associated with respective pixels of the representation of
the input image with information associated with pixels adjacent corresponding pixels in the representation of the second image. 50. The method of any of claims 31 to 49, wherein the first, second and, where present, third convolution operations are performed without grouped convolutions. 51. The method of claim 50, wherein one or more outputs of the first, second and/or third convolution operation are stored in contiguous memory blocks, and wherein said estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. 52. The method of any of claims 31 to 51, wherein a distribution of pixel values of the input image and of the second image comprises a sparse distribution, and wherein the sparse distribution is incoherent in a spatial domain. 53. A data processing system configured to perform the method of any one of claims 31 to 52. 54. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image; transmitting the optical flow information to a second computer system; wherein estimating the optical flow information comprises estimating differences between
the input image and second image by: applying a first convolution operation on respective pixels of a representation of the input image and/or on respective pixels of a representation of the second image, wherein the convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum and a maximum value. 55. A method for lossy image or video decoding, the method comprising the steps of: receiving an input image and a second image at a first computer system; estimating optical flow information using the second image and the input image using a first neural network, the optical flow information being indicative of a difference between a representation of the second image and a representation of the input image and based on a compressively encoded cost volume; at a second computer system, receiving optical flow information, the optical flow information being indicative of a difference between a representation of a second image and a representation of an input image; decoding the optical flow information using a second neural network; and using the second image and the decoded optical flow information to produce an output image, wherein the output image is an approximation of the input image. 56. A data processing apparatus configured to perform the method of claims 54 or 55. 57. A method of estimating a difference between a first image and a second image, the method comprising: performing a first convolution operation on respective pixels of a representation of the
first image and on respective pixels of a representation of the second image; and estimating a difference between the first image and the second image based on one or more outputs of the first convolution operation on the first and second images by: estimating a compressively encoded cost volume indicative of said differences. 58. The method of claim 57, comprising performing a second convolution operation on an output of the first convolution operation; and estimating a difference between an output of the second convolution operation and the first convolution operation. 59. The method of claim 58, wherein performing the second convolution operation entangles information associated with respective pixels of the representation of the first image with information associated with pixels adjacent corresponding pixels in the representation of the second image. 60. The method of any of claims 57 to 59, wherein the first convolution operation comprises applying one or more filters comprising weights having values randomly distributed between a minimum value and a maximum value. 61. The method of claim 60, wherein the minimum value and/or the maximum value are based on a number of channels of the input image and/or second image, on a kernel size of the first convolution operation, and/or on a pixel radius across which said differences are estimated. 62. The method of any of claims 57 to 61, wherein said difference comprises an absolute difference. 63. The method of any of claims 57 to 62, wherein said difference defines a cost volume.
64. The method of any of claims 57 to 63, comprising applying a third convolution operation to an output of the first convolution operation and/or to an output of the second convolution operation. 65. The method of any of claims 57 to 64, wherein a kernel size of the second convolution operation is greater than a kernel size of the first convolution operation. 66. The method of any of claims 57 to 65, wherein the first convolution operation is defined by a 1x1 kernel. 67. The method of any of claims 57 to 66, wherein the second convolution operation is defined by a 3x3 kernel. 68. The method of any of claims 57 to 67, wherein the third convolution operation is defined by a 1x1 kernel. 69. The method of any of claims 57 to 68, comprising storing a plurality of said respective outputs of the first, second and/or third convolution operations in contiguous memory blocks, and wherein said estimating a difference comprises retrieving said stored outputs from said contiguous memory blocks. 70. The method of any of claims 57 to 69, comprising using said difference to identify one or more pixel patches in the second image as movement-containing pixel patches, and generating a bounding box around one or more of said movement-containing pixel patches.
71. A data processing apparatus configured to perform the method of claims 57 to 70. 72. A method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; evaluating a function based on a difference between the first image and the second image, the function comprising a Jacobian penalty term; updating the parameters of the first, second and/or third neural networks based on the evaluated function; and repeating the above steps using a first set of input images to produce first, second and/or third trained neural networks. 73. The method of claim 72, wherein the Jacobian penalty term is based on a rate of change of one or more first variables with respect to one or more second variables, the first variables and second variables selected from inputs and/or outputs associated with the one or more neural networks. 74. The method of claim 73, wherein at least one input and/or output associated with the one or more neural networks is both a first variable and a second variable. 75. The method of any of claims 73 to 74, comprising producing the second variables from the first variables by mapping the first variables to the second variables. 76. The method of claim 75, wherein the mapping is defined by an auxiliary function. 77. The method of claim 76, wherein the first variables are inputs to the auxiliary function and the second variables are outputs of the auxiliary function. 78. The method of claim 77, wherein at least one input of said inputs to the auxiliary function is also an output of the auxiliary function. 79. The method of claim 78, wherein the inputs of said mapping are defined in an input space, and the outputs of said mapping are defined in an output space, and wherein the auxiliary function maps the input space to the output space. 80. The method of claim 79, wherein the input space matches the output space. 81. The method of any of claims 76 to 80, wherein the auxiliary function is based on the third neural network.
82. The method of claim 81, wherein the third neural network comprises a residual decoder neural network. 83. The method of any of claims 78 to 82, wherein the at least one input to the auxiliary function that is an output of the auxiliary function comprises said latent representation of the first image. 84. The method of any of claims 72 to 83, comprising weighting the Jacobian penalty term. 85. The method of claim 84, wherein said weighting is based on a difference between the first image and the second image. 86. The method of claim 84 when dependent on claim 73, wherein said weighting is defined by a weighted norm based on a matrix associated with said rate of change. 87. The method of any of claims 73 to 86 when dependent on claim 83, comprising estimating the Jacobian penalty term by approximating a norm of a matrix associated with said rate of change. 88. The method of claim 87, wherein approximating the norm of the matrix comprises making a single sample approximation. 89. The method of any of claims 72 to 88, wherein the method comprises introducing the Jacobian penalty term into said function after a first number of said repeated steps.
90. The method of claim 89, wherein said first number of said repeated steps is based on a GOP-size of one or more frame sequences in said first set of input images. 91. A method of performing lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information, the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the first neural network, the second neural network, and the third neural network are produced according to the method of any of claims 72 to 90. 92. A method of performing lossy image or video encoding and transmission, the method comprising the steps of: receiving a first image and a second image at a first computer system; with a first neural network, producing a latent representation of optical flow information,
the optical flow information being indicative of a difference between the first image and the second image; transmitting the latent representation of optical flow information to a second computer system; wherein the first neural network is produced according to the method of any of claims 72 to 90. 93. A method of performing lossy image decoding, the method comprising the steps of: receiving a latent representation of optical flow information at a second computer system, the optical flow information being indicative of a difference between a first image and a second image; with a second neural network, decoding the latent representation of optical flow information to produce an approximation of the optical flow information; with a third neural network, producing an output image using the optical flow information and using a representation of the first image, wherein the output image is an approximation of the first image; wherein the second neural network and the third neural network are produced according to the method of any of claims 72 to 90. 94. A method of performing lossy image or video decoding, the method comprising the steps of: with a second neural network, at a second computer system, decoding a latent representation to produce a first output image, wherein the first output image is an approximation of
one image of an image pair of a first sequence of input images; repeating the above step to produce a first sequence of output images, the first sequence of output images being an approximation of the first sequence of input images; wherein the second neural network is produced according to the method of any of claims 72 to 90. 95. A data processing apparatus configured to perform the method of any of claims 72 to 94. 96. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 72 to 94. 97. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 72 to 94.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GBGB2318468.2A GB202318468D0 (en) | 2023-12-04 | 2023-12-04 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2318468.2 | 2023-12-04 | ||
| GB2400790.8 | 2024-01-22 | ||
| GBGB2400790.8A GB202400790D0 (en) | 2024-01-22 | 2024-01-22 | Method and data processing system for lossy image or video encoding, transmission and decoding |
| GB2404639.3 | 2024-04-02 | ||
| GBGB2404639.3A GB202404639D0 (en) | 2024-04-02 | 2024-04-02 | Method and data processing system for lossy image or video encoding, transmission and decoding |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025119707A1 true WO2025119707A1 (en) | 2025-06-12 |
Family
ID=93841020
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/083641 Pending WO2025119707A1 (en) | 2023-12-04 | 2024-11-26 | Method and data processing system for lossy image or video encoding, transmission and decoding |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025119707A1 (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| WO2022106013A1 (en) * | 2020-11-19 | 2022-05-27 | Huawei Technologies Co., Ltd. | Method for chroma subsampled formats handling in machine-learning-based picture coding |
-
2024
- 2024-11-26 WO PCT/EP2024/083641 patent/WO2025119707A1/en active Pending
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2021220008A1 (en) | 2020-04-29 | 2021-11-04 | Deep Render Ltd | Image compression and decoding, video compression and decoding: methods and systems |
| WO2022106013A1 (en) * | 2020-11-19 | 2022-05-27 | Huawei Technologies Co., Ltd. | Method for chroma subsampled formats handling in machine-learning-based picture coding |
Non-Patent Citations (6)
| Title |
|---|
| Agustsson, E., Minnen, D., Johnston, N., Ballé, J., Hwang, S. J., Toderici, G.: "Scale-space flow for end-to-end optimized video compression", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pages 8503-8512 |
| Ballé, Johannes, et al.: "Variational image compression with a scale hyperprior", arXiv preprint arXiv:1802.01436, 2018 |
| Chen, Li-Heng, et al.: "Estimating the Resize Parameter in End-to-end Learned Image Compression", arXiv.org, Cornell University Library, 26 April 2022 (2022-04-26), XP091209110 * |
| Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., Toderici, G.: "Neural video compression using GANs for detail synthesis and propagation", Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, November 2022, pages 562-578 |
| Mentzer, F., Agustsson, E., Ballé, J., Minnen, D., Johnston, N., Toderici, G., Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI, November 2022, pages 562-578 |
| Pourreza, R., Cohen, T.: "Extending neural P-frame codecs for B-frame coding", Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pages 6680-6689 |
Similar Documents
| Publication | Title |
|---|---|
| US11234006B2 (en) | Training end-to-end video processes |
| TWI834087B (en) | Method and apparatus for reconstruct image from bitstreams and encoding image into bitstreams, and computer program product |
| US10701394B1 (en) | Real-time video super-resolution with spatio-temporal networks and motion compensation |
| US10582205B2 (en) | Enhancing visual data using strided convolutions |
| TW202247650A (en) | Implicit image and video compression using machine learning systems |
| US8396313B2 (en) | Image compression and decompression using the PIXON method |
| Cheng et al. | Optimizing image compression via joint learning with denoising |
| US11936866B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| US12113985B2 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| KR20250078907A (en) | Diffusion-based data compression |
| KR20250002528A (en) | Parallel processing of image regions using neural networks - decoding, post-filtering, and RDOQ |
| GB2548749A (en) | Online training of hierarchical algorithms |
| WO2024170794A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025082896A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding using image comparisons and machine learning |
| WO2025119707A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| Ayyoubzadeh et al. | Lossless compression of mosaic images with convolutional neural network prediction |
| US20250227311A1 (en) | Method and compression framework with post-processing for machine vision |
| WO2025196024A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025168485A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| Ehrlich | The first principles of deep learning and compression |
| WO2025061586A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025088034A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2024246275A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025210218A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
| WO2025252644A1 (en) | Method and data processing system for lossy image or video encoding, transmission and decoding |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24820272; Country of ref document: EP; Kind code of ref document: A1 |