
TW202446075A - Efficient warping-based neural video coder

Efficient warping-based neural video coder

Info

Publication number
TW202446075A
Authority
TW
Taiwan
Prior art keywords
video data
block
data
entropy
video
Application number
TW113107964A
Other languages
Chinese (zh)
Inventor
羅森戴爾 泰斯 查漢 凡
黃 功 明 黎
圖沙爾 辛哈爾
阿米爾 塞德
克里希納 布斯卡
古拉麥 康瑞德 索堤爾
安如曼 拉哈
奧克 約里斯 維格斯
弗蘭克 史蒂芬 馬耶爾
梁 張
阿布西吉特 寇巴爾
穆拉利達爾 雷迪 阿庫拉
Original Assignee
Qualcomm Incorporated (US)
Priority claimed from US18/457,079 (US patent US12501050B2)
Application filed by Qualcomm Incorporated (US)
Publication of TW202446075A


Classifications

    • H04N19/513: Processing of motion vectors (under H04N19/51, motion estimation or motion compensation, and H04N19/503, predictive coding involving temporal prediction)
    • G06N3/045: Combinations of networks (neural network architectures, e.g. interconnection topology)
    • G06N3/0495: Quantised networks; sparse networks; compressed networks
    • H04N19/124: Quantisation (adaptive coding)
    • H04N19/186: Adaptive coding in which the coding unit is a colour or a chrominance component
    • H04N19/436: Implementation details or hardware specially adapted for video compression or decompression, using parallelised computational arrangements
    • H04N19/51: Motion estimation or motion compensation
    • H04N19/587: Predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H04N19/91: Entropy coding, e.g. variable length coding [VLC] or arithmetic coding


Abstract

An example computing device may include memory and one or more processors. The one or more processors may be configured to parallel entropy decode encoded video data from a received bitstream to generate entropy decoded data. The one or more processors may be configured to predict a motion vector based on the entropy decoded data. The one or more processors may be configured to decode a motion vector residual from the entropy decoded data. The one or more processors may be configured to add the motion vector residual and the predicted motion vector. The one or more processors may be configured to warp previously reconstructed video data with an overlapped block-based warp function using the motion vector to generate predicted current video data. The one or more processors may be configured to sum the predicted current video data with a residual block to generate currently reconstructed video data.

Description

Efficient Warping-Based Neural Video Codec

This application claims the benefit of U.S. Provisional Application No. 63/489,306, filed March 9, 2023, and U.S. Provisional Application No. 63/497,411, filed April 20, 2023, the entire contents of each of which are incorporated herein by reference.

This disclosure relates to video encoding and decoding, including the encoding and decoding of image and video data.

Digital media capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called "smart phones," video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video coding techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4 Part 10 Advanced Video Coding (AVC), ITU-T H.265/High Efficiency Video Coding (HEVC), ITU-T H.266/Versatile Video Coding (VVC), and extensions of such standards, as well as proprietary video codecs/formats such as AOMedia Video 1 (AV1), developed by the Alliance for Open Media. Video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video coding techniques.

In general, this disclosure describes techniques for media compression, including techniques for video and/or image encoding and decoding. Neural network-based media (e.g., image and/or video) compression methods can compete with current standards and provide several advantages. Neural-based coding techniques are typically designed and tested using high-precision floating-point arithmetic. However, as neural network-based media compression moves into practical implementation and deployment, neural network weights and activations are often quantized and represented with low-precision integers rather than high-precision floating-point numbers, in order to improve speed and power consumption.

This disclosure addresses problems that occur when quantizing the neural network variables associated with entropy coding. These variables can be critical to the design of neural-based video/image compression schemes, because they define the compression efficiency. Moreover, general-purpose tools for optimizing quantization in neural networks do not account for the very specific properties of entropy coding variables. Testing shows that the worst quantization effects can occur on some of the most common use cases, and that the resulting losses may not be recoverable by retraining the neural network.

This disclosure describes techniques for optimizing the definition of trained entropy coding variables so that the information most important for efficient entropy coding is best preserved when represented with low-precision integers. Testing also shows how the techniques described herein can reduce or minimize the amount of memory required for entropy coding. This disclosure describes a general approach to entropy coding design, as well as specific solutions and implementations for the commonly used Gaussian distribution. The techniques of this disclosure can generally be applied to any neural-based compression technique, but the examples described below focus on techniques for image and video compression.

In one example, a device for decoding video data includes: memory for storing video data, the video data including previously reconstructed video data and currently reconstructed video data; and one or more processors configured to: parallel entropy decode encoded video data from a received bitstream to generate entropy decoded data; predict a block-based motion vector based on the entropy decoded data to generate a predicted motion vector; decode a motion vector residual from the entropy decoded data; add the motion vector residual to the predicted motion vector to generate the block-based motion vector; warp the previously reconstructed video data with an overlapped block-based warp function using the block-based motion vector to generate predicted current video data; and sum the predicted current video data with a residual block to generate the currently reconstructed video data.

In another example, a method of decoding video data includes: parallel entropy decoding encoded video data from a received bitstream to generate entropy decoded data; predicting a block-based motion vector based on the entropy decoded data to generate a predicted motion vector; decoding a motion vector residual from the entropy decoded data; adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; warping previously reconstructed video data with an overlapped block-based warp function using the block-based motion vector to generate predicted current video data; and summing the predicted current video data with a residual block to generate currently reconstructed video data.

In another example, a device for decoding video data includes: means for parallel entropy decoding encoded video data from a received bitstream to generate entropy decoded data; means for predicting a block-based motion vector based on the entropy decoded data to generate a predicted motion vector; means for decoding a motion vector residual from the entropy decoded data; means for adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; means for warping previously reconstructed video data with an overlapped block-based warp function using the block-based motion vector to generate predicted current video data; and means for summing the predicted current video data with a residual block to generate currently reconstructed video data.

In another example, a computer-readable storage medium is encoded with instructions that, when executed, cause a programmable processor to: parallel entropy decode encoded video data from a received bitstream to generate entropy decoded data; predict a block-based motion vector based on the entropy decoded data to generate a predicted motion vector; decode a motion vector residual from the entropy decoded data; add the motion vector residual to the predicted motion vector to generate the block-based motion vector; warp previously reconstructed video data with an overlapped block-based warp function using the block-based motion vector to generate predicted current video data; and sum the predicted current video data with a residual block to generate currently reconstructed video data.
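To make the shared pipeline of the examples above concrete, the following is a minimal Python sketch of one frame decode. All names here (entropy_decoder, mv_predictor, warp_fn, residual_decoder, and the dictionary keys) are hypothetical placeholders rather than APIs defined by this disclosure:

```python
def decode_frame(bitstream, prev_recon, entropy_decoder, mv_predictor,
                 warp_fn, residual_decoder):
    """One P-frame decode, following the steps in the examples above.
    All callables and keys are hypothetical placeholders."""
    # 1. Parallel entropy decode the received bitstream.
    decoded = entropy_decoder(bitstream)
    # 2. Predict a block-based motion vector field from the decoded data.
    mv_pred = mv_predictor(decoded, prev_recon)
    # 3. Decode the motion vector residual and add it to the prediction.
    mv = mv_pred + decoded["mv_residual"]
    # 4. Warp the previous reconstruction with an overlapped block-based warp.
    predicted = warp_fn(prev_recon, mv)
    # 5. Sum the prediction with the decoded residual block.
    return predicted + residual_decoder(decoded["residual_latent"])
```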

The details of one or more examples of this disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of this disclosure will be apparent from the description, the drawings, and the claims.

This disclosure describes techniques for encoding and decoding media data (e.g., images or video) using neural network-based media coding techniques. In particular, this disclosure describes techniques for decoding encoded media data using warping, including neural network-based media coding that uses block-based warping. Example techniques of this disclosure include a 1080p YUV420 architecture, prediction modeling for improved compression performance, quantization-aware training, parallel entropy coding (e.g., on a GPU), and/or pipelined inferencing. The techniques of this disclosure can improve the performance of neural network-based media codecs. Such improved neural network-based media codecs can be used in battery-powered devices, such as mobile devices (e.g., smartphones).

Neural video codecs have recently become competitive with standard codecs (e.g., HEVC) in low-latency settings. However, most neural codecs are large floating-point networks that use pixel-dense warping operations for temporal modeling, making them too computationally intensive for deployment on mobile devices. This disclosure describes techniques for adapting a strong neural encoder/decoder (codec) for mobile deployment.

Existing neural compression models for video coding exhibit relatively good (e.g., high-quality) compression performance, but are relatively expensive to run computationally, especially on battery-powered devices such as mobile devices. There is therefore a need for a media codec, such as a video codec, that has a relatively low computational footprint while still providing relatively good compression performance.

In accordance with the techniques of this disclosure, a relatively efficient model architecture is provided for 1080p YUV420 video. The architecture uses a prediction model to improve compression performance. The video codec may employ quantization-aware training and efficient block-based warping. The video codec may perform parallel entropy coding on a GPU (e.g., in addition to or instead of entropy coding on a CPU). In some examples, the video codec may use pipelined inference to reach a throughput of more than 30 fps on 1080x2048 video frames.
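As an illustration of pipelined inference, the sketch below shows an assumed structure, not the disclosed implementation: each decoding stage runs on its own worker and hands frames to the next stage through a bounded queue, so consecutive frames are processed concurrently and throughput approaches that of the slowest stage rather than the sum of all stages. The input queue is assumed to be terminated with None:

```python
import queue
import threading

def pipelined_decode(frames_in, stages):
    """Run each stage (e.g., entropy decoding, motion network, residual
    network) on its own worker thread, connected by bounded queues.
    `frames_in` is a queue.Queue of inputs terminated by None."""
    qs = [frames_in] + [queue.Queue(maxsize=2) for _ in stages]

    def run(stage, q_in, q_out):
        while (item := q_in.get()) is not None:
            q_out.put(stage(item))
        q_out.put(None)  # propagate the end-of-stream marker

    workers = [threading.Thread(target=run, args=(s, qs[i], qs[i + 1]))
               for i, s in enumerate(stages)]
    for w in workers:
        w.start()
    out = []
    while (result := qs[-1].get()) is not None:
        out.append(result)
    for w in workers:
        w.join()
    return out
```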

Specifically, this disclosure describes techniques for performing low-precision weight and activation quantization in a mean-scale hyperprior model. This disclosure also describes an efficient overlapped block motion compensation technique that can replace pixel-dense warping on a neural accelerator. The techniques of this disclosure can provide a codec that may outperform other practical neural codecs by a relatively large margin, with up to 40% Bjontegaard-Delta (BD) rate savings, while reducing the receiver-side FLOP count to about 1/9. This complexity reduction allows scaling to real-time full-HD video decoding on device. Furthermore, the resulting codec can operate in the perceptually relevant YUV color space and perform entropy coding in parallel on a mobile GPU. This disclosure also describes ablations of the encoder architecture to provide a discussion of the effects of the motion compensation scheme and of quantization.
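The following is a minimal sketch of one way to realize overlapped block motion compensation with integer per-block motion vectors. The block size, the triangular blending window, and the replicate padding are simplifying assumptions and do not reproduce the exact warp function of this disclosure:

```python
import torch
import torch.nn.functional as F

def obmc_warp(ref, mv, block=8):
    """Overlapped block motion compensation (minimal sketch, integer MVs).

    ref: (C, H, W) float reference frame; H and W are multiples of `block`.
    mv:  (2, H // block, W // block) per-block integer motion (dy, dx).

    Each block is predicted from an extended (2*block x 2*block) patch of
    the reference centered on the motion-shifted block. Patches are
    weighted by a triangular window and accumulated, so neighboring blocks
    blend smoothly without the per-pixel gather of dense warping.
    """
    C, H, W = ref.shape
    b, ext, pad = block, 2 * block, block // 2
    # Separable triangular window tapering toward the patch border.
    w1d = 1.0 - (torch.arange(ext, dtype=ref.dtype) - (ext - 1) / 2).abs() * 2 / ext
    win = torch.outer(w1d, w1d)  # (ext, ext), strictly positive
    out = torch.zeros_like(ref)
    acc = torch.zeros(H, W, dtype=ref.dtype)
    # Replicate-pad so every motion-shifted patch stays in bounds.
    m = int(mv.abs().max().item()) + pad
    ref_p = F.pad(ref.unsqueeze(0), (m, m, m, m), mode="replicate")[0]
    for by in range(H // b):
        for bx in range(W // b):
            dy, dx = int(mv[0, by, bx]), int(mv[1, by, bx])
            y0 = by * b - pad + dy + m           # patch origin (padded coords)
            x0 = bx * b - pad + dx + m
            patch = ref_p[:, y0:y0 + ext, x0:x0 + ext] * win
            ty, tx = by * b - pad, bx * b - pad  # destination origin
            sy0, sx0 = max(0, -ty), max(0, -tx)  # clip at frame borders
            sy1 = ext - max(0, ty + ext - H)
            sx1 = ext - max(0, tx + ext - W)
            out[:, ty + sy0:ty + sy1, tx + sx0:tx + sx1] += patch[:, sy0:sy1, sx0:sx1]
            acc[ty + sy0:ty + sy1, tx + sx0:tx + sx1] += win[sy0:sy1, sx0:sx1]
    return out / acc  # normalize the overlapping window weights
```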

Neural video compression has made significant progress in recent years. In the low-latency P-frame setting, recent neural network-based compression techniques outperform reference implementations of standard codecs such as ITU-T H.265/HEVC. However, current neural codecs are typically computationally expensive, encoding and decoding are not real time, and reported runtimes are often measured on powerful desktop GPUs. In addition, many neural video codecs assume the availability of pixel-based or feature-based warping operations, which can be memory intensive and difficult to implement efficiently on resource-constrained devices such as mobile phones.

Standard codecs, on the other hand, typically have fast software implementations, or efficient silicon implementations designed specifically for consumer hardware. Although efficient neural codecs exist, they typically (1) replace dense optical-flow warping with a convolutional motion compensation network, and (2) use a scale-only hyperprior. However, both choices negatively affect rate-distortion (R-D) performance, even before weight and activation quantization is applied.

This disclosure describes a neural P-frame codec architecture designed for deployment on mobile devices (sometimes referred to as "QODEC," a quantized on-device end-to-end codec). In some examples, the architecture includes three mean-scale hyperpriors and prediction networks for flow and residual. Reducing the model width, removing the residual predictor, and removing redundant warping operations can reduce computational complexity. The warping operator itself can be implemented efficiently using a block-based motion compensation algorithm.

In some examples, quantized weights and activations may be 8-bit values, which can further improve efficiency. The techniques of this disclosure can use an efficient quantization scheme for the scale of each mean-scale hyperprior. Naive quantization of the hyperprior mean, however, can cause a catastrophic loss in R-D performance. This disclosure therefore describes alternative techniques involving low-precision quantization of the mean and scale parameters of the mean-scale hyperprior. A GPU can perform parallel entropy coding, massively increasing parallelism to tens of thousands of threads and thereby allowing extremely efficient entropy decoding on mobile devices.

These techniques can yield BD-rate savings while enabling, for example, 30 fps full-HD real-time decoding on mobile devices. In addition, the warping operator and the quantization scheme can be selected deliberately, which makes it possible to identify the key factors in an effective, mobile-friendly neural codec design.

A neural codec is a neural network-based machine learning system that is trained to compress data from examples. One model used for neural image compression is the mean-scale hyperprior. This model is a hierarchical variational autoencoder with quantized latent variables, sometimes called a compressive autoencoder.
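A short sketch of the rate computation in such a model may help: the hyper-decoder predicts a mean and a scale per latent element, the latent is rounded onto an integer grid, and the bit cost follows from the Gaussian probability of each integer bin. The rounding-around-the-mean step is a common convention and an assumption here, as details vary between codecs:

```python
import torch

def gaussian_rate_bits(y_hat, mean, scale):
    """Estimated rate in bits for quantized latents y_hat under a
    mean-scale Gaussian entropy model. The probability of an integer bin
    is P(y) = Phi((y - mu + 0.5) / sigma) - Phi((y - mu - 0.5) / sigma)."""
    gauss = torch.distributions.Normal(mean, scale.clamp(min=1e-6))
    p = gauss.cdf(y_hat + 0.5) - gauss.cdf(y_hat - 0.5)
    return -torch.log2(p.clamp(min=1e-9)).sum()

# Quantizing the latent around the predicted mean keeps the coded symbols
# integer-valued (a common choice for mean-scale hyperpriors, assumed here):
y = torch.randn(1, 192, 16, 16)     # latent from the encoder (example shape)
mean = torch.zeros_like(y)          # predicted by the hyper-decoder
scale = torch.ones_like(y)          # predicted by the hyper-decoder
y_hat = torch.round(y - mean) + mean
bits = gaussian_rate_bits(y_hat, mean, scale)
```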

After their success in the image domain (e.g., coding of still images), neural codecs were extended to the video setting. Subsequent research used motion compensation and residual coding with task-specific autoencoders. These architectures were further enhanced with prediction models that predict the flow, the residual, or both, leading to improved R-D performance. Recent work has shown that conditional coding can be more powerful than residual coding, but conditional coding can suffer from accumulated errors.

Neural image codecs outperform the strongest standard image codecs, but their strong performance typically comes with increased computational complexity. Although many works now report runtime or memory usage, the deployment of neural image codecs has received less attention.

A common approach to reducing complexity is quantization. For neural compression, the cross-platform reproducibility of quantization has been studied, because entropy coding is sensitive and can break due to floating-point rounding errors. Post-training quantization has been studied for both weights and activations, with the goal of closing the BD-rate gap between integer-quantized models and their floating-point counterparts. For example, channel splitting splits and quantizes the convolution output channels that are most sensitive to quantization using custom dynamic ranges, while pruning other channels to keep the floating-point operation (FLOP) complexity roughly constant. Such techniques, however, typically assume per-channel quantization. Research has shown that per-channel weight quantization can benefit from efficient integer arithmetic.
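The two quantization granularities mentioned here can be sketched as follows: symmetric per-output-channel quantization of weights, and per-tensor fake quantization of activations as used in quantization-aware training. The ranges and rounding choices below are assumptions:

```python
import torch

def quantize_weight_per_channel(w, bits=8):
    """Symmetric per-output-channel weight quantization (sketch).
    w: (out_channels, ...) float tensor. Returns the int8 tensor and the
    per-channel scale needed to dequantize (w_q * scale approximates w)."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for int8
    w_flat = w.reshape(w.shape[0], -1)
    scale = w_flat.abs().amax(dim=1).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w_flat / scale[:, None]), -qmax - 1, qmax)
    return w_q.reshape(w.shape).to(torch.int8), scale

def fake_quantize_activation(x, scale, zero_point, bits=8):
    """Per-tensor asymmetric activation fake quantization (sketch):
    quantize-dequantize in float so the model trains with quantization
    noise. A real QAT setup would route gradients through `round` with a
    straight-through estimator."""
    qmin, qmax = 0, 2 ** bits - 1
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (x_q - zero_point) * scale
```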

Because the techniques of this disclosure can be implemented on device, they can use a hardware-friendly per-channel weight and per-layer activation quantization scheme for the final model. Bottleneck quantization in the mean-scale hyperprior structure can be implemented in various ways and requires careful consideration when quantizing activations and performing entropy coding. This disclosure describes how to perform quantization in the bottleneck, particularly for the latent and mean paths. This disclosure also describes how a parameterization of the scale parameters can achieve essentially no loss in rate performance when 8-bit quantization is applied to the scales.
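One common way to make the scale path robust to 8-bit quantization, shown here as an assumed parameterization rather than the exact scheme of this disclosure, is to restrict scales to a log-spaced table and transmit only an 8-bit index. Scales span orders of magnitude, so spacing them uniformly in the log domain preserves the small scales where the rate is most sensitive:

```python
import math
import torch

def build_scale_table(s_min=0.11, s_max=256.0, levels=256):
    """Log-spaced table of entropy-model scales (assumed bounds). An 8-bit
    index into this table then stands in for the float scale."""
    return torch.exp(torch.linspace(math.log(s_min), math.log(s_max), levels))

def quantize_scale_to_index(sigma, table):
    """Map each predicted scale to the index of the nearest table entry,
    nearest in the log domain to match the table's spacing."""
    logs = sigma.clamp(min=float(table[0]), max=float(table[-1])).log()
    idx = (logs.reshape(-1, 1) - table.log().reshape(1, -1)).abs().argmin(dim=1)
    return idx.reshape(sigma.shape).to(torch.uint8)
```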

In the video setting, some techniques measure computational complexity by runtime or by multiply-add operations. For example, specific convolutional blocks can be used to improve inference speed and BD-rate. Overfitting the codec to the instance being compressed can substantially reduce receiver-side computational complexity. A neural codec that decodes video on device can use per-channel model quantization and parallel entropy coding, including for the motion compensation subnetwork.

The techniques of this disclosure include efficient block-based warping combined with a prediction-model architecture in YUV 4:2:0 space. In addition, these techniques include massively parallelized entropy coding on the GPU. As a result, these techniques can decode full-HD video (1080p resolution) at 30 fps on a mobile device.

FIG. 1 is a block diagram illustrating an example media encoding and decoding system that may perform the techniques of this disclosure. In the context of this disclosure, media may include any digital file to be compressed, including video data and/or images. The example techniques of this disclosure generally relate to coding (encoding and/or decoding) video data and/or image data. Although the example of FIG. 1 is described with reference to media encoding and decoding, the techniques of this application are equally applicable to encoding and decoding any type of data file using neural-based compression techniques.

As shown in FIG. 1, in this example, system 100 includes a source device 102 that provides encoded media data to be decoded and displayed by a destination device 116. In particular, source device 102 provides the media data to destination device 116 via a computer-readable medium 110. Source device 102 and destination device 116 may comprise any of a wide range of devices, including desktop computers, notebook (i.e., laptop) computers, mobile devices, tablet computers, set-top boxes, telephone handsets such as smartphones, televisions, cameras, display devices, digital media players, video game consoles, video streaming devices, broadcast receiver devices, or the like. In some cases, source device 102 and destination device 116 may be equipped for wireless communication, and thus may be referred to as wireless communication devices.

In the example of FIG. 1, source device 102 includes video source 104, memory 106, video encoder 200, and output interface 108. Destination device 116 includes input interface 122, video decoder 300, memory 120, and display device 118. In accordance with this disclosure, video encoder 200 of source device 102 and video decoder 300 of destination device 116 may be configured to apply techniques for entropy coding in neural-based media compression systems. Thus, source device 102 represents an example of a video encoding device, while destination device 116 represents an example of a video decoding device. In other examples, a source device and a destination device may include other components or arrangements. For example, source device 102 may receive video data from an external video source, such as an external camera. Likewise, destination device 116 may interface with an external display device, rather than include an integrated display device.

The system 100 as shown in FIG. 1 is merely one example. In general, any digital media encoding and/or decoding device may perform techniques for entropy coding in neural-based media compression systems. Source device 102 and destination device 116 are merely examples of such coding devices, in which source device 102 generates coded video data for transmission to destination device 116. This disclosure refers to a "coding" device as a device that performs coding (encoding and/or decoding) of data. Thus, video encoder 200 and video decoder 300 represent examples of coding devices, in particular a video encoder and a video decoder, respectively. In some examples, source device 102 and destination device 116 may operate in a substantially symmetrical manner, such that each of source device 102 and destination device 116 includes video encoding and decoding components. Hence, system 100 may support one-way or two-way video transmission between source device 102 and destination device 116, e.g., for video streaming, video playback, video broadcasting, or video telephony.

In general, video source 104 represents a source of video data (i.e., raw, unencoded video data) and provides a sequential series of pictures (also referred to as "frames") of the video data to video encoder 200, which encodes data for the pictures. Video source 104 of source device 102 may include a video capture device, such as a video camera, a video archive containing previously captured raw video, and/or a video feed interface to receive video from a video content provider. As a further alternative, video source 104 may generate computer graphics-based data as the source video, or a combination of live video, archived video, and computer-generated video. In each case, video encoder 200 encodes the captured, pre-captured, or computer-generated video data. Video encoder 200 may rearrange the pictures from the received order (sometimes referred to as "display order") into a coding order for coding. Video encoder 200 may generate a bitstream including the encoded video data. Source device 102 may then output the encoded video data via output interface 108 onto computer-readable medium 110 for reception and/or retrieval by, e.g., input interface 122 of destination device 116.

Memory 106 of source device 102 and memory 120 of destination device 116 represent general-purpose memories. In some examples, memories 106, 120 may store raw video data, e.g., raw video from video source 104 and raw, decoded video data from video decoder 300. Additionally or alternatively, memories 106, 120 may store software instructions executable by, e.g., video encoder 200 and video decoder 300, respectively. Although memory 106 and memory 120 are shown separately from video encoder 200 and video decoder 300 in this example, it should be understood that video encoder 200 and video decoder 300 may also include internal memories for functionally similar or equivalent purposes. Furthermore, memories 106, 120 may store encoded video data, e.g., output from video encoder 200 and input via input interface 122 to video decoder 300. In some examples, portions of memories 106, 120 may be allocated as one or more buffers, e.g., to store raw, decoded, and/or encoded video data.

Computer-readable medium 110 may represent any type of medium or device capable of transporting the encoded video data from source device 102 to destination device 116. In one example, computer-readable medium 110 represents a communication medium that enables source device 102 to transmit encoded video data directly to destination device 116 in real time, e.g., via a radio-frequency network or a computer-based network. Output interface 108 may modulate a transmission signal including the encoded video data, and input interface 122 may demodulate the received transmission signal, according to a communication standard such as a wireless communication protocol. The communication medium may comprise any wireless or wired communication medium, such as a radio-frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide area network, or a global network such as the Internet. The communication medium may include routers, switches, base stations, or any other equipment that may be useful to facilitate communication from source device 102 to destination device 116.

In some examples, source device 102 may output encoded data from output interface 108 to storage device 112. Similarly, destination device 116 may access encoded data from storage device 112 via input interface 122. Storage device 112 may include any of a variety of distributed or locally accessed data storage media, such as a hard drive, Blu-ray discs, DVDs, CD-ROMs, flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded video data.

In some examples, source device 102 may output the encoded video data to file server 114, or another intermediate storage device that may store the encoded video data generated by source device 102. Destination device 116 may access the stored video data from file server 114 via streaming or download.

File server 114 may be any type of server device capable of storing encoded video data and transmitting that encoded video data to destination device 116. File server 114 may represent a web server (e.g., for a website), a server configured to provide a file transfer protocol service (such as File Transfer Protocol (FTP) or File Delivery over Unidirectional Transport (FLUTE) protocol), a content delivery network (CDN) device, a hypertext transfer protocol (HTTP) server, a Multimedia Broadcast Multicast Service (MBMS) or Enhanced MBMS (eMBMS) server, and/or a network attached storage (NAS) device. File server 114 may, additionally or alternatively, implement one or more HTTP streaming protocols, such as Dynamic Adaptive Streaming over HTTP (DASH), HTTP Live Streaming (HLS), Real-Time Streaming Protocol (RTSP), HTTP Dynamic Streaming, or the like.

Destination device 116 may access encoded video data from file server 114 through any standard data connection, including an Internet connection. This may include a wireless channel (e.g., a Wi-Fi connection), a wired connection (e.g., a digital subscriber line (DSL), a cable modem, etc.), or a combination of both, that is suitable for accessing encoded video data stored on file server 114. Input interface 122 may be configured to operate according to any one or more of the various protocols discussed above for retrieving or receiving video data from file server 114, or other such protocols for retrieving video data.

Output interface 108 and input interface 122 may represent wireless transmitters/receivers, modems, wired networking components (e.g., Ethernet cards), wireless communication components that operate according to any of a variety of IEEE 802.11 standards, or other physical components. In examples where output interface 108 and input interface 122 comprise wireless components, output interface 108 and input interface 122 may be configured to transfer data, such as encoded video data, according to a cellular communication standard, such as 4G, 4G-LTE (Long-Term Evolution), LTE Advanced, 5G, or the like. In some examples where output interface 108 comprises a wireless transmitter, output interface 108 and input interface 122 may be configured to transfer data, such as encoded video data, according to other wireless standards, such as an IEEE 802.11 specification, an IEEE 802.15 specification (e.g., ZigBee™), a Bluetooth™ standard, or the like. In some examples, source device 102 and/or destination device 116 may include respective system-on-a-chip (SoC) devices. For example, source device 102 may include an SoC device to perform the functionality attributed to video encoder 200 and/or output interface 108, and destination device 116 may include an SoC device to perform the functionality attributed to video decoder 300 and/or input interface 122.

The techniques of this disclosure may be applied to video coding in support of any of a variety of multimedia applications, such as over-the-air television broadcasts, cable television transmissions, satellite television transmissions, Internet streaming video transmissions (such as dynamic adaptive streaming over HTTP (DASH)), digital video that is encoded onto a data storage medium, decoding of digital video stored on a data storage medium, or other applications.

Input interface 122 of destination device 116 receives an encoded video bitstream from computer-readable medium 110 (e.g., a communication medium, storage device 112, file server 114, or the like). The encoded video bitstream may include signaling information defined by video encoder 200 that is also used by video decoder 300. Display device 118 displays decoded pictures of the decoded video data to a user. Display device 118 may represent any of a variety of display devices, such as a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or another type of display device.

In accordance with the techniques of this disclosure, video decoder 300 may be a quantized, efficient on-device decoder. Video decoder 300 may use a warping- and residual-based hyperprior model designed for the YUV 4:2:0 color space that runs efficiently on device. Video decoder 300 may use an efficient block-based warping algorithm and take advantage of the lower dimensionality of flow and pixel-space inputs, which can reduce model complexity. In addition, video decoder 300 may model weights and activations with 8-bit integers. Video decoder 300 may also use a parallel entropy coding algorithm on GPU cores, and a pipelined inference algorithm that exploits the multiple cores available in a neural processor. Together, these various techniques may allow video decoder 300 to decode full-HD video (1080×1920) at more than 30 fps.

For example, video decoder 300 may receive YUV 4:2:0 input and use a flow predictor. The YUV color space may align better with human perceptual quality than other color spaces, and the 4:2:0 subsampling scheme exploits the difference in human visual sensitivity between luminance and color. In particular, video decoder 300 may subsample the chroma channels by a factor of two (2x) along the height and width dimensions, resulting in a 2x reduction in the number of elements compared to the RGB or YUV 4:4:4 color spaces, which in turn reduces network complexity.
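The element-count saving is easy to verify; the following short sketch computes the size of one YUV 4:2:0 frame:

```python
def yuv420_elements(height: int, width: int) -> int:
    """Elements in one YUV 4:2:0 frame: a full-resolution luma (Y) plane
    plus two chroma (U, V) planes subsampled 2x in height and width."""
    return height * width + 2 * (height // 2) * (width // 2)

# For a full-HD frame this is 1.5 * H * W elements, i.e., half of the
# 3 * H * W needed for RGB or YUV 4:4:4, shrinking every downstream tensor:
assert yuv420_elements(1080, 1920) == 3_110_400   # vs 6_220_800 for 4:4:4
```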

Although not shown in FIG. 1, in some examples, video encoder 200 and video decoder 300 may each be integrated with an audio encoder and/or audio decoder, and may include appropriate MUX-DEMUX units, or other hardware and/or software, to handle multiplexed streams including both audio and video in a common data stream.

Video encoder 200 and video decoder 300 each may be implemented as any of a variety of suitable encoder and/or decoder circuitry, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, software, hardware, firmware, or any combinations thereof. When the techniques are implemented partially in software, a device may store instructions for the software in a suitable, non-transitory computer-readable medium, and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Each of video encoder 200 and video decoder 300 may be included in one or more encoders or decoders, either of which may be integrated as part of a combined encoder/decoder (CODEC) in a respective device. A device including video encoder 200 and/or video decoder 300 may comprise an integrated circuit, a microprocessor, and/or a wireless communication device, such as a cellular telephone.

Entropy coding is a fundamental part of video compression systems. The entropy coding process is responsible for optimizing the conversion between video information and the compressed data bitstream, aiming to obtain the most compact representation possible. Unlike other elements of video compression, entropy coding is a lossless process; that is, it fully preserves information.

Several techniques have been developed to achieve efficient entropy coding in image and video compression standards. More recently, it has been shown that new compression methods based on deep learning and neural networks are approaching the performance of conventional methods, while providing several other practical advantages.

FIG. 2 is a block diagram illustrating an example computing device that may perform the techniques of this disclosure. Computing device 402 may include a mobile device (such as, e.g., a smartphone, mobile telephone, cellular telephone, satellite telephone, and/or mobile telephone handset), a personal computer, a desktop computer, a laptop computer, a computer workstation, a video game platform or console, a landline telephone, an Internet telephone, a handheld device (such as a portable video game device or a personal digital assistant (PDA)), a personal music player, a video player, a display device, a television, a television set-top box, a server, an intermediate network device, a mainframe computer, a mobile computing device, a vehicle head unit, a self-driving or autonomous vehicle, a robot, or any other type of device with imaging or video capabilities.

As shown in the example of FIG. 2, computing device 402 includes a user input interface 404, a CPU 406, a memory controller 408, a system memory 410, a graphics processing unit (GPU) 412, a neural network signal processor (NSP) 430, a frame interpolation (FINT) core 432 that may be implemented in NSP 430, a local memory 414, a display interface 416, a display 418, a bus 420, and one or more cameras 424. User input interface 404, CPU 406, memory controller 408, GPU 412, FINT core 432, NSP 430, display interface 416, and the one or more cameras 424 may communicate with each other using bus 420. Bus 420 may be any of a variety of bus structures, such as a third-generation bus (e.g., a HyperTransport bus or an InfiniBand bus), a second-generation bus (e.g., an Advanced Graphics Port bus, a Peripheral Component Interconnect (PCI) Express bus, or an Advanced eXtensible Interface (AXI) bus), or another type of bus or device interconnect. It should be noted that the specific configuration of buses and communication interfaces between the different components shown in FIG. 2 is merely exemplary, and other configurations of computing devices and/or other graphics processing systems with the same or different components may be used to implement the techniques of this disclosure.

The one or more cameras 424 may include any image capture hardware that includes one or more image sensors and one or more lenses, and that is configured to capture at least one frame of image data and transmit the at least one frame of image data to CPU 406, GPU 412, NSP 430, and/or FINT core 432.

CPU(s) 406 may include one or more general-purpose and/or special-purpose processors that control operation of computing device 402. A user may provide input to computing device 402 to cause CPU 406 to execute one or more software applications. The software applications that execute on CPU 406 may include, for example, an operating system, a word processor application, an email application, a spreadsheet application, a media player application, a video game application, a graphical user interface application, and/or other programs. The user may provide input to computing device 402 via one or more input devices (not shown), such as a keyboard, a mouse, a microphone, a touch pad, or another input device that is coupled to computing device 402 via user input interface 404.

Memory controller 408 facilitates the transfer of data going into and out of system memory 410. For example, memory controller 408 may receive memory read and write commands, and service such commands with respect to system memory 410 in order to provide memory services for the components in computing device 402. Memory controller 408 is communicatively coupled to system memory 410. Although memory controller 408 is illustrated in the example computing device 402 of FIG. 2 as being a processing module that is separate from both CPU(s) 406 and system memory 410, in other examples, some or all of the functionality of memory controller 408 may be implemented on one or both of CPU(s) 406 and system memory 410.

System memory 410 may store program modules and/or instructions that are accessible for execution by CPU 406 and/or data for use by the programs executing on CPU 406. For example, system memory 410 may store user applications and graphics data associated with the applications. System memory 410 may additionally store information for use by and/or generated by other components of computing device 402. For example, system memory 410 may act as a device memory for the one or more GPUs 412, and may store data to be operated on by GPU 412 as well as data resulting from operations performed by GPU 412. System memory 410 may include one or more volatile or non-volatile memories or storage devices, such as random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a magnetic data medium, or an optical storage medium.

In some aspects, system memory 410 may include instructions that cause the CPU(s) 406, GPU(s) 412, NSP(s) 430, and/or FINT core 432 to perform the functions attributed in this disclosure to the CPU(s) 406, GPU(s) 412, NSP(s) 430, and FINT core 432. Accordingly, system memory 410 may be a computer-readable storage medium having instructions stored thereon that, when executed, cause one or more processors (e.g., CPU(s) 406, GPU(s) 412, NSP(s) 430, and FINT core 432) to perform various functions.

In some examples, system memory 410 is a non-transitory storage medium. The term "non-transitory" indicates that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that system memory 410 is non-movable or that its contents are static. As one example, system memory 410 may be removed from computing device 402 and moved to another device. As another example, memory substantially similar to system memory 410 may be inserted into computing device 402. In certain examples, a non-transitory storage medium may store data that can change over time (e.g., in RAM).

The GPU(s) 412 may be configured to perform graphics operations to render one or more graphics primitives to display 418. Thus, when one of the software applications executing on the CPU(s) 406 requires graphics processing, the CPU(s) 406 may provide graphics commands and graphics data to the GPU(s) 412 for rendering to display 418. The graphics commands may include, for example, draw commands (such as draw calls), GPU state programming commands, memory transfer commands, general-purpose computing commands, kernel execution commands, and so on. In some examples, the CPU(s) 406 may provide the commands and graphics data to the GPU(s) 412 by writing them to system memory 410, which is accessible to the GPU(s) 412. In some examples, the GPU(s) 412 may further be configured to perform general-purpose computing for applications executing on the CPU(s) 406.

In some cases, the GPU(s) 412 may be built with a highly parallel structure that provides more efficient processing of vector operations than the CPU(s) 406. For example, the GPU(s) 412 may include multiple processing elements configured to operate on multiple vertices or pixels in a parallel manner. In some cases, the highly parallel nature of the GPU(s) 412 may allow the GPU(s) 412 to draw graphics images (e.g., GUIs and two-dimensional (2D) and/or three-dimensional (3D) graphics scenes) onto display 418 more quickly than drawing the scenes directly to display 418 using the CPU(s) 406. Additionally, the highly parallel nature of the GPU(s) 412 may allow the GPU(s) 412 to process certain types of vector and matrix operations for general-purpose computing applications more quickly than the CPU(s) 406. In some examples, video encoder 200 or video decoder 300 may utilize the highly parallel structure of the GPU(s) 412 to perform parallel entropy coding in accordance with the techniques of this disclosure.

In some examples, the GPU(s) 412 may be integrated into the motherboard of computing device 402. In other examples, the GPU(s) 412 may reside on a graphics card installed in a port of the motherboard of computing device 402, or may otherwise be incorporated within a peripheral device configured to interoperate with computing device 402. In further examples, the GPU(s) 412 may be located on the same microchip as the CPU(s) 406, forming a system on a chip (SoC). The GPU(s) 412 and CPU(s) 406 may include one or more processors, such as one or more microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or other equivalent integrated or discrete logic circuitry.

The CPU(s) 406, GPU(s) 412, NSP(s) 430, and FINT core 432 may together be referred to as one or more processors 440. When describing various techniques that may be performed by the one or more processors 440, it should be understood that such techniques may be performed by one or more of the CPU(s) 406, GPU(s) 412, NSP(s) 430, and FINT core 432. It should also be understood that the techniques disclosed herein are not necessarily limited to being performed by the CPU(s) 406, GPU(s) 412, NSP(s) 430, and/or FINT core 432, but may instead be performed by any other suitable hardware, device, logic, circuitry, processing unit, etc. of computing device 402.

The GPU(s) 412 may be directly coupled to local memory 414. Thus, the GPU(s) 412 may read data from and write data to local memory 414 without having to use bus 420. In other words, the GPU(s) 412 may process data locally using local storage rather than off-chip memory. This allows the GPU(s) 412 to operate more efficiently by eliminating the need to read and write data via bus 420, which may experience heavy bus traffic. In some examples, however, the GPU(s) 412 may not include a separate cache and may instead utilize system memory 410 via bus 420. Local memory 414 may include one or more volatile or non-volatile memories or storage devices, such as random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic data media, or optical storage media.

As described, the CPU(s) 406 may offload graphics processing to the GPU(s) 412, such as tasks that require massively parallel operations. As one example, graphics processing requires massively parallel operations, and the CPU(s) 406 may offload such graphics processing tasks to the GPU(s) 412. However, other operations, such as matrix operations, may also benefit from the parallel processing capabilities of the GPU(s) 412. In these examples, the CPU(s) 406 may leverage the parallel processing capabilities of the GPU(s) 412 to cause the GPU(s) 412 to perform non-graphics-related operations.

The CPU(s) 406, GPU(s) 412, NSP(s) 430, and/or FINT core 432 may store rendered image data in a frame buffer allocated within system memory 410. Display interface 416 may retrieve the data from the frame buffer and configure display 418 to display the image represented by the rendered image data. In some examples, display interface 416 may include a digital-to-analog converter (DAC) configured to convert the digital values retrieved from the frame buffer into analog signals consumable by display 418. In other examples, display interface 416 may pass the digital values directly to display 418 for processing.

Display 418 may include a monitor, a television, a projection device, a liquid crystal display (LCD), a plasma display panel, a light-emitting diode (LED) array, a cathode-ray tube (CRT) display, electronic paper, a surface-conduction electron-emitter display (SED), a laser television display, a nanocrystal display, an organic light-emitting diode (OLED) display, or another type of display unit. Display 418 may be integrated within computing device 402. For example, display 418 may be the screen of a mobile telephone handset or a tablet computer. Alternatively, display 418 may be a stand-alone device coupled to computing device 402 via a wired or wireless communications link. For example, display 418 may be a computer monitor or flat-panel display connected to a personal computer via a cable or a wireless link.

System memory 410 may store a neural network model 422. The neural network model 422 may include one or more artificial neural networks (also referred to as neural networks) trained to receive one or more types of input data and to provide one or more types of output data in response.

A neural network (e.g., neural network model 422) may include a trainable or adaptive algorithm utilizing nodes that define rules. For example, a respective node of a plurality of nodes may utilize a function, such as a non-linear function or an if-then rule, to generate an output based on an input. A respective node of the plurality of nodes may be connected along edges to one or more different nodes of the plurality of nodes, such that the output of the respective node serves as an input to the different nodes. The functions may include parameters that may be determined or adjusted using a training set of inputs and desired outputs together with a learning rule, such as a back-propagation learning rule. The back-propagation learning rule may utilize one or more error measurements comparing the desired output to the output produced by the neural network, and may train the neural network by changing the parameters so as to minimize the one or more error measurements.

In some examples, neural network model 422 is trained to perform classification of input data. That is, neural network model 422 may be trained to label input data, classifying the input data into one or more kinds or classes. Neural network model 422 may perform classification of the input data by determining, for the input data, a confidence score for each of a plurality of classes, the confidence score indicating the degree to which the input data is believed to belong to the corresponding class. In other examples, neural network model 422 may determine a probability distribution over a set of classes indicating the probability that the input data belongs to each class in the set.

In some examples, neural network model 422 may be trained to perform computer vision tasks, such as image classification, object detection, and/or image segmentation. Such computer vision tasks may be useful for computer vision applications such as autonomous driving. For example, neural network model 422 may be trained to perform image classification to determine which objects are present in an image or video, such as by being trained to classify an image as including or not including a particular object and by assigning one or more labels to the image. In another example, neural network model 422 may be trained to perform object detection to detect objects in an image or video and specify the location of each object in the image, and neural network model 422 may be trained to assign one or more labels to each of the one or more objects in the image. In some examples, neural network model 422 may be trained to perform image segmentation to separate an image into regions depicting potentially meaningful areas for further processing.

In some examples, neural network model 422 may perform one or more computer vision tasks on images captured by the one or more cameras 424. That is, the one or more cameras 424 may capture images, and the one or more processors 440 may input the images captured by the one or more cameras 424 into neural network model 422 to perform one or more computer vision tasks on the images, such as image classification, object detection, and/or image segmentation.

In accordance with one or more aspects of this disclosure, video decoder 300 may be configured to: perform parallel entropy decoding on encoded video data from a received bitstream to generate entropy-decoded data; predict block-based motion vectors based on the entropy-decoded data to generate predicted motion vectors; decode motion vector residuals from the entropy-decoded data; add the motion vector residuals to the predicted motion vectors to generate block-based motion vectors; warp previously reconstructed video data with an overlapping-block-based warping function using the block-based motion vectors to generate predicted current video data; and sum the predicted current video data with a residual block to generate currently reconstructed video data.
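As a rough illustration of the decoder-side arithmetic just described, the following minimal Python/NumPy sketch assumes the entropy-decoding stage has already produced a flow residual and a pixel residual; the function and variable names, the identity flow extrapolator, and the caller-supplied warp function are illustrative assumptions rather than the actual implementation:

```python
import numpy as np

def decode_p_frame(prev_frame, prev_flow, delta_flow, residual, warp_fn):
    """Sketch of the P-frame reconstruction arithmetic described above.

    prev_frame: previously reconstructed frame (H, W)
    prev_flow:  previously reconstructed block-based flow (H//b, W//b, 2)
    delta_flow: entropy-decoded motion vector residual, same shape as prev_flow
    residual:   entropy-decoded pixel residual, same shape as prev_frame
    warp_fn:    motion compensation operator, e.g. an overlapped block warp
    """
    flow_pred = prev_flow                  # stand-in flow extrapolator
    flow = flow_pred + delta_flow          # additive flow prediction
    predicted = warp_fn(prev_frame, flow)  # warp previous reconstruction
    recon = predicted + residual           # add decoded residual block
    return recon, flow
```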

In accordance with aspects of this disclosure, video decoder 300 may decode a hyperlatent; extrapolate a first flow; decode a mean and a scale from the hyperlatent; decode a latent based on the scale; reconstruct a second flow and a residual based on the latent and the first flow; warp a previously reconstructed frame using the second flow to generate a warped frame; and add the residual to the warped frame.

FIG. 3 is a conceptual diagram illustrating an example of end-to-end deep learning for a neural video coder in accordance with one or more aspects of this disclosure. For example, a video coder (e.g., video encoder 200 and/or video decoder 300) may utilize neural-network-based I-frame and P-frame compression.

For example, at time T1, video encoder 200 may obtain an I-frame of video data (ground truth (GT) 500) and encode the I-frame using neural network model 422, which may include elements of video encoder 200 discussed herein. Video encoder 200 may encode the I-frame using an I-frame encoder (IFE) 502 and entropy encode the encoded I-frame using an entropy encoder (EE) 504.

Video decoder 300 may decode the encoded I-frame using neural network model 422 to reproduce a representation of the original I-frame. For example, video decoder 300 may entropy decode the entropy-encoded I-frame using an entropy decoder (ED) 506. Video decoder 300 may decode the encoded I-frame using an I-frame decoder (IFD) 508 to produce a reconstructed frame (recon) 510. The video decoder may use reconstructed frame 510 as an input to a warping function (warp) 512 for use with future decoded P-frames.

At time T2 (after time T1), video encoder 200 may use its own reproduced representation recon 510 of the original I-frame (video encoder 200 may include a local version of video decoder 300) and a P-frame (GT 514). Video encoder 200 may perform motion estimation (ME) 516 of GT 514 (the P-frame) of the video data using the I-frame (recon 510) as a reference frame. Motion estimation 516 may generate motion vectors and/or motion vector residuals, which may be encoded as part of encoding the P-frame. Video encoder 200 may encode the P-frame using neural network model 422. For example, video encoder 200 may encode GT 514 using a P-frame encoder (PFE) 518. Video encoder 200 may then entropy encode the P-frame using entropy encoder 504.

Video decoder 300 may entropy decode the entropy-encoded P-frame using entropy decoder 506. Video decoder 300 may decode the output of the entropy decoder using neural network model 422 to generate motion vectors and block residuals associated with the P-frame. For example, video decoder 300 may use a P-frame decoder (PFD) 520 to generate motion vectors (MV) 522 and block residuals (resid) 524. Video decoder 300 may use warp 512 to warp reconstructed I-frame 510 using motion vectors 522, and sum the resulting warped frame with block residuals 524 to generate a reconstructed P-frame (recon 528). The output 530 of neural network model 422 (e.g., decoded P-frame information, such as motion vectors 522, block residuals 524, or other decoded P-frame information) may be fed back into the neural network (e.g., P-frame encoder 518 and/or P-frame decoder 520 of neural network model 422) to train P-frame encoder 518 and P-frame decoder 520, as illustrated, for encoding and/or decoding a new P-frame (GT 532).

FIG. 4 is an architecture diagram illustrating an example of a neural video codec in accordance with one or more aspects of this disclosure. Neural codec 600 may be an example of video encoder 200 and/or video decoder 300. For example, transmitter 602 may be an example of source device 102 including video encoder 200, and receiver 604 may be an example of destination device 116 including video decoder 300. Although neural codec 600 is depicted as including elements of a video encoder (of transmitter 602) and elements of a video decoder (of receiver 604), it should be noted that video decoder 300 may be implemented in a device alone or together with video encoder 200.

The input video data x_t that video encoder 200 may encode and the reconstructed previous frame x̂_{t−1} may be in the YUV420 color space. By using the YUV420 color space, the complexity of the computations that video encoder 200 or video decoder 300 performs can be reduced compared to operating on video data in other color spaces, such as RGB. Additionally, the YUV420 color space may be better aligned with human perceptual quality, and may therefore improve the perceived quality of the decoded reproduction of the input video data.

Video encoder 200 or video decoder 300 may include a prediction architecture with a relatively small flow extrapolator 610. Flow extrapolator 610 may add minimal computational overhead while improving compression performance. Thus, flow extrapolator 610 may be an efficient flow extrapolator that provides better compression performance at minimal computational cost.

Video encoder 200 or video decoder 300 may include FINT core 432, which may be implemented in NSP 430, to perform a block-based warp 612 of the reconstructed previous frame x̂_{t−1} using the motion vectors f̂_t; the warp may be an overlapping-block-based warp.

Video encoder 200 or video decoder 300 may use a parallelized entropy coding algorithm to encode or decode the bitstream quickly and efficiently, for example on GPU 412. For example, video encoder 200 or video decoder 300 may use residual autoencoder 620 to perform parallelized entropy coding. Parallel entropy coding may include, for example, utilizing multiple entropy decoding instances or operations simultaneously. Autoencoder 620 may be implemented on GPU 412.
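As a toy illustration of running many entropy-decoder instances at once, the sketch below splits a bitstream into independently decodable chunks and decodes them concurrently; the chunking scheme and the placeholder decoder (which simply unpacks bytes instead of running arithmetic decoding against the hyperprior's probability tables) are assumptions for illustration only:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def decode_chunk(chunk: bytes) -> np.ndarray:
    # Placeholder: a real codec would run an arithmetic decoder over this
    # chunk using the probabilities predicted by the hyperprior.
    return np.frombuffer(chunk, dtype=np.uint8)

def parallel_entropy_decode(chunks: list[bytes], workers: int = 8) -> np.ndarray:
    # Each chunk is self-contained, so all chunks can be decoded at once
    # (one GPU work-item per chunk in an on-device implementation).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(decode_chunk, chunks))
    return np.concatenate(parts)
```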

The neural network components of video encoder 200 or video decoder 300 may be pipelined such that all subsystems (e.g., CPU(s) 406, GPU(s) 412, NSP(s) 430, FINT core 432, etc.) work simultaneously to achieve a desired frame rate.

Neural network model 422 may employ int8 quantization for efficient inference. int8 uses 8-bit integers rather than floating-point numbers, and integer math rather than floating-point math, which can reduce memory and computation requirements. In some examples, video encoder 200 or video decoder 300 may use customized quantization operations for the entropy coding process to improve compression performance. The training performed on neural network model 422 may be quantization-aware.

Using additive flow prediction (e.g., addition 630) rather than warping the flow removes a computationally expensive warping operation from video decoder 300. Using only the Y channel (luminance only) 640 as the input to flow autoencoder 650 provides greater computational efficiency for flow estimation.

Because of the overlapping-block-based warp, the motion vectors are lower-dimensional, resulting in reduced computational complexity. Flow autoencoder 650 is further optimized by inputting only the Y channel 640 of the pixel-space input. The example neural codec 600 (which may be a neural video codec) may be compared with previous neural codecs. Specifically, neural codec 600 can be made hardware-friendly by using the efficient block-based warp operator 612 and by making flow extrapolator network 610 more lightweight. Furthermore, compared with other neural codecs, the computation graph may be changed to reduce the number of warp operations.

Neural codec 600 may include three mean-scale hyperprior autoencoders, e.g., flow extrapolator 610, residual autoencoder 620, and flow autoencoder 650. An I-frame autoencoder (e.g., I-frame encoder 502 and/or I-frame decoder 508 of FIG. 3) compresses the first frame x_1 in each group of pictures (GoP) and/or outputs a reconstructed frame x̂_1. The P-frame model (e.g., P-frame encoder 518 and/or P-frame decoder 520 of FIG. 3) may include a flow predictor, flow extrapolator 610, which makes a prediction f̂_t^pred of the current flow based on the reconstructed flow from the last time step t−1. The P-frame encoder then compresses the residual flow relative to this prediction: flow autoencoder 650 takes as input the ground-truth frame x_t and the Y channel of the last reconstructed frame warped with the current flow prediction. The P-frame decoder then outputs a δ-flow, which is added to the predicted flow to obtain the reconstructed flow f̂_t = δf_t + f̂_t^pred. Finally, the frame is warped using the reconstructed flow, x̄_t = warp(x̂_{t−1}, f̂_t), and the ground-truth residual r_t = x_t − x̄_t is compressed, decoded (r̂_t), and added to the warped frame on the decoder side: x̂_t = x̄_t + r̂_t.

For example, video decoder 300 may perform parallel entropy decoding on encoded video data from a received bitstream to generate entropy-decoded data. Video decoder 300 may predict (e.g., using flow extrapolator 610) block-based motion vectors based on the entropy-decoded data to generate predicted motion vectors f̂_t^pred. Video decoder 300 may decode a motion vector residual δf_t from the entropy-decoded data (e.g., using flow autoencoder 650). Video decoder 300 may add (e.g., using addition 630) the motion vector residual δf_t to the predicted motion vectors f̂_t^pred to generate block-based motion vectors f̂_t. Video decoder 300 may warp the previously reconstructed video data x̂_{t−1} using the block-based motion vectors f̂_t with an overlapping-block-based warping function (e.g., using warp 612) to generate predicted current video data x̄_t. Video decoder 300 may sum (e.g., using summation 614) the predicted current video data x̄_t with a residual block r̂_t to generate currently reconstructed video data x̂_t.

FIG. 5 is a conceptual diagram illustrating an example of block-based warping in accordance with one or more aspects of this disclosure. Some neural video codecs employ dense RGB scale-space warping, which requires high-dimensional RGB inputs and outputs as well as high-dimensional motion vector inputs. Processing such high-dimensional data is computationally expensive and power-hungry. By using the efficient block-based warp 612, for example on FINT core 432, the computational complexity can be reduced and power can be saved. In the example of FIG. 5, warp 612 may include an efficient block-based YUV420 warping technique that may be utilized by video encoder 200 or video decoder 300. YUV420 input is lower-dimensional than RGB (e.g., represented by fewer bits). Using block-based motion vectors can reduce the flow size. Video encoder 200 or video decoder 300 may apply a blending filter to reduce blocking artifacts. In some examples, the example of FIG. 5 may be implemented in FINT core 432.

Motion compensation is considered a fundamental component of neural video codecs. In many neural video codecs, dense per-pixel warping with backward mapping is used, in which a frame x is warped using a flow field f. For each pixel i, j in the warped frame, a value is looked up from the reference frame as follows:

warp_dense(x, f)_{i,j} = x[i + f_x[i,j], j + f_y[i,j]].

Here [·] denotes array indexing, and the x and y subscripts denote retrieval of the corresponding coordinate of the vector field f. For non-integer motion vectors, bilinear or bicubic interpolation can typically be used to compute the pixel intensities. Many previous neural codecs use a scale-space version of warping, in which a third, blur coordinate is given. Although the scale-space version of warping can improve compression performance, it may also introduce additional complexity.

Because the motion vectors f (which may also be referred to as flow or optical flow) often contain large uniform regions, block-based warping can be used as an alternative to dense warping with lower computational complexity. In block-based warping, the warped frame may be divided into blocks of size b × b. All pixels inside a block are looked up from the reference frame using a single shared motion vector. The frame may be warped as follows:

warp_block(x, f, b)_{i,j} = x[i + f_x[⌊i/b⌋, ⌊j/b⌋], j + f_y[⌊i/b⌋, ⌊j/b⌋]].
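The following NumPy sketch implements this per-block lookup under simplifying assumptions (integer motion vectors and edge clamping instead of sub-pixel interpolation); it is illustrative only:

```python
import numpy as np

def warp_block(x: np.ndarray, f: np.ndarray, b: int) -> np.ndarray:
    """Block-based backward warp: every pixel in a b-by-b block shares one
    motion vector. x is an (H, W) frame; f is an (H//b, W//b, 2) array of
    integer (dy, dx) vectors."""
    H, W = x.shape
    out = np.empty_like(x)
    for bi in range(H // b):
        for bj in range(W // b):
            dy, dx = f[bi, bj]
            # One shared lookup for the whole block, clamped at frame edges.
            ys = np.clip(np.arange(bi * b, (bi + 1) * b) + int(dy), 0, H - 1)
            xs = np.clip(np.arange(bj * b, (bj + 1) * b) + int(dx), 0, W - 1)
            out[bi * b:(bi + 1) * b, bj * b:(bj + 1) * b] = x[np.ix_(ys, xs)]
    return out
```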

Block-based warping can be more efficient than other kinds of warping due to block-wise memory access. One drawback of block-based warping, however, is that artifacts may occur around the edges of blocks with differing motion vectors. The possibility of such artifacts can be addressed by using overlapping-block-based motion compensation, in which each block is warped multiple times using the N−1 surrounding motion vectors (in addition to its own), and the results are averaged using a kernel w ∈ R^{b×b×N} that decays towards the edges of the block.

For example, overlapped-block warp motion compensation may be expressed as:

warp_block-overlap(x, f, w, b)_{i,j} = Σ_{n=1}^{N} w[i mod b, j mod b, n] · x[i + f_x[⌊i/b⌋ + δ_n^y, ⌊j/b⌋ + δ_n^x], j + f_y[⌊i/b⌋ + δ_n^y, ⌊j/b⌋ + δ_n^x]],

where (δ_n^y, δ_n^x) is the relative position of the n-th neighboring block (e.g., top-left, center, bottom, etc.), x holds the pixel values of the frame to be warped, f holds the motion vectors (x and y displacements), w holds the weights of the blending kernel, and b is the block size. Note that the warp is applied in both the x and y directions.
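A possible NumPy realization of this formula is sketched below; the 5-candidate neighborhood (center plus 4-neighbors), the triangular blending window, the weight renormalization at frame borders, and the integer motion vectors are all simplifying assumptions rather than the codec's actual kernel w:

```python
import numpy as np

def tri(t: np.ndarray) -> np.ndarray:
    """Triangular window: 1 at its centre, falling to 0 one block away."""
    return np.maximum(0.0, 1.0 - np.abs(t))

def warp_block_overlap(x: np.ndarray, f: np.ndarray, b: int) -> np.ndarray:
    H, W = x.shape
    nbh, nbw = f.shape[:2]
    u = (np.arange(b) + 0.5) / b - 0.5    # pixel positions within a block
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    acc = np.zeros((H, W))
    wsum = np.zeros((H, W))
    for bi in range(nbh):
        for bj in range(nbw):
            rows = slice(bi * b, (bi + 1) * b)
            cols = slice(bj * b, (bj + 1) * b)
            for di, dj in offsets:
                ni, nj = bi + di, bj + dj
                if not (0 <= ni < nbh and 0 <= nj < nbw):
                    continue  # neighbour outside the frame: drop candidate
                dy, dx = f[ni, nj]
                ys = np.clip(np.arange(bi * b, (bi + 1) * b) + int(dy), 0, H - 1)
                xs = np.clip(np.arange(bj * b, (bj + 1) * b) + int(dx), 0, W - 1)
                # Weight decays with distance from the block owning the vector.
                w2d = np.outer(tri(u - di), tri(u - dj))
                acc[rows, cols] += w2d * x[np.ix_(ys, xs)]
                wsum[rows, cols] += w2d
    return acc / wsum
```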

As discussed below, overlapping-block-based warping can lead to better compression performance than plain block-based warping, and can match the performance of dense warping while being more efficient.

FIG. 6 is an architecture diagram illustrating an example of pipelined inference in video decoder 300. For example, video decoder 300 may process three frames in parallel (frame t, frame t+1, and frame t+2). This allows a throughput of more than 30 FPS for full HD video. In the example of FIG. 6, each stage of pipeline 700 is depicted together with its corresponding inputs and outputs, and is labeled based on an example of which processors of computing device 402 may execute the particular stage of pipeline 700. The horizontal width of each block of pipeline 700 represents the approximate runtime of that particular stage.

For example, video decoder 300 may decode the hyperlatent from the received bitstream. In some examples, video decoder 300 decodes the hyperlatent using GPU 412. Video decoder 300 may run flow extrapolation and decode the mean and scale from the hyperlatent on NSP 430. Video decoder 300 (e.g., via GPU 412) may then decode the latent based on the scale. Video decoder 300 (e.g., via NSP 430) may reconstruct the flow and the residual based on the latent and the extrapolated flow. Video decoder 300 may use the reconstructed flow to warp the reconstructed frame of the previous timestamp, for example on FINT core 432. Video decoder 300 may then add the reconstructed residual to the warped frame, for example on CPU 406, to produce the reconstructed P-frame.
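A toy software analogue of this three-frames-in-flight arrangement is sketched below; the stage functions are placeholders for the work the text assigns to GPU 412 (entropy decoding), NSP 430 (network inference), and FINT core 432/CPU 406 (warp and add), and a thread pool stands in for the hardware subsystems running concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stages; each returns a (toy) output for the next stage.
def stage_entropy(frame):       # e.g., hyperlatent/latent decode on the GPU
    return ("latents", frame)

def stage_network(latents):     # e.g., flow/residual reconstruction on the NSP
    return ("flow_and_residual", latents)

def stage_warp_and_add(parts):  # e.g., warp on the FINT core, add on the CPU
    return ("recon", parts)

def pipelined_decode(frames):
    """With three workers, up to three frames are in flight at once, so the
    stages of consecutive frames overlap in time."""
    def run(frame):
        return stage_warp_and_add(stage_network(stage_entropy(frame)))
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(run, frames))
```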

Neural codec 600 may use integer quantization for efficient inference. Most weights and activations can be quantized with little loss in performance. It should be understood that quantization of the latents used in entropy coding, as well as of the mean and scale, may require a specific implementation to avoid large performance degradation.

Neural codec 600 may include a neural network model (e.g., neural network model 422) that can achieve a throughput of 30+ frames per second (FPS) for full HD (1080×1920 pixels) YUV 4:2:0 video. Because different parts of inference can be implemented on different processors or subsystem processors (e.g., of NSP 430), a pipelined architecture such as that illustrated in FIG. 6 can make efficient use of the available computational power. In example pipeline 700, data from up to three past time steps may be processed simultaneously by using GPU 412, NSP 430, CPU 406, and FINT core 432 in parallel. Parallelization may be necessary to obtain the desired throughput. Unlike a multi-core CPU implementation that is used for various tasks at the same time, GPU(s) 412 can relatively easily be dedicated to performing entropy coding. Where NSP 430 and GPU 412 are configured to share memory, there may be no latency for copying data between the processing elements. By using only 8-bit integers for the data elements, the total amount of entropy-coded data to be processed can be reduced or minimized. The hyperprior network can be trained to minimize the compression loss caused by using low-precision integer parameters for entropy coding. This approach can be used to limit the amount of memory required for the coding tables to only 1,228 bytes.

The arithmetic coding functions may be implemented, for example, in OpenCL, with small-sized tables. With OpenCL's small-sized tables, the parallelization is defined primarily by the number of OpenCL work-items, and the work-group organization may not be critical.

FIG. 7 is a conceptual diagram of example neural video codec functions in accordance with one or more aspects of this disclosure. In the floating-point example of neural codec 600, only rounding (the round operator in the figure) is used to quantize the symbols s. For the quantized example of neural codec 600, the quantizers shown as triangles are used. For example, neural codec 600 may perform quantization, e.g., int8 quantization, at the positions of the triangles depicted in FIG. 7.

In the example of FIG. 7, y denotes the latent before transmission, s denotes the symbol, ŝ denotes the integer symbol used for entropy coding, s̃ denotes the noised symbol used for the rate proxy during training, ŷ denotes the reconstructed latent used for decoding, σ denotes the standard deviation used for entropy coding, and μ denotes the mean of the latent. In the example of FIG. 7, the symbols may lie on an integer grid for entropy coding (e.g., bin width = 1). The latent y before transmission and the mean μ over the latent may have sub-integer precision to obtain good compression performance. For example, the latent y before transmission and the mean μ over the latent may have a 16-bit length (e.g., using 16-bit quantization) and/or a bin width of 0.25. It should be noted that this may lead to fewer extreme values.
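The interplay between the integer symbol grid and the sub-integer mean can be sketched as follows; the bin widths shown (1 for symbols, 0.25 for the latent and mean) follow the text, while the function itself is an illustrative simplification of the bottleneck:

```python
import numpy as np

Q_SUB = 0.25  # sub-integer grid for the latent y and the mean mu

def quantize_bottleneck(y: np.ndarray, mu: np.ndarray):
    """Sketch of the latent bottleneck quantization described above."""
    y_q = np.round(y / Q_SUB) * Q_SUB     # latent on the 0.25 grid
    mu_q = np.round(mu / Q_SUB) * Q_SUB   # mean on the 0.25 grid
    s = np.round(y_q - mu_q)              # integer symbol (bin width 1),
                                          # zero-centred since E[y - mu] = 0
    y_hat = s + mu_q                      # decoder-side reconstruction
    return s, y_hat
```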

In some examples, a special parameterization of the scale σ may be used, as described in U.S. Patent Application No. 17/931,073, filed September 9, 2022, and/or U.S. Patent Application No. 17/814,426, filed July 22, 2022, the entire contents of both of which are incorporated herein by reference.

Existing neural video codecs are not necessarily designed for the YUV color space, may require expensive motion compensation techniques, employ theoretical bitrates or slow entropy coding, and rely on expensive floating-point operations. Consequently, existing neural video coders are challenging to implement on battery-constrained devices such as mobile devices (e.g., smartphones).

Because of the interplay between rounding, the addition of uniform noise, and activation quantization, care should be taken when quantizing the latent bottleneck. The latent bottleneck of the mean-scale hyperprior is illustrated in FIG. 7. Functions 800 include the functions executed during inference, while functions 850 include the functions executed during training. As noted above, for the floating-point version of neural codec 600, the quantizers shown as diamonds are not added.

Because the symbols passed to video decoder 300 are always rounded in the quantized version of neural codec 600, the symbol quantizer may have a bin width of 1; since E[y − μ] = 0, the quantizer is a zero-centered quantizer. The σ values may have a large dynamic range, and quantizing σ to a uniform grid may be detrimental to performance. This can be avoided by using an exponential activation function f(ρ) = ...., where ρ ∈ [0, 1]. This activation function may enable quantizing σ without a loss in performance (see row VI of FIG. 8).

FIG. 8 is a table depicting example results of neural codec model architecture ablations in accordance with one or more aspects of this disclosure. The neural codecs represented in the table of FIG. 8 are floating-point codecs trained using only the first training stage (e.g., 1 million training steps), except for model VII, which was trained with both the first and second training stages (see FIG. 13 below). The second training stage may be a fine-tuning stage using, for example, 250,000 steps. The parameters and kMACs/px are shown only for the P-frame model, and kMACs/px is computed on a 1080×1920 YUV420 input frame. For a detailed complexity analysis, see FIG. 14 below.

The question then remains how to quantize the latent y and the mean μ. It may seem reasonable to quantize the latent and the mean to the same grid as the symbols, with a bin width of 1 and an offset of 0. However, as shown in row VII of FIG. 8, this leads to an unfavorable drop in compression performance. Instead, neural codec 600 may be configured with sub-integer precision for the mean, so that neural codec 600 can correct for rounding errors in the latent. With a bin width of 0.25, performance improves almost twofold, although the range of possible latents is reduced (see row VIII of FIG. 8). Note that the scale-only hyperprior does not suffer from these issues, but as the table of FIG. 8 shows, that model has worse compression performance.

All numbers reported thus far were obtained using post-training quantization (PTQ). PTQ is known to be a suboptimal quantization technique and indeed leads to poor compression performance. Neural codec 600 can avoid this poor compression performance by following the PTQ stage with a quantization-aware training (QAT) stage, in which gradient descent is used to optimize both the model and the quantizer parameters. For example, with a QAT stage following the PTQ stage, compression performance improves, yielding a final performance of an 83% BD-rate increase relative to the floating-point model.

FIG. 9 is a conceptual diagram illustrating example techniques that may be implemented in an example video coder in accordance with one or more aspects of this disclosure. For example, video encoder 200 or video decoder 300 may include a YUV420 compression architecture as disclosed herein. For example, video encoder 200 or video decoder 300 may include block-based warping as disclosed herein. For example, video encoder 200 or video decoder 300 may include parallel entropy coding as disclosed herein. For example, video encoder 200 or video decoder 300 may include quantization and quantization-aware training, such as AI Model Efficiency Toolkit (AIMET) per-channel quantization-aware training, to improve quantized model accuracy. AIMET is an open-source library for optimizing trained neural network models.

FIG. 10 is a conceptual diagram illustrating an example of an on-device neural video codec in accordance with one or more aspects of this disclosure. In the example of FIG. 10, a device such as computing device 402 may include a video codec, which may include video encoder 200 and/or video decoder 300. Computing device 402 may capture video data using camera 424, for example in 1920×1080 (1080p) format. Video encoder 200 may encode the captured video data using neural network model 422 and may entropy encode the encoded video data into a bitstream. Video decoder 300 may entropy decode the received bitstream. Video decoder 300 may decode the entropy-decoded data to reconstruct the captured video data. Computing device 402 may display the reconstructed captured video data via display 418 at 30 or more frames per second in 1080p.

FIG. 11 is a set of graphs comparing the compression performance versus computation, and the rate-distortion performance, of various techniques against the techniques of this disclosure. As can be seen from FIG. 11, quantization can have a relatively large impact on performance.

The model complexity analysis 900 on the left shows compression performance versus computation. The BD rates may be computed from the curves in rate-distortion graph 950, where lower numbers are better. kMACs/px and BOPs/px may be computed using full HD input. Unlike MACs, binary operations (BOPs) account for the efficiency gains due to model quantization. The rate-distortion graph on the right shows the rate-distortion performance of the techniques of this disclosure and of various baselines.

Rate-distortion graph 950 shows the rate-distortion performance of neural codec 600 and of various baselines. Model complexity analysis 900 shows the BD rate versus model complexity for various neural codecs. In addition to FIG. 11, the BD-rate results are summarized in FIG. 15 below for readability. In the example of FIG. 11 and elsewhere, neural codec 600 may be denoted QODEC, fp32 indicates 32-bit floating-point values, and int8 indicates 8-bit fixed-point values. Looking at the floating-point neural codecs, the best-performing model is DCVC-DC, but it also has the largest model complexity. The floating-point version of neural codec 600 has the lowest model complexity (24.5 kMACs/px on the receiver side) and matches the BD rate of the SSF model, which has up to 8× the FLOP count. Compared with MobileCoder, which was also designed for on-device inference, neural codec 600 improves compression performance by 48% while also reducing model complexity to less than one tenth. It may be noted that at bitrates above 1 Mb/s, MobileCoder can outperform neural codec 600 in compression performance. However, such bitrates are rarely used in practice.

Quantized neural codecs can benefit from efficient integer hardware kernels. In terms of compression performance, none of these quantized neural codec models outperforms the floating-point neural baselines. However, these are the only end-to-end neural video codecs that have been demonstrated to decode in real time on mobile devices. Neural codec 600 outperforms MobileCoder int8, the only other codec shown to decode video in real time, with a 40% BD-rate saving.

On HEVC-B, neural codec 600 achieves an inference speed of more than 30 FPS. Although the encoding pipeline of neural codec 600 has not been optimized, neural codec 600 still achieves a practical encoding rate. As described above, FIG. 6 sets forth the approximate duration of each stage of the pipeline. The inference speed is bounded by the network processing, and thanks to parallelization, the warping operations and entropy coding do not incur any additional speed overhead. Compared with MobileCoder, neural codec 600 not only improves compression performance but also improves throughput, because neural codec 600 processes full HD (1080×1920) video rather than (720×1280) video.

Integer model quantization is now discussed. Quantizing all weights and activations of a neural network can greatly improve its power efficiency. The techniques of this disclosure include quantizing neural codec 600 to 8 bits using integer quantization with learned uniform grids, where a learned uniform grid is defined by a bin width s and a zero-point parameter a. For the network weights, neural codec 600 may learn a grid per output channel without using a zero point (e.g., symmetric per-channel quantization). For the activations, neural codec 600 may learn a hardware-friendly single-bin-width grid that includes a zero point (e.g., asymmetric per-tensor quantization). This quantization differs from that of other neural codecs, which quantize learned image compression networks using per-channel activation quantization. Using per-channel activation quantization may require rescaling the accumulator for every input channel, and therefore may not benefit from optimized integer arithmetic kernels in hardware (unless the scales are factors of 2). In some examples, the biases of neural codec 600 are not quantized.
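The two quantizer flavors can be sketched as follows ("fake quantization" that rounds to the grid and dequantizes again); the bin widths s and zero point a would be learned in the PTQ/QAT stages, and the function names and integer ranges are illustrative assumptions:

```python
import numpy as np

def quantize_weights_per_channel(w: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Symmetric per-output-channel weight quantization (no zero point).
    w has shape (out_ch, ...); s has one learned bin width per out_ch."""
    s = s.reshape((-1,) + (1,) * (w.ndim - 1))
    q = np.clip(np.round(w / s), -128, 127)    # signed int8 grid
    return q * s                               # dequantize ("fake quant")

def quantize_activations_per_tensor(x: np.ndarray, s: float, a: int) -> np.ndarray:
    """Asymmetric per-tensor activation quantization with zero point a."""
    q = np.clip(np.round(x / s) + a, 0, 255)   # unsigned 8-bit grid
    return (q - a) * s
```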

FIG. 12 is a table depicting example BD-rate performance results for various quantizations of an example neural codec in accordance with one or more aspects of this disclosure. In the example of FIG. 12, û indicates that a value is not quantized but is kept in 32-bit floating point, and ü indicates that a value is quantized to 8 bits using a learned grid.

Extensive model quantization experiments were performed, and it was found that the relatively popular mean-scale hyperprior model is very sensitive to the quantizer settings in the latent bottleneck. For neural codec 600, the grids of some activation quantizers may be fixed manually, and the grids of the remaining quantizers may be learned, as shown in FIG. 12. The effect of specific features of neural codec 600 on performance was evaluated by ablating various features and evaluating the neural codecs with those features ablated. The results are shown in FIG. 12.

The experimental results are now discussed. The two main axes along which neural codec 600 is evaluated are rate-distortion (R-D) performance and computational efficiency. The experimental results show reduced computational complexity, together with improved compression performance and BD-rate savings, compared with previous on-device neural codecs.

BD rate relative to x.265        PSNR YUV 6:1:1    PSNR Y only
QODEC int8                       141.6 %           107.7 %
MobileCoder int8 (Le, 2022)      340.2 %           288.4 %
QODEC fp32                       49.5 %            30.7 %
MobileCoder fp32 (Le, 2022)      191.0 %           141.6 %
DCVC-DC (Li, 2023)               -58.5 %           -57.62 %
SSF-Pred (Pourreza, 2023)        -24.3 %           10.2 %
SSF-YUV (Pourreza, 2023)         54.4 %            10.1 %

Table 1: BD-rate savings for different baselines on the HEVC-B dataset

Row I depicts the compression performance of the unquantized model. Row III depicts the result of quantizing the activations with known grids to 8-bit integers. These activations include the pixel-space inputs, the motion vectors, and the rounded symbols (the latents shifted by the mean). This quantization leads to an 18.6% degradation in compression performance, mostly due to flow quantization, and provides an upper bound on quantized performance. Row IV depicts the result of additionally quantizing the weights using per-channel learned quantization grids, which further reduces compression performance to 27.1%. Row V depicts the result of additionally quantizing all activations except those in the latent and hyperlatent bottlenecks, at which point compression performance drops to 135.9%.

In the model ablations, it is shown that overlapped-block warping is a better choice than other kinds of warping for an efficient neural codec. In the quantization ablations, the problems that may arise when quantizing a mean-scale hyperprior compression model are shown, as well as how to circumvent those problems.

Regarding warping, as expected, the model with dense warping (row III) has better R-D performance than the overlapped-block warping model (row I). However, the performance gap is relatively small (at a cost of 6% BD rate), and due to the higher flow dimensionality, the dense warping model has more than 4× the model complexity of the overlapped-block warping model. When the overlapped-block warping model is compared with a vanilla block-based warping scheme without block overlap (row II), there is a 19% drop in compression performance. Alternatively, a flow-agnostic neural network model could be used. For example, such models and their variants use conditional convolutional networks that can warp implicitly (details are shown in FIG. 16 below). As can be seen, the model of FIG. 16, shown in row IV, is inferior to the overlapped-block warping model in terms of both computation and compression performance.

A version of the neural codec 600 with a scale-only prior instead of a mean-scale prior has significantly reduced compression performance (a 9.6% BD-rate increase) while yielding only a minimal efficiency gain (see row VI).

The effect of the second training phase is shown in row VII. Row VII shows the effect of training the neural codec 600 (row I) for an additional 250 steps according to the auxiliary losses described in this disclosure.

In the post-training quantization (PTQ) phase, the neural codec 600 may learn the quantizers by passing a small amount of data through the network and updating the quantizer bin widths and zero points using a mean squared error (MSE) loss on the quantized weights or activations. To enhance performance, the neural codec 600 may follow the PTQ phase with a quantization-aware training (QAT) phase, in which the neural codec 600 uses gradient descent to update both the network and the quantizer parameters.
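As a rough illustration of the PTQ step, the sketch below fits an int8 quantizer scale to a batch of calibration tensors by grid-searching the scale that minimizes quantization MSE. The function name, the grid-search parameterization, and the collect_activations helper are illustrative assumptions, not the codec's actual implementation.

```python
import numpy as np

def fit_int8_quantizer(samples: np.ndarray, n_grid: int = 100) -> float:
    """Fit an int8 quantizer scale to calibration samples by minimizing the
    MSE between the samples and their quantize-dequantize reconstruction."""
    max_abs = np.abs(samples).max()
    best_scale, best_mse = None, np.inf
    for frac in np.linspace(0.5, 1.0, n_grid):
        scale = (frac * max_abs) / 127.0
        q = np.clip(np.round(samples / scale), -128, 127)
        mse = np.mean((q * scale - samples) ** 2)
        if mse < best_mse:
            best_scale, best_mse = scale, mse
    return best_scale

# Usage: pass a small batch through the network, then fit each tensor:
# calib = collect_activations(...)  # hypothetical calibration helper
# scale = fit_int8_quantizer(calib)
```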

FIG. 13 is a table depicting different example hyperparameters for the different training phases in accordance with one or more aspects of this disclosure. Examples of the hyperparameters for the different training phases can be found in FIG. 13. In some examples, the AIMET toolkit may be used to quantize the neural codec 600.

The loss functions are now discussed. The rate loss of the neural codec 600 may be the sum of the bitrates of the latents and hyper-latents of each of the three autoencoders of the neural codec 600 discussed above. Note that, unlike other hyperprior models, the hyperprior model of this disclosure may use a mean-scale hyperprior for the entropy model on the latents, and a zero-mean normal distribution with a learned variance, rather than a non-parametric distribution, for the entropy model on the hyper-latents, in order to ease entropy coding.

For example, the neural codec 600 may reweight the distortion losses of the Y:U:V channels with weights 6:1:1, so that the distortion loss is aligned with the evaluation metric:

D(x, x̂) = (6·MSE_Y(x, x̂) + MSE_U(x, x̂) + MSE_V(x, x̂)) / 8
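A minimal sketch of this weighting, assuming frames stored as per-channel tensors in a dict (an assumed layout, not the codec's actual data structure):

```python
import torch
import torch.nn.functional as F

def yuv_611_distortion(x: dict, x_hat: dict) -> torch.Tensor:
    """D(x, x_hat) = (6*MSE_Y + MSE_U + MSE_V) / 8, matching the 6:1:1
    channel weighting of the evaluation metric."""
    return (6 * F.mse_loss(x_hat["y"], x["y"])
            + F.mse_loss(x_hat["u"], x["u"])
            + F.mse_loss(x_hat["v"], x["v"])) / 8.0
```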

One challenge when training small models at lower bitrates is that frame quality may deteriorate over time due to error accumulation. The neural codec 600 can avoid this problem in two ways. First, the neural codec 600 may halve the value of the rate-loss multiplier for I frames, such that the PSNR values of the selected operating points for I frames and P frames become more similar to those of previous codecs. Second, the neural codec 600 may use an exponentially modulated P-frame loss, in which P frames farther away from the I frame incur a higher penalty, for example:

D_mod(x, x̂, τ) = Σ_i τ^i · D(x_i, x̂_i)
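A sketch of this modulated loss, reusing the yuv_611_distortion helper above; the exact modulation and normalization used by the codec are assumptions here:

```python
def modulated_p_frame_distortion(frames, recons, tau: float):
    """Sum of per-frame distortions with exponentially growing weights, so
    P frames farther from the I frame are penalized more (tau > 1)."""
    return sum(tau ** i * yuv_611_distortion(x, x_hat)
               for i, (x, x_hat) in enumerate(zip(frames, recons), start=1))
```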

In addition, the neural codec 600 may use auxiliary losses during the first phase of training to force the network to learn meaningful extrapolated and reconstructed flow vectors (f_P, f̂). These losses may consist of the YUV 6:1:1 mean squared error between the original frame and the previous reconstruction warped by the motion field:

D_flow(f, x̂_{t-1}, x_t) = D(warp(x̂_{t-1}, f), x_t).
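Sketched the same way, with warp_fn standing in for the codec's warping operator:

```python
def flow_auxiliary_loss(flow, prev_recon, x_t, warp_fn):
    """D_flow: YUV 6:1:1 MSE between the current frame and the previous
    reconstruction warped by the given motion field."""
    return yuv_611_distortion(x_t, warp_fn(prev_recon, flow))
```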

The neural codec 600 may use rounding of the latents and hyper-latents at evaluation time. During training, the neural codec 600 may use additive noise to estimate the rate loss, and rounding on the latent and hyper-latent paths that are fed into the decoder (e.g., video decoder 300).
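A common way to realize this is the uniform-noise proxy sketched below; the entropy_model interface (returning per-element likelihoods) is an assumption made for illustration:

```python
import torch

def rate_proxy_and_decoder_input(latent, entropy_model):
    """Differentiable rate estimate via additive uniform noise in [-0.5, 0.5],
    plus a rounded (straight-through) copy of the latent for the decoder path."""
    noisy = latent + torch.empty_like(latent).uniform_(-0.5, 0.5)
    bits = -torch.log2(entropy_model(noisy)).sum()
    rounded = latent + (torch.round(latent) - latent).detach()  # STE rounding
    return bits, rounded
```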

The final loss of the neural codec 600 may consist of a weighted combination of all loss terms:

L(x) = βR(x_1) + D(x_1, x̂_1) + 2βR(x_{>1}) + D_mod(x_{>1}, x̂_{>1}, τ) + λD_flow(f_P) + λD_flow(f̂).
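Putting the pieces together, a hedged sketch of the total objective, with the rates and flow_losses dictionaries assumed to be produced elsewhere by the codec:

```python
def total_loss(x, x_hat, rates, flow_losses, beta, lam, tau):
    """L = beta*R(x_1) + D(x_1, x_hat_1) + 2*beta*R(x_>1)
        + D_mod(x_>1, x_hat_>1, tau) + lam*(D_flow(f_P) + D_flow(f_hat))."""
    loss = beta * rates["i_frame"] + yuv_611_distortion(x[0], x_hat[0])
    loss = loss + 2 * beta * rates["p_frames"]
    loss = loss + modulated_p_frame_distortion(x[1:], x_hat[1:], tau)
    loss = loss + lam * (flow_losses["extrapolated"] + flow_losses["reconstructed"])
    return loss
```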

The neural codec 600 may include a newly trained model for each value of β, and different values of λ and τ may be used for the different training phases (see FIG. 13).

Experiments were conducted on the neural codec 600. Training of the neural codec 600 can be divided into four phases: the first two phases consist of training the floating-point model end-to-end on a rate-distortion loss, and differ only in their hyperparameters. The third phase may consist of post-training quantization (PTQ), in which the quantizers are fitted while the model parameters are kept fixed. The final quantization-aware training (QAT) phase may consist of fine-tuning both the model parameters and the quantization parameters of the quantized model using a straight-through estimator. The hyperparameters of the various training phases can be found in FIG. 13.

To evaluate compression performance, the peak signal-to-noise ratio (PSNR) is computed separately on the Y, U, and V channels. Following common evaluation protocol, the PSNR is averaged over the Y:U:V channels with weights 6:1:1. The Bjøntegaard-delta bitrate (BD-rate) is used to summarize rate-distortion performance in a single metric. To compute this metric, the bits-per-pixel (bpp) and YUV 6:1:1 PSNR metrics are used. After interpolating each curve, all points with a bitrate below 0.25 bpp are selected, and points outside the distortion range defined by the intersection of the supports of all R-D curves are discarded for a fair comparison.
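For reference, a standard Bjøntegaard-delta bitrate computation looks roughly as follows (cubic fit of log-rate as a function of quality, integrated over the overlapping quality range). This is the conventional formulation, not code from the codec itself:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Average percentage rate difference of the test curve vs. the anchor,
    over the PSNR range where both R-D curves are supported."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100.0
```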

FIG. 14 is a table depicting example model complexities of various neural codecs in accordance with one or more aspects of this disclosure. FLOP counts are measured using DeepSpeed. Because different packages, such as ptflops, are known to produce different FLOP counts, the FLOP counts for the prior codecs have been recomputed with the same package wherever possible. Details can be found in FIG. 14. For the hardware-based results, the encoder (e.g., video encoder 200) and decoder (e.g., video decoder 300) algorithms are run on a device (e.g., computing device 402). The size of the entropy-coded bitstream is measured to compute the rate, and the PSNR is computed on the decoded video. The rate-distortion results of the AIMET simulation are comparable to the numbers reported for another neural codec.

FIG. 14 highlights the model complexity of the neural codec 600 and of other codecs, broken down into subnetworks. Because the neural codec 600 is optimized for receiver-side inference speed, the receiver of the neural codec 600 requires only 25% of the kMACs of the transmitter of the neural codec 600, whereas for the other models this figure is 70-80%. In addition, it can be seen that, owing to the low-dimensional flow vectors of the neural codec 600, all components of the neural codec 600 that process motion have very low complexity compared to the other codecs. Finally, the motion autoencoder of the neural codec 600 is also more efficient on the receiver side because only the Y channel is used. Note that the warping operation itself is not included in the MAC count; however, as shown and described above with respect to FIG. 6, warping can be executed in parallel with the neural network and therefore incurs no additional inference time.

The neural codec 600 is compared against neural video compression techniques that report YUV performance but do not report performance on full video sequences. In such cases, those techniques are re-evaluated where possible. The neural codec technique for which mobile-device implementation results are reported was trained for RGB. Because that model was originally trained for RGB, it is adapted by fine-tuning it on a YUV 6:1:1 R-D loss. H.265 and H.264 implementations are also tested. For a fair comparison, B frames are not used.

The neural codec 600 is trained on Vimeo90k, and the Xiph5N dataset is used for validation and early stopping. The neural codec 600 is evaluated on a number of standard video compression benchmarks, including the HEVC-B test sequences, the UVG-1k sequences, and finally the MCL-JCV sequences.

FIG. 15 is a table depicting BD-rate performance results of various example neural codecs in accordance with one or more aspects of this disclosure. As noted above, the neural codec 600 may be denoted qodec_int8 for the quantized version of the neural codec 600 and qodec_fp32 for the floating-point version of the neural codec 600.

FIG. 16 is a conceptual diagram illustrating the model architecture of a flow-agnostic model in accordance with the techniques of this disclosure. The architecture 1000 of FIG. 16 is referenced above with respect to FIG. 12, and represents a flow-agnostic neural codec that performs implicit warping.

FIG. 17 is a flowchart illustrating an example decoding technique in accordance with one or more aspects of this disclosure. The technique of FIG. 17 is described with respect to FIGS. 2-4 above, but may be practiced by any device or devices capable of doing so. The one or more processors 440 may entropy decode, in parallel, encoded video data from a received bitstream to generate entropy-decoded data (1100). For example, the one or more processors 440 and/or the GPU 412 of the entropy decoder 506 may entropy decode, in parallel, the encoded video data from the received bitstream to generate the entropy-decoded data.

The one or more processors 440 may predict a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector (1102). For example, the one or more processors 440 may predict the block-based motion vector based on the entropy-decoded data to generate the predicted motion vector, which may be a block-based predicted motion vector. The one or more processors 440 may decode a motion vector residual from the entropy-decoded data (1104). For example, the one or more processors 440 may execute the neural network model 422 to decode the motion vector residual δf_t from the entropy-decoded data. The one or more processors 440 may add the motion vector residual δf_t to the predicted motion vector to generate a block-based motion vector (1106). For example, the one or more processors 440 may add the motion vector residual δf_t to the predicted motion vector using additive flow prediction (addition 630) to generate the block-based motion vector. The one or more processors 440 may warp previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data (1108). For example, the one or more processors 440 may apply the overlapped-block-based warping function (e.g., warp 612 of FIG. 4) to the previously reconstructed video data using the block-based motion vector.

The one or more processors 440 may sum the predicted current video data with a residual block to generate currently reconstructed video data (1110). For example, the one or more processors 440 may add the predicted current video data to the residual block to generate the currently reconstructed video data.
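The flow of operations 1100-1110 can be summarized in pseudocode. Every method on the hypothetical codec object below stands in for the networks and entropy models described above, so this is a sketch of the data flow, not the actual implementation:

```python
def decode_p_frame(bitstream, prev_recon, codec):
    """End-to-end sketch of the P-frame decode path of FIG. 17."""
    data = codec.parallel_entropy_decode(bitstream)            # (1100)
    f_pred = codec.predict_block_flow(data)                    # (1102)
    delta_f = codec.decode_flow_residual(data)                 # (1104)
    f_block = f_pred + delta_f                                 # (1106)
    x_pred = codec.overlapped_block_warp(prev_recon, f_block)  # (1108)
    residual = codec.decode_residual(data)
    return x_pred + residual                                   # (1110)
```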

In some examples, as part of decoding the motion vector residual, the one or more processors 440 are configured to decode a pixel-based motion vector residual using the neural network model 422. In some examples, the neural network model 422 is quantization-aware trained.

In some examples, the overlapped-block-based warping function (e.g., warp 612) is configured to warp a block of the previously reconstructed video data multiple times, using respective motion vectors of respective surrounding blocks, to generate warp results, and to average the warp results using attenuation. In some examples, as part of entropy decoding the encoded video data in parallel, the one or more processors 440 are configured to entropy decode the encoded video data in parallel with at least one graphics processing unit 412. In some examples, as part of warping the previously reconstructed video data, the one or more processors 440 are configured to perform block-based frame-interpolation (FINT) warping on the previously reconstructed video data. In some examples, as part of performing block-based FINT warping on the previously reconstructed video data, the one or more processors are configured to FINT warp the previously reconstructed video data using the FINT core 432. In some examples, the FINT core 432 is implemented in the neural network signal processor 430.
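A minimal sketch of such an overlapped-block warp in PyTorch is shown below. Each pixel is warped once under the motion vector of each of its four nearest blocks, and the four warps are blended with bilinear weights that attenuate toward block borders. The window shape and parameterization are assumptions for illustration; the exact form used by warp 612 may differ.

```python
import torch
import torch.nn.functional as F

def _warp(frame, flow):
    """Bilinear backward warp of frame (B,C,H,W) by a dense pixel flow (B,2,H,W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype), indexing="ij")
    gx = 2 * (xs + flow[:, 0]) / (w - 1) - 1   # normalized sample x
    gy = 2 * (ys + flow[:, 1]) / (h - 1) - 1   # normalized sample y
    return F.grid_sample(frame, torch.stack((gx, gy), -1), align_corners=True)

def overlapped_block_warp(frame, block_flow, block=16):
    """Warp under the four nearest blocks' MVs, blend with a bilinear window."""
    b, _, h, w = frame.shape
    hb, wb = block_flow.shape[-2:]
    yc = (torch.arange(h, device=frame.device, dtype=frame.dtype) + 0.5) / block - 0.5
    xc = (torch.arange(w, device=frame.device, dtype=frame.dtype) + 0.5) / block - 0.5
    y0, x0 = yc.floor().long(), xc.floor().long()
    fy, fx = yc - y0.to(yc.dtype), xc - x0.to(xc.dtype)
    out = torch.zeros_like(frame)
    for dy in (0, 1):
        for dx in (0, 1):
            yi = (y0 + dy).clamp(0, hb - 1)
            xi = (x0 + dx).clamp(0, wb - 1)
            flow_k = block_flow[:, :, yi][:, :, :, xi]  # neighbor's MV everywhere
            wgt = (fy if dy else 1 - fy)[:, None] * (fx if dx else 1 - fx)[None, :]
            out = out + wgt * _warp(frame, flow_k)
    return out
```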

In some examples, the encoded video data represents YUV420 video data, and the currently reconstructed video data includes YUV420 video data. In some examples, the one or more processors 440 are configured to quantize at least a portion of the entropy-decoded data. In some examples, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors 440 are configured to quantize at least one of latents, means, or scales. In some examples, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors 440 are configured to quantize at least the portion of the entropy-decoded data using int8.

In some examples, the one or more processors 440 are further configured to apply the flow extrapolator 610 to the entropy-decoded data to generate an extrapolated flow. In some examples, the one or more processors 440 are further configured to perform additive flow prediction 630 using the extrapolated flow. In some examples, the encoded video data includes luma data.

FIG. 18 is a table depicting inference speed and rate overhead for different parallelization techniques in accordance with one or more aspects of this disclosure. In some examples, the one or more processors 440 may use one of the settings set forth in FIG. 18, such as 512 threads on the GPU 412.

FIG. 19 is a conceptual diagram illustrating an example model architecture of the neural networks of the P-frame model in accordance with one or more aspects of this disclosure. The example of FIG. 19 may represent the model architecture of the neural network model 422. Convolutional layers are shown as k × k c, where k refers to the kernel size and c refers to the number of output channels. A convolution with stride s is denoted ↓ s, and a transposed convolution with stride s is denoted ↑ s.
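To make the notation concrete, a layer written as "5 × 5 64 ↓ 2" in FIG. 19 would correspond to a strided convolution such as the following PyTorch modules; the input channel counts here are placeholders, not values from FIG. 19:

```python
import torch.nn as nn

conv_down = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2)   # 5x5 64, ↓2
conv_up = nn.ConvTranspose2d(64, 3, kernel_size=5, stride=2,
                             padding=2, output_padding=1)          # 5x5 3, ↑2
```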

FIG. 20 is a graph showing test results of the model ablations and the quantization ablations in accordance with one or more aspects of this disclosure. The model ablation plot 1200 is described in more detail in the table of FIG. 8. The quantization ablation plot 1202 is described in more detail in the table of FIG. 12.

FIG. 21 is a graph showing the rate-distortion performance of different neural video codecs on UVG and MCL.

Aspects of the techniques of this disclosure include the following clauses.

Clause 1A. A method of decoding video data, the method comprising: entropy decoding encoded video data from a received bitstream to generate entropy-decoded data; decoding the entropy-decoded data using a neural network model to generate motion vectors and a residual; warping the motion vectors with a warping function to generate warped motion vectors; and summing the warped motion vectors with the residual to generate reconstructed video data.

Clause 2A. The method of clause 1A, wherein the motion vectors comprise motion vectors for blocks of the video data.

Clause 3A. The method of clause 1A or clause 2A, wherein entropy decoding the encoded video data comprises entropy decoding the encoded video data in parallel.

Clause 4A. The method of clause 3A, wherein entropy decoding the encoded video data in parallel comprises entropy decoding the encoded video data in parallel with a graphics processing unit.

Clause 5A. The method of any of clauses 2A-4A, wherein warping the motion vectors comprises block-based frame-interpolation (FINT) warping the motion vectors.

Clause 6A. The method of clause 5A, wherein block-based FINT warping the motion vectors comprises FINT warping the motion vectors using a FINT core.

Clause 7A. The method of clause 6A, wherein the FINT core is implemented in a neural network signal processor.

Clause 8A. The method of any of clauses 1A-7A, wherein the encoded video data represents YUV420 video data and the reconstructed video comprises YUV420 video data.

Clause 9A. The method of any of clauses 1A-8A, wherein decoding the entropy-decoded data using the neural network comprises quantizing the entropy-decoded data.

Clause 10A. The method of clause 9A, wherein quantizing the entropy-decoded data comprises quantizing at least one of latents, means, or scales.

Clause 11A. The method of clause 9A or clause 10A, wherein quantizing the entropy-decoded data comprises quantizing at least a portion of the entropy-decoded data using int8.

Clause 12A. The method of any of clauses 9A-11A, wherein the neural network model is quantization-aware trained.

Clause 13A. The method of any of clauses 1A-12A, further comprising applying a flow extrapolator to the entropy-decoded data to generate an extrapolated flow.

Clause 14A. The method of clause 13A, further comprising performing additive flow prediction using the extrapolated flow.

Clause 15A. The method of any of clauses 1A-14A, wherein the video data comprises luma data.

Clause 16A. A method of decoding video data, the method comprising: decoding a hyper-latent; extrapolating a first flow; decoding a mean and a scale from the hyper-latent; decoding a latent based on the scale; reconstructing a second flow and a residual based on the latent and the first flow; warping a previously reconstructed frame using the second flow to generate a warped frame; and adding the residual to the warped frame.

Clause 17A. A method of encoding video data that is decoded using the method of any of clauses 1A-16A.

Clause 18A. A device for coding video data, the device comprising: memory configured to store the video data; and one or more processors configured to perform the method of any of clauses 1A-17A.

Clause 19A. A computer-readable storage medium storing instructions that, when executed, cause one or more processors to perform the method of any of clauses 1A-17A.

Clause 20A. A device for coding video data, the device comprising one or more means for performing the method of any of clauses 1A-17A.

Clause 1B. A device for coding video data, the device comprising: memory configured to store the video data, the video data comprising previously reconstructed video data and currently reconstructed video data; and one or more processors configured to: entropy decode, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; predict a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; decode a motion vector residual from the entropy-decoded data; add the motion vector residual to the predicted motion vector to generate the block-based motion vector; warp the previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and sum the predicted current video data with a residual block to generate the currently reconstructed video data.

Clause 2B. The device of clause 1B, wherein, as part of decoding the motion vector residual, the one or more processors are configured to decode a pixel-based motion vector residual using a neural network model.

Clause 3B. The device of clause 2B, wherein the neural network model is quantization-aware trained.

Clause 4B. The device of any of clauses 1B-3B, wherein the overlapped-block-based warping function is configured to: warp a block of the previously reconstructed video data multiple times using respective motion vectors of respective surrounding blocks to generate warp results; and average the warp results using attenuation.

Clause 5B. The device of any of clauses 1B-4B, wherein, as part of entropy decoding the encoded video data in parallel, the one or more processors are configured to entropy decode the encoded video data in parallel with at least one graphics processing unit.

Clause 6B. The device of any of clauses 1B-5B, wherein, as part of warping the previously reconstructed video data, the one or more processors are configured to block-based frame-interpolation (FINT) warp the previously reconstructed video data.

Clause 7B. The device of clause 6B, wherein, as part of block-based FINT warping the previously reconstructed video data, the one or more processors are configured to FINT warp the previously reconstructed video data using a FINT core.

Clause 8B. The device of clause 7B, wherein the FINT core is implemented in a neural network signal processor.

Clause 9B. The device of any of clauses 1B-8B, wherein the encoded video data represents YUV420 video data and the currently reconstructed video data comprises YUV420 video data.

Clause 10B. The device of any of clauses 1B-9B, wherein the one or more processors are further configured to quantize at least a portion of the entropy-decoded data.

Clause 11B. The device of clause 10B, wherein, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors are configured to quantize at least one of latents, means, or scales.

Clause 12B. The device of clause 10B or clause 11B, wherein, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors are configured to quantize at least the portion of the entropy-decoded data using int8.

Clause 13B. The device of any of clauses 1B-12B, wherein the one or more processors are further configured to apply a flow extrapolator to the entropy-decoded data to generate an extrapolated flow.

Clause 14B. The device of clause 13B, wherein the one or more processors are further configured to perform additive flow prediction using the extrapolated flow.

Clause 15B. The device of any of clauses 1B-14B, wherein the encoded video data comprises luma data.

Clause 16B. A method of decoding video data, the method comprising: entropy decoding, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; predicting a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; decoding a motion vector residual from the entropy-decoded data; adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; warping previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and summing the predicted current video data with a residual block to generate currently reconstructed video data.

Clause 17B. The method of clause 16B, wherein decoding the motion vector residual comprises decoding a pixel-based motion vector residual using a neural network model.

Clause 18B. The method of clause 17B, wherein the neural network model is quantization-aware trained.

Clause 19B. The method of any of clauses 16B-18B, wherein warping the previously reconstructed video data with the overlapped-block-based warping function comprises: warping a block of the previously reconstructed video data multiple times using respective motion vectors of respective surrounding blocks to generate warp results; and averaging the warp results using attenuation.

Clause 20B. The method of any of clauses 16B-19B, wherein entropy decoding the encoded video data in parallel comprises entropy decoding the encoded video data in parallel with at least one graphics processing unit.

Clause 21B. The method of any of clauses 16B-20B, wherein warping the previously reconstructed video data comprises block-based frame-interpolation (FINT) warping the previously reconstructed video data.

Clause 22B. The method of clause 21B, wherein block-based FINT warping the previously reconstructed video data comprises FINT warping the previously reconstructed video data using a FINT core.

Clause 23B. The method of clause 22B, wherein the FINT core is implemented in a neural network signal processor.

Clause 24B. The method of any of clauses 16B-23B, wherein the encoded video data represents YUV420 video data and the reconstructed video comprises YUV420 video data.

Clause 25B. The method of any of clauses 16B-24B, further comprising quantizing at least a portion of the entropy-decoded data.

Clause 26B. The method of clause 25B, wherein quantizing at least the portion of the entropy-decoded data comprises quantizing at least one of latents, means, or scales.

Clause 27B. The method of clause 25B or clause 26B, wherein quantizing at least the portion of the entropy-decoded data comprises quantizing at least the portion of the entropy-decoded data using int8.

Clause 28B. The method of any of clauses 16B-27B, further comprising applying a flow extrapolator to the entropy-decoded data to generate an extrapolated flow.

Clause 29B. The method of clause 28B, further comprising performing additive flow prediction using the extrapolated flow.

Clause 30B. The method of any of clauses 16B-29B, wherein the encoded video data comprises luma data.

Clause 31B. A device for coding video data, the device comprising: means for entropy decoding, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; means for predicting a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; means for decoding a motion vector residual from the entropy-decoded data; means for adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; means for warping previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and means for summing the predicted current video data with a residual block to generate currently reconstructed video data.

Clause 32B. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: entropy decode, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; predict a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; decode a motion vector residual from the entropy-decoded data; add the motion vector residual to the predicted motion vector to generate the block-based motion vector; warp previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and sum the predicted current video data with a residual block to generate currently reconstructed video data.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to a tangible medium such as data storage media. In this manner, computer-readable media generally may correspond to tangible computer-readable storage media that are non-transitory. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.

By way of example, and not limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. It should be understood that computer-readable storage media and data storage media do not include carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC), or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

This disclosure also includes the attached appendix, which forms part of this disclosure and is expressly incorporated herein. The techniques disclosed in the appendix may be performed together with, or separately from, the techniques disclosed herein.

Various examples have been described. These and other examples are within the scope of the appended claims.

100: system
102: source device
104: video source
106: memory
108: output interface
110: computer-readable medium
112: storage device
114: file server
116: destination device
118: display device
120: memory
122: input interface
200: video encoder
300: video decoder
402: computing device
404: user input interface
406: CPU
408: memory controller
410: system memory
412: graphics processing unit (GPU)
414: local memory
416: display interface
418: display
420: bus
422: neural network model
424: camera
430: neural network signal processor (NSP)
432: frame interpolation (FINT) core
500: ground truth
502: I-frame encoder (IFE)
504: entropy encoder (EE)
506: entropy decoder (ED)
508: I-frame decoder (IFD)
510: reconstructed frame (recon)
512: warp function (warp)/warp
514: GT
516: motion estimation (ME)
518: P-frame encoder (PFE)
520: P-frame decoder (PFD)
522: motion vector (MV)
524: block residual (resid)
528: recon
530: output
532: GT
600: neural codec
602: transmitter
604: receiver
610: flow extrapolator/flow extrapolator network
612: warp/warp operator
614: summation
620: residual autoencoder/autoencoder
630: addition
640: Y channel
650: flow autoencoder
700: pipeline
800: function
900: pipeline/model complexity analysis
950: rate-distortion plot
1000: architecture
1100, 1102, 1104, 1106, 1108, 1110: steps
1200: model ablation plot
1202: quantization ablation plot

FIG. 1 is a block diagram illustrating an example media encoding and decoding system that may perform the techniques of this disclosure.

FIG. 2 is a block diagram illustrating an example computing device that may perform the techniques of this disclosure.

FIG. 3 is a conceptual diagram illustrating an example of end-to-end deep learning for a neural video coder in accordance with one or more aspects of this disclosure.

FIG. 4 is an architecture diagram illustrating an example of a neural video coder in accordance with one or more aspects of this disclosure.

FIG. 5 is a conceptual diagram illustrating an example of block-based warping in accordance with one or more aspects of this disclosure.

FIG. 6 is an architecture diagram illustrating an example of pipelined inference in video decoder 300.

FIG. 7 is a conceptual diagram of an example neural video coder in accordance with one or more aspects of this disclosure.

FIG. 8 is a table depicting example results of neural coder model architecture ablations in accordance with one or more aspects of this disclosure.

FIG. 9 is a conceptual diagram illustrating example techniques that may be implemented in an example video encoder in accordance with one or more aspects of this disclosure.

FIG. 10 is a conceptual diagram illustrating an example of an on-device neural video coder in accordance with one or more aspects of this disclosure.

FIG. 11 is a set of graphs comparing the compression performance and computation, as well as the rate-distortion performance, of various techniques with the techniques of this disclosure.

FIG. 12 is a table depicting example BD-rate performance results for various quantizations of an example neural coder in accordance with one or more aspects of this disclosure.

FIG. 13 is a table depicting different example hyperparameters for the different training phases in accordance with one or more aspects of this disclosure.

FIG. 14 is a table depicting example model complexities of various neural coders in accordance with one or more aspects of this disclosure.

FIG. 15 is a table depicting BD-rate performance results of various example neural coders in accordance with one or more aspects of this disclosure.

FIG. 16 is a conceptual diagram illustrating the model architecture of a flow-agnostic model in accordance with the techniques of this disclosure.

FIG. 17 is a flowchart illustrating an example decoding technique in accordance with one or more aspects of this disclosure.

FIG. 18 is a table depicting inference speed and rate overhead for different parallelization techniques in accordance with one or more aspects of this disclosure.

FIG. 19 is a conceptual diagram illustrating an example model architecture of the neural networks of the P-frame model in accordance with one or more aspects of this disclosure.

FIG. 20 is a graph showing test results of the model ablations and the quantization ablations in accordance with one or more aspects of this disclosure.

FIG. 21 is a graph showing the rate-distortion performance of different neural video coders on UVG and MCL.

600: neural codec
602: transmitter
604: receiver
610: flow extrapolator/flow extrapolator network
612: warp/warp operator
614: summation
620: residual autoencoder/autoencoder
630: addition
640: Y channel
650: flow autoencoder

Claims (32)

1. A device for coding video data, the device comprising: memory configured to store the video data, the video data comprising previously reconstructed video data and currently reconstructed video data; and one or more processors configured to: entropy decode, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; predict a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; decode a motion vector residual from the entropy-decoded data; add the motion vector residual to the predicted motion vector to generate the block-based motion vector; warp the previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and sum the predicted current video data with a residual block to generate the currently reconstructed video data. 2. The device of claim 1, wherein, as part of decoding the motion vector residual, the one or more processors are configured to decode a pixel-based motion vector residual using a neural network model. 3. The device of claim 2, wherein the neural network model is quantization-aware trained. 4. The device of claim 1, wherein the overlapped-block-based warping function is configured to: warp a block of the previously reconstructed video data multiple times using respective motion vectors of respective surrounding blocks to generate warp results; and average the warp results using attenuation. 5. The device of claim 1, wherein, as part of entropy decoding the encoded video data in parallel, the one or more processors are configured to entropy decode the encoded video data in parallel with at least one graphics processing unit. 6. The device of claim 1, wherein, as part of warping the previously reconstructed video data, the one or more processors are configured to block-based frame-interpolation (FINT) warp the previously reconstructed video data. 7. The device of claim 6, wherein, as part of block-based FINT warping the previously reconstructed video data, the one or more processors are configured to FINT warp the previously reconstructed video data using a FINT core. 8. The device of claim 7, wherein the FINT core is implemented in a neural network signal processor. 9. The device of claim 1, wherein the encoded video data represents YUV420 video data and the currently reconstructed video data comprises YUV420 video data.
10. The device of claim 1, wherein the one or more processors are further configured to quantize at least a portion of the entropy-decoded data. 11. The device of claim 10, wherein, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors are configured to quantize at least one of latents, means, or scales. 12. The device of claim 10, wherein, as part of quantizing at least the portion of the entropy-decoded data, the one or more processors are configured to quantize at least the portion of the entropy-decoded data using int8. 13. The device of claim 1, wherein the one or more processors are further configured to apply a flow extrapolator to the entropy-decoded data to generate an extrapolated flow. 14. The device of claim 13, wherein the one or more processors are further configured to perform additive flow prediction using the extrapolated flow. 15. The device of claim 1, wherein the encoded video data comprises luma data. 16. A method of decoding video data, the method comprising: entropy decoding, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; predicting a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; decoding a motion vector residual from the entropy-decoded data; adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; warping previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and summing the predicted current video data with a residual block to generate currently reconstructed video data. 17. The method of claim 16, wherein decoding the motion vector residual comprises decoding a pixel-based motion vector residual using a neural network model. 18. The method of claim 17, wherein the neural network model is quantization-aware trained. 19. The method of claim 16, wherein warping the previously reconstructed video data with the overlapped-block-based warping function comprises: warping a block of the previously reconstructed video data multiple times using respective motion vectors of respective surrounding blocks to generate warp results; and averaging the warp results using attenuation. 20. The method of claim 16, wherein entropy decoding the encoded video data in parallel comprises entropy decoding the encoded video data in parallel with at least one graphics processing unit.
21. The method of claim 16, wherein warping the previously reconstructed video data comprises block-based frame-interpolation (FINT) warping the previously reconstructed video data. 22. The method of claim 21, wherein block-based FINT warping the previously reconstructed video data comprises FINT warping the previously reconstructed video data using a FINT core. 23. The method of claim 22, wherein the FINT core is implemented in a neural network signal processor. 24. The method of claim 16, wherein the encoded video data represents YUV420 video data and the currently reconstructed video comprises YUV420 video data. 25. The method of claim 16, further comprising quantizing at least a portion of the entropy-decoded data. 26. The method of claim 25, wherein quantizing at least the portion of the entropy-decoded data comprises quantizing at least one of latents, means, or scales. 27. The method of claim 25, wherein quantizing at least the portion of the entropy-decoded data comprises quantizing at least the portion of the entropy-decoded data using int8. 28. The method of claim 16, further comprising applying a flow extrapolator to the entropy-decoded data to generate an extrapolated flow. 29. The method of claim 28, further comprising performing additive flow prediction using the extrapolated flow. 30. The method of claim 16, wherein the encoded video data comprises luma data. 31. A device for coding video data, the device comprising: means for entropy decoding, in parallel, encoded video data from a received bitstream to generate entropy-decoded data; means for predicting a block-based motion vector based on the entropy-decoded data to generate a predicted motion vector; means for decoding a motion vector residual from the entropy-decoded data; means for adding the motion vector residual to the predicted motion vector to generate the block-based motion vector; means for warping previously reconstructed video data with an overlapped-block-based warping function using the block-based motion vector to generate predicted current video data; and means for summing the predicted current video data with a residual block to generate currently reconstructed video data.
32. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to:
    perform parallel entropy decoding of encoded video data from a received bitstream to produce entropy-decoded data;
    predict block-based motion vectors based on the entropy-decoded data to generate predicted motion vectors;
    decode motion vector residuals from the entropy-decoded data;
    add the motion vector residuals to the predicted motion vectors to produce block-based motion vectors;
    warp previously reconstructed video data with an overlapping-block-based warping function, using the block-based motion vectors, to produce predicted current video data; and
    sum the predicted current video data with a residual block to produce current reconstructed video data.
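For illustration only: claims 13-14 and 28-29 recite applying a flow extrapolator to the entropy-decoded data and performing additive flow prediction with the extrapolated flow. One common arrangement, assumed here, is that the extrapolated flow serves as a prediction to which a decoded flow residual is added; the constant-velocity extrapolator and all names below are assumptions (the application's extrapolator may, for instance, be learned).

```python
import numpy as np

def extrapolate_flow(flow_t2, flow_t1):
    """Constant-velocity extrapolation: continue the trend between the two
    most recent reconstructed flow fields into the current frame."""
    return flow_t1 + (flow_t1 - flow_t2)

def additive_flow_prediction(extrapolated_flow, flow_residual):
    """Additive flow prediction: a decoded residual corrects the
    extrapolated flow to give the flow actually used for warping."""
    return extrapolated_flow + flow_residual

rng = np.random.default_rng(0)
f_t2 = rng.normal(size=(2, 64, 64)).astype(np.float32)             # flow at t-2 (dy, dx planes)
f_t1 = rng.normal(size=(2, 64, 64)).astype(np.float32)             # flow at t-1
res = rng.normal(scale=0.1, size=(2, 64, 64)).astype(np.float32)   # decoded flow residual
flow_t = additive_flow_prediction(extrapolate_flow(f_t2, f_t1), res)
```

Because the extrapolated flow is already a good guess whenever motion is steady, only a small residual needs to be signaled, which is what makes the additive formulation attractive for rate savings.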
TW113107964A 2023-03-09 2024-03-05 Efficient warping-based neural video coder TW202446075A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202363489306P 2023-03-09 2023-03-09
US63/489,306 2023-03-09
US202363497411P 2023-04-20 2023-04-20
US63/497,411 2023-04-20
US18/457,079 US12501050B2 (en) 2023-03-09 2023-08-28 Efficient warping-based neural video codec
US18/457,079 2023-08-28

Publications (1)

Publication Number Publication Date
TW202446075A 2024-11-16

Family

ID=90571817

Family Applications (1)

Application Number Title Priority Date Filing Date
TW113107964A TW202446075A (en) 2023-03-09 2024-03-05 Efficient warping-based neural video coder

Country Status (4)

Country Link
KR (1) KR20250155015A (en)
CN (1) CN120731597A (en)
TW (1) TW202446075A (en)
WO (1) WO2024186678A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120034657B (en) * 2025-02-24 2025-11-21 上海交通大学 A variable bit rate 4D Gaussian compression method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4250729A4 (en) * 2021-02-22 2024-05-01 Samsung Electronics Co., Ltd. Ai-based image encoding and decoding apparatus, and method by same

Also Published As

Publication number Publication date
CN120731597A (en) 2025-09-30
KR20250155015A (en) 2025-10-29
WO2024186678A1 (en) 2024-09-12

Similar Documents

Publication Publication Date Title
CN115956363A (en) Content-adaptive online training method and device for post-filtering
US11012718B2 (en) Systems and methods for generating a latent space residual
KR101941955B1 (en) Recursive block partitioning
CN115606179A (en) CNN filter for learning-based downsampling for image and video coding using learned downsampling features
CN116349225B (en) Video decoding method and device, electronic device and storage medium
CN115668952A (en) Content Adaptive Online Training Using Image Replacement in Neural Image Compression
CN117480778A (en) Residual coding and video coding methods, devices, equipment and systems
CN119213777A (en) Content-adaptive online training using scaling factors and/or offsets for neural image compression
CN118872263A (en) Method, device and medium for visual data processing
JP2025535086A (en) Image and video compression using a learned dictionary of implicit neural representations
KR20240161141A (en) Visual data processing method, device and medium
US12375678B2 (en) Methods and non-transitory computer readable storage medium for pre-analysis based resampling compression for machine vision
CN115769576A (en) Block content adaptive online training in neural image compression through post-filtering
WO2023245460A1 (en) Neural network codec with hybrid entropy model and flexible quantization
US12394100B2 (en) Video coding using camera motion compensation and object motion compensation
TW202446075A (en) Efficient warping-based neural video coder
US12501050B2 (en) Efficient warping-based neural video codec
JP2025502448A (en) Data processing method, device and medium
US12177473B2 (en) Video coding using optical flow and residual predictors
US20250008131A1 (en) Lightweight spatial upsampling methods for machine vision
US20250113037A1 (en) Methods and non-transitory computer readable storage medium for adaptive spatial resampling towards machine vision
HK40084466A (en) Method and apparatus for video decoding, computer device, and storage medium
WO2025080750A1 (en) Systems and methods for content adaptive multi-scale feature layer filtering and redundant channel processing
WO2025059287A1 (en) Systems and methods for content adaptive multi-scale feature layer filtering
KR20250043417A (en) Method and system for content-based scaling for artificial intelligence-based in-loop filters