HK1226567B - Methods and systems for reducing latency in video encoding and decoding

Info

Publication number: HK1226567B
Application number: HK16114638.1A
Authority: HK
Inventors: Gary J. Sullivan
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2011-06-30
Filing date: 2016-12-23
Publication date: 2020-01-24

Description

Methods and systems for reducing latency in video encoding and decoding

Background Art

Engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video information by converting the information into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A "codec" is an encoder/decoder system.

Over the past two decades, various video codec standards have been adopted, including the H.261, H.262 (MPEG-2 or ISO/IEC 13818-2), H.263, and H.264 (AVC or ISO/IEC 14496-10) standards, as well as the MPEG-1 (ISO/IEC 11172-2), MPEG-4 Visual (ISO/IEC 14496-2), and SMPTE 421M standards. More recently, the HEVC standard has been under development. A video codec standard typically defines options for the syntax of an encoded video bitstream, detailing the parameters in the bitstream when particular features are used in encoding and decoding. In many cases, a video codec standard also provides details about the decoding operations a decoder should perform to achieve correct results in decoding.

A basic goal of compression is to provide good rate-distortion performance. For a particular bit rate, an encoder attempts to provide the highest quality of video. Or, for a particular level of quality or fidelity to the original video, an encoder attempts to provide the lowest bit rate of encoded video. In practice, depending on the use scenario, considerations such as encoding time, encoding complexity, encoding resources, decoding time, decoding complexity, decoding resources, overall delay, and/or smoothness in playback also affect decisions made during encoding and decoding.

For example, consider use scenarios such as video playback from storage, video playback from encoded data streamed over a network connection, and video transcoding (from one bit rate to another bit rate, or from one standard to another standard). At the encoder side, such applications may permit off-line encoding that is not at all time-sensitive. An encoder can therefore increase encoding time and increase the resources used during encoding to find the most efficient way to compress the video, and thereby improve rate-distortion performance. If a small amount of delay is also acceptable at the decoder side, the encoder can further improve rate-distortion performance, for example, by exploiting inter-picture dependencies from pictures farther ahead in the sequence.

On the other hand, consider use scenarios such as remote desktop conferencing, surveillance video, video telephony, and other real-time communication scenarios. Such applications are time-sensitive. Low latency between recording of input pictures and playback of output pictures is a key factor in performance. When encoding/decoding tools adapted for non-real-time communication are applied in real-time communication scenarios, overall latency is often unacceptably high. The delays that these tools introduce during encoding and decoding may improve performance for regular video playback, but they disrupt real-time communication.

Summary of the Invention

In summary, the detailed description presents techniques and tools for reducing latency in video encoding and decoding. The techniques and tools can reduce latency so as to improve responsiveness in real-time communication. For example, the techniques and tools reduce overall latency by constraining the latency due to reordering of video frames, and by indicating the constraint on frame reordering latency with one or more syntax elements that accompany the encoded data for the video frames.

According to one aspect of the techniques and tools described herein, a tool such as a video encoder, a real-time communication tool with a video encoder, or another tool sets one or more syntax elements that indicate a constraint on latency (for example, a constraint on frame reordering latency consistent with inter-frame dependencies between multiple frames of a video sequence). The tool outputs the syntax element(s), thereby facilitating simpler and quicker determination of when reconstructed frames are ready for output in terms of the output order of the frames.

According to another aspect of the techniques and tools described herein, a tool such as a video decoder, a real-time communication tool with a video decoder, or another tool receives and parses one or more syntax elements that indicate a constraint on latency (for example, a constraint on frame reordering latency). The tool also receives encoded data for multiple frames of a video sequence. At least some of the encoded data is decoded to reconstruct one of the frames. The tool can determine the constraint on latency based on the syntax element(s), then use the constraint on latency to determine when a reconstructed frame is ready for output (in terms of output order). The tool outputs the reconstructed frame.

The foregoing and other objects, features, and advantages of the invention will become more apparent from the following detailed description, which proceeds with reference to the accompanying drawings.

Brief Description of the Drawings

FIG. 1 is a diagram of an example computing system in which some described embodiments can be implemented.

FIGS. 2a and 2b are diagrams of example network environments in which some described embodiments can be implemented.

FIG. 3 is a diagram of an example encoder system in conjunction with which some described embodiments can be implemented.

FIG. 4 is a diagram of an example decoder system in conjunction with which some described embodiments can be implemented.

FIGS. 5a-5e are diagrams showing the coded order and output order of frames in several example series.

FIG. 6 is a flowchart showing an example technique for setting and outputting one or more syntax elements that indicate a constraint on latency.

FIG. 7 is a flowchart showing an example technique for reduced-latency decoding.

Detailed Description

The detailed description presents techniques and tools for reducing latency in video encoding and decoding. The techniques and tools can help reduce latency so as to improve responsiveness in real-time communication.

In video encoding/decoding scenarios, some delay is inevitable between the time an input video frame is received and the time the frame is played back. The frame is encoded by an encoder, delivered to a decoder, and decoded by the decoder, and some amount of latency is caused by practical limitations on encoding resources, decoding resources, and/or network bandwidth. Other latency is avoidable, however. For example, an encoder and decoder may introduce latency in order to improve rate-distortion performance (for example, to exploit inter-frame dependencies from pictures farther ahead in the sequence). Such latency can be reduced, although there may be a penalty in terms of rate-distortion performance, processor utilization, or playback smoothness.

With the techniques and tools described herein, latency is reduced by constraining latency (and hence limiting the temporal extent of inter-frame dependencies) and indicating the constraint on latency to a decoder. For example, the constraint on latency is a constraint on frame reordering latency. Alternatively, the constraint on latency is expressed in seconds, milliseconds, or another time measure. The decoder can then determine the constraint on latency and use the constraint when determining which frames are ready for output. In this way, delay can be reduced for remote desktop conferencing, video telephony, video surveillance, web camera video, and other real-time communication applications.
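As a rough illustration of how a decoder might use such a constraint, the Python sketch below first measures the frame reordering latency of a sequence, then releases frames from a decoder-side holding buffer as soon as a signaled limit guarantees that no frame with an earlier output position can still arrive. The function names, the use of output-order indices as frame identifiers, and the particular definition of the constraint (the largest number of frames that precede a frame in coded order but follow it in output order) are assumptions made for illustration only; they are not syntax taken from H.264, HEVC, or any other standard.

```python
import heapq


def reorder_latency(coded_order):
    """Measure frame reordering latency for a sequence.

    coded_order lists each frame's output-order index, in coded order.
    The result is the largest number of frames that precede some frame
    in coded order but follow it in output order (assumed definition).
    """
    worst = 0
    for pos, frame in enumerate(coded_order):
        ahead = sum(1 for earlier in coded_order[:pos] if earlier > frame)
        worst = max(worst, ahead)
    return worst


def output_with_limit(coded_order, max_reorder):
    """Return frames in output order, holding at most max_reorder frames.

    Once more than max_reorder frames are waiting, the waiting frame with
    the smallest output index cannot be preceded by any frame that has not
    arrived yet, so it is ready for output immediately.
    """
    pending = []  # min-heap keyed on output-order index
    released = []
    for frame in coded_order:
        heapq.heappush(pending, frame)
        while len(pending) > max_reorder:
            released.append(heapq.heappop(pending))
    while pending:  # end of sequence: flush whatever is still waiting
        released.append(heapq.heappop(pending))
    return released
```

For a hypothetical coded order `[0, 4, 2, 1, 3]`, `reorder_latency` returns 2, and `output_with_limit([0, 4, 2, 1, 3], 2)` yields the frames in output order while never holding more than two of them. With a limit of 0, every frame is released as soon as it is decoded.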

Some of the innovations described herein are illustrated with reference to syntax elements and operations specific to the H.264 and/or HEVC standards. Such innovations can also be implemented for other standards or formats.

More generally, various alternatives to the examples described herein are possible. Certain techniques described with reference to the flowcharts can be altered by changing the ordering of the stages shown, or by splitting, repeating, or omitting certain stages. The various aspects of latency reduction for video encoding and decoding can be used in combination or separately. Different embodiments use one or more of the described techniques and tools. Some of the techniques and tools described herein address one or more of the problems noted in the background. Typically, a given technique or tool does not solve all such problems.

I. Example Computing System

FIG. 1 illustrates a generalized example of a suitable computing system (100) in which several of the described techniques and tools may be implemented. The computing system (100) is not intended to suggest any limitation as to scope of use or functionality, as the techniques and tools may be implemented in diverse general-purpose or special-purpose computing systems.

With reference to FIG. 1, the computing system (100) includes one or more processing units (110, 115) and memory (120, 125). In FIG. 1, this most basic configuration (130) is enclosed within a dashed line. The processing units (110, 115) execute computer-executable instructions. A processing unit can be a general-purpose central processing unit (CPU), a processor in an application-specific integrated circuit (ASIC), or any other type of processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. For example, FIG. 1 shows a central processing unit (110) as well as a graphics processing unit or co-processing unit (115). The tangible memory (120, 125) may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory), or some combination of the two, accessible by the processing unit(s). The memory (120, 125) stores software (180) implementing one or more innovations for reducing latency in video encoding and decoding, in the form of computer-executable instructions suitable for execution by the processing unit(s).

A computing system may have additional features. For example, the computing system (100) includes storage (140), one or more input devices (150), one or more output devices (160), and one or more communication connections (170). An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing system (100). Typically, operating system software (not shown) provides an operating environment for other software executing in the computing system (100), and coordinates the activities of the components of the computing system (100).

The tangible storage (140) may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, or any other medium that can be used to store information in a non-transitory way and that can be accessed within the computing system (100). The storage (140) stores instructions for the software (180) implementing one or more innovations for latency reduction in video encoding and decoding.

The input device(s) (150) may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing system (100). For video encoding, the input device(s) (150) may be a camera, video card, TV tuner card, or similar device that accepts video input in analog or digital form, or a CD-ROM or CD-RW that reads video samples into the computing system (100). The output device(s) (160) may be a display, printer, speaker, CD writer, or another device that provides output from the computing system (100).

The communication connection(s) (170) enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media can use an electrical, optical, RF, or other carrier.

The techniques and tools can be described in the general context of computer-readable media. Computer-readable media are any available tangible media that can be accessed within a computing environment. By way of example, and not limitation, with the computing system (100), computer-readable media include memory (120, 125), storage (140), and combinations of any of the above.

The techniques and tools can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing system on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing system.

The terms "system" and "device" are used interchangeably herein. Unless the context clearly indicates otherwise, neither term implies any limitation on a type of computing system or computing device. In general, a computing system or computing device can be local or distributed, and can include any combination of special-purpose hardware and/or general-purpose hardware with software implementing the functionality described herein.

For the sake of presentation, the detailed description uses terms like "determine" and "use" to describe computer operations in a computing system. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.

II. Example Network Environments

FIGS. 2a and 2b show example network environments (201, 202) that include video encoders (220) and video decoders (270). The encoders (220) and decoders (270) are connected over a network (250) using an appropriate communication protocol. The network (250) can include the Internet or another computer network.

In the network environment (201) shown in FIG. 2a, each real-time communication ("RTC") tool (210) includes both an encoder (220) and a decoder (270) for bidirectional communication. A given encoder (220) can produce output compliant with the SMPTE 421M standard, the ISO/IEC 14496-10 standard (also known as H.264 or AVC), the HEVC standard, another standard, or a proprietary format, with a corresponding decoder (270) accepting encoded data from the encoder (220). The bidirectional communication can be part of a video conference, video telephone call, or other two-party communication scenario. Although the network environment (201) in FIG. 2a includes two real-time communication tools (210), the network environment (201) can instead include three or more real-time communication tools (210) that participate in multi-party communication.

A real-time communication tool (210) manages encoding by an encoder (220). FIG. 3 shows an example encoder system (300) that can be included in the real-time communication tool (210). Alternatively, the real-time communication tool (210) uses another encoder system. A real-time communication tool (210) also manages decoding by a decoder (270). FIG. 4 shows an example decoder system (400) that can be included in the real-time communication tool (210). Alternatively, the real-time communication tool (210) uses another decoder system.

In the network environment (202) shown in FIG. 2b, an encoding tool (212) includes an encoder (220) that encodes video for delivery to multiple playback tools (214), which include decoders (270). The unidirectional communication can be provided for a video surveillance system, web camera monitoring system, remote desktop conferencing presentation, or other scenario in which video is encoded and sent from one location to one or more other locations. Although the network environment (202) in FIG. 2b includes two playback tools (214), the network environment (202) can include more or fewer playback tools (214). In general, a playback tool (214) communicates with the encoding tool (212) to determine a stream of video for the playback tool (214) to receive. The playback tool (214) receives the stream, buffers the received encoded data for an appropriate period, and begins decoding and playback.

FIG. 3 shows an example encoder system (300) that can be included in the encoding tool (212). Alternatively, the encoding tool (212) uses another encoder system. The encoding tool (212) can also include server-side controller logic for managing connections with one or more playback tools (214). FIG. 4 shows an example decoder system (400) that can be included in the playback tool (214). Alternatively, the playback tool (214) uses another decoder system. A playback tool (214) can also include client-side controller logic for managing connections with the encoding tool (212).

In some cases, the use of a syntax element to indicate latency (for example, frame reordering latency) is specific to a particular standard or format. For example, encoded data can contain one or more syntax elements that indicate a constraint on latency as part of the syntax of an elementary coded video bitstream defined according to the standard or format, or as defined media metadata relating to the encoded data. In such cases, the real-time communication tool (210), encoding tool (212), and/or playback tool (214) with reduced latency may be codec-dependent, in that the decisions they make can depend on the bitstream syntax for a particular standard or format.

In other cases, the use of a syntax element to indicate a constraint on latency (for example, frame reordering latency) is outside a particular standard or format. For example, the syntax element(s) that indicate the constraint on latency can be signaled as part of the syntax of a media transmission stream, a media storage file or, more generally, a media system multiplexing protocol or transport protocol. Or, the syntax element(s) that indicate latency can be negotiated between real-time communication tools (210), encoding tools (212), and/or playback tools (214) according to a media property negotiation protocol. In such cases, the real-time communication tool (210), encoding tool (212), and playback tool (214) with reduced latency may be codec-independent, in that they can work with any available video encoder and decoder, assuming a level of control over inter-frame dependencies that is set during encoding.

III. Example Encoder System

FIG. 3 is a block diagram of an example encoder system (300) in conjunction with which some described embodiments may be implemented. The encoder system (300) can be a general-purpose encoding tool capable of operating in any of multiple encoding modes, such as a low-latency encoding mode for real-time communication, a transcoding mode, and a regular encoding mode for media playback from a file or stream, or it can be a special-purpose encoding tool adapted for one such encoding mode. The encoder system (300) can be implemented as an operating system module, as part of an application library, or as a standalone application. Overall, the encoder system (300) receives a sequence of source video frames (311) from a video source (310) and produces encoded data as output to a channel (390). The encoded data output to the channel can include one or more syntax elements that indicate a constraint on latency (for example, frame reordering latency) to facilitate reduced-latency decoding.

The video source (310) can be a camera, tuner card, storage media, or other digital video source. The video source (310) produces a sequence of video frames at a frame rate of, for example, 30 frames per second. As used herein, the term "frame" generally refers to source, coded, or reconstructed image data. For progressive video, a frame is a progressive video frame. For interlaced video, in example embodiments, an interlaced video frame is de-interlaced prior to encoding. Alternatively, two complementary interlaced video fields are encoded as an interlaced video frame or as separate fields. Aside from indicating a progressive video frame, the term "frame" can indicate a single non-paired video field, a complementary pair of video fields, a video object plane that represents a video object at a given time, or a region of interest in a larger image. The video object plane or region can be part of a larger image that includes multiple objects or regions of a scene.

An arriving source frame (311) is stored in a source frame temporary memory storage area (320) that includes multiple frame buffer storage areas (321, 322, ..., 32n). A frame buffer (321, 322, etc.) holds one source frame in the source frame storage area (320). After one or more of the source frames (311) have been stored in frame buffers (321, 322, etc.), a frame selector (330) periodically selects an individual source frame from the source frame storage area (320). The order in which the frame selector (330) selects frames for input to the encoder (340) may differ from the order in which the video source (310) produces the frames; for example, a frame may be moved ahead in order to facilitate temporally backward prediction. Before the encoder (340), the encoder system (300) can include a pre-processor (not shown) that performs pre-processing (e.g., filtering) of the selected frame (331) before encoding.
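To make the reordering concrete, the following Python sketch shows one way a frame selector could hand frames to an encoder in an order that differs from source order. It assumes a simple, hypothetical pattern in which every anchor frame is selected first and the frames that lie between it and the previous anchor follow, so that those in-between frames can be predicted from a frame that is later in source order. The function name and the `frames_between` parameter are illustrative assumptions; an actual frame selector may use any selection policy.

```python
def selection_order(num_frames, frames_between):
    """Return source-frame indices in the order they are given to the encoder.

    Frame 0 is selected first. After that, each anchor frame is selected
    ahead of the frames_between frames that precede it in source order,
    so those frames can use temporally backward prediction.
    """
    if num_frames <= 0:
        return []
    order = [0]
    prev_anchor = 0
    while prev_anchor < num_frames - 1:
        # The next anchor is frames_between + 1 frames later, clamped to
        # the last frame of the sequence.
        anchor = min(prev_anchor + frames_between + 1, num_frames - 1)
        order.append(anchor)
        order.extend(range(prev_anchor + 1, anchor))
        prev_anchor = anchor
    return order
```

With seven source frames and two frames between anchors, `selection_order(7, 2)` returns `[0, 3, 1, 2, 6, 4, 5]`. Setting `frames_between` to 0 leaves the frames in source order, which is the arrangement with no frame reordering latency.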

The encoder (340) encodes the selected frame (331) to produce a coded frame (341) and also produces memory management control signals (342). If the current frame is not the first frame that has been encoded, when performing its encoding process, the encoder (340) may use one or more previously encoded/decoded frames (369) that have been stored in a decoded frame temporary memory storage area (360). Such stored decoded frames (369) are used as reference frames for inter-frame prediction of the content of the current source frame (331). Generally, the encoder (340) includes multiple encoding modules that perform encoding tasks such as motion estimation and compensation, frequency transforms, quantization, and entropy coding. The exact operations performed by the encoder (340) can vary depending on compression format. The format of the output encoded data can be a Windows Media Video format, VC-1 format, MPEG-x format (e.g., MPEG-1, MPEG-2, or MPEG-4), H.26x format (e.g., H.261, H.262, H.263, H.264), HEVC format, or other format.

The coded frames (341) and memory management control signals (342) are processed by a decoding process emulator (350). The decoding process emulator (350) implements some of the functionality of a decoder, for example, decoding tasks to reconstruct reference frames that are used by the encoder (340) in motion estimation and compensation. The decoding process emulator (350) uses the memory management control signals (342) to determine whether a given coded frame (341) needs to be reconstructed and stored for use as a reference frame in inter-frame prediction of subsequent frames to be encoded. If the control signals (342) indicate that a coded frame (341) needs to be stored, the decoding process emulator (350) models the decoding process that would be conducted by a decoder that receives the coded frame (341) and produces a corresponding decoded frame (351). In doing so, when the encoder (340) has used decoded frame(s) (369) that have been stored in the decoded frame storage area (360), the decoding process emulator (350) also uses the decoded frame(s) (369) from the storage area (360) as part of the decoding process.

The decoded frame temporary memory storage area (360) includes multiple frame buffer storage areas (361, 362, ..., 36n). The decoding process emulator (350) uses the memory management control signals (342) to manage the contents of the storage area (360) in order to identify any frame buffers (361, 362, etc.) with frames that are no longer needed by the encoder (340) for use as reference frames. After modeling the decoding process, the decoding process emulator (350) stores a newly decoded frame (351) in a frame buffer (361, 362, etc.) that has been identified in this manner.
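The buffer management described above can be sketched in a few lines of Python. The class below models a fixed pool of frame buffers: a control signal first frees the buffers whose frames are no longer needed as references, and the newly decoded frame is then stored in a freed buffer only if it is itself marked for use as a reference. The class name, the method names, and the shape of the control signal (a boolean plus a set of frame identifiers) are hypothetical and chosen for illustration; the actual memory management control signals depend on the compression format.

```python
class DecodedFrameStore:
    """Fixed pool of frame buffers that hold decoded reference frames."""

    def __init__(self, num_buffers):
        # Each slot is None (free) or a (frame_id, decoded_data) tuple.
        self.buffers = [None] * num_buffers

    def apply_control(self, frame_id, decoded, keep_as_reference, no_longer_needed):
        """Apply one hypothetical memory management control signal.

        no_longer_needed: ids of stored frames that may be discarded.
        keep_as_reference: whether the new frame must be stored.
        Returns the index of the buffer used, or None if nothing was stored.
        """
        # Step 1: identify and free buffers whose frames are not needed.
        for i, slot in enumerate(self.buffers):
            if slot is not None and slot[0] in no_longer_needed:
                self.buffers[i] = None
        if not keep_as_reference:
            return None
        # Step 2: store the newly decoded frame in the first free buffer.
        for i, slot in enumerate(self.buffers):
            if slot is None:
                self.buffers[i] = (frame_id, decoded)
                return i
        raise RuntimeError("no free frame buffer for reference frame")

    def reference_ids(self):
        """Ids of the frames currently available as references."""
        return [slot[0] for slot in self.buffers if slot is not None]
```

Freeing before storing mirrors the order in the description: buffers are identified as reusable first, and the newly decoded frame is placed in one of them afterwards.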

The coded frames (341) and memory management control signals (342) are also buffered in a temporary coded data area (370). The coded data aggregated in the coded data area (370) can contain, as part of the syntax of an elementary coded video bitstream, one or more syntax elements that indicate a constraint on latency. Alternatively, the coded data aggregated in the coded data area (370) can include syntax element(s) that indicate a constraint on latency as part of media metadata relating to the coded video data (e.g., as one or more parameters in one or more supplemental enhancement information ("SEI") messages or video usability information ("VUI") messages).

The aggregated data (371) from the temporary coded data area (370) is processed by a channel encoder (380). The channel encoder (380) can packetize the aggregated data for transmission as a media stream, in which case the channel encoder (380) can add, as part of the syntax of the media transmission stream, syntax element(s) that indicate a constraint on latency. Or, the channel encoder (380) can organize the aggregated data for storage as a file, in which case the channel encoder (380) can add, as part of the syntax of the media storage file, syntax element(s) that indicate a constraint on latency. Or, more generally, the channel encoder (380) can implement one or more media system multiplexing protocols or transport protocols, in which case the channel encoder (380) can add, as part of the syntax of the protocol(s), syntax element(s) that indicate a constraint on latency. The channel encoder (380) provides output to a channel (390), which represents storage, a communications connection, or another channel for the output.

IV. Example Decoder System

FIG. 4 is a block diagram of an example decoder system (400) in conjunction with which some described embodiments may be implemented. The decoder system (400) can be a general-purpose decoding tool capable of operating in any of multiple decoding modes, such as a low-latency decoding mode for real-time communication and a regular decoding mode for media playback from a file or stream, or it can be a special-purpose decoding tool adapted for one such decoding mode. The decoder system (400) can be implemented as an operating system module, as part of an application library, or as a standalone application. Overall, the decoder system (400) receives coded data from a channel (410) and produces reconstructed frames as output for an output destination (490). The coded data can include one or more syntax elements that indicate a constraint on latency (e.g., frame reordering latency) to facilitate reduced-latency decoding.

The decoder system (400) includes a channel (410), which can represent storage, a communications connection, or another channel for coded data as input. The channel (410) produces coded data that has been channel coded. A channel decoder (420) can process the coded data. For example, the channel decoder (420) de-packetizes data that has been aggregated for transmission as a media stream, in which case the channel decoder (420) can parse, as part of the syntax of the media transmission stream, syntax element(s) that indicate a constraint on latency. Or, the channel decoder (420) separates coded video data that has been aggregated for storage as a file, in which case the channel decoder (420) can parse, as part of the syntax of the media storage file, syntax element(s) that indicate a constraint on latency. Or, more generally, the channel decoder (420) can implement one or more media system demultiplexing protocols or transport protocols, in which case the channel decoder (420) can parse, as part of the syntax of the protocol(s), syntax element(s) that indicate a constraint on latency.

The coded data (421) that is output from the channel decoder (420) is stored in a temporary coded data area (430) until a sufficient quantity of such data has been received. The coded data (421) includes coded frames (431) and memory management control signals (432). The coded data (421) in the coded data area (430) can contain, as part of the syntax of an elementary coded video bitstream, one or more syntax elements that indicate a constraint on latency. Or, the coded data (421) in the coded data area (430) can include syntax element(s) that indicate a constraint on latency as part of media metadata relating to the coded video data (e.g., as one or more parameters in one or more SEI messages or VUI messages). In general, the coded data area (430) temporarily stores coded data (421) until such coded data (421) is used by the decoder (450). At that point, coded data for a coded frame (431) and memory management control signals (432) is transferred from the coded data area (430) to the decoder (450). As decoding continues, new coded data is added to the coded data area (430), and the oldest coded data remaining in the coded data area (430) is transferred to the decoder (450).

The decoder (450) periodically decodes a coded frame (431) to produce a corresponding decoded frame (451). As appropriate, when performing its decoding process, the decoder (450) may use one or more previously decoded frames (469) as reference frames for inter-frame prediction. The decoder (450) reads such previously decoded frames (469) from a decoded frame temporary memory storage area (460). Generally, the decoder (450) includes multiple decoding modules that perform decoding tasks such as entropy decoding, inverse quantization, inverse frequency transforms, and motion compensation. The exact operations performed by the decoder (450) can vary depending on the compression format.

The decoded frame temporary memory storage area (460) includes multiple frame buffer storage areas (461, 462, ..., 46n). The decoded frame storage area (460) is an example of a decoded picture buffer. The decoder (450) uses the memory management control signals (432) to identify a frame buffer (461, 462, etc.) in which it can store a decoded frame (451). The decoder (450) stores the decoded frame (451) in that frame buffer.

An output sequencer (480) uses the memory management control signals (432) to identify when the next frame to be produced in output order is available in the decoded frame storage area (460). To reduce the latency of the encoding-decoding system, the output sequencer (480) uses the syntax elements that indicate constraints on latency to expedite identification of frames to be produced in output order. When the next frame (481) to be produced in output order is available in the decoded frame storage area (460), it is read by the output sequencer (480) and output to the output destination (490) (e.g., a display). In general, the order in which frames are output from the decoded frame storage area (460) by the output sequencer (480) may differ from the order in which the frames are decoded by the decoder (450).

V. Syntax Elements That Facilitate Reduced-Latency Encoding and Decoding

In most video codec systems, the coding order (also called the decoding order or bitstream order) is the order in which video frames are represented in the coded data in a bitstream and, hence, processed during decoding. The coding order may differ from the order in which the frames are captured by a camera before encoding and from the order in which decoded frames are displayed, stored, or otherwise output after decoding (the output order or display order). Reordering frames relative to the output order has benefits (primarily in terms of compression capability), but it increases the end-to-end latency of the encoding and decoding processes.

The techniques and tools described herein reduce latency caused by the reordering of video frames and, by providing information about constraints on reordering latency to decoder systems, also facilitate latency reduction by those decoder systems. Such latency reduction is useful for many purposes. For example, it can be used to reduce the time lag that occurs in interactive video communication using a video conferencing system, so that the conversation flow and interactivity of communication between remote participants are more rapid and natural.

A. Approaches to Output Timing and Output Ordering

According to the H.264 standard, a decoder can use two approaches to determine when a decoded frame is ready to be output. A decoder can use timing information in the form of decoding timestamps and output timestamps (e.g., as signaled in picture timing SEI messages). Or, the decoder can use buffering capacity limits signaled with various syntax elements to determine when a decoded frame is ready to be output.

Timing information can be associated with each decoded frame. The decoder can use the timing information to determine when a decoded frame can be output. In practice, however, such timing information may be unavailable to a decoder. Moreover, even when timing information is available, some decoders do not actually use it (e.g., because the decoder has been designed to work regardless of whether timing information is available).

According to the H.264 standard (and draft versions of the HEVC standard), buffering capacity limits are indicated with several syntax elements, including the syntax element max_dec_frame_buffering, the syntax element num_reorder_frames, relative ordering information (termed "picture order count" information), and other memory management control information signaled in the bitstream. The syntax element max_dec_frame_buffering (or the derived variable specified as MaxDpbFrames) specifies the required size of the decoded picture buffer ("DPB") in units of frame buffers. As such, max_dec_frame_buffering expresses a top-level memory capacity for a coded video sequence, so as to enable a decoder to output pictures in the correct order. The syntax element num_reorder_frames (or max_num_reorder_frames) indicates the maximum number of frames (or complementary field pairs, or non-paired fields) that can precede any frame (or complementary field pair, or non-paired field) in coding order and follow it in output order. In other words, num_reorder_frames specifies a constraint on the memory capacity necessary for picture reordering. The syntax element max_num_ref_frames specifies the maximum number of short-term and long-term reference frames (or complementary reference field pairs, or non-paired reference fields) that may be used by the decoding process for inter prediction of any picture in the sequence. The syntax element max_num_ref_frames also determines the size of the sliding window for decoded reference picture marking. Like num_reorder_frames, max_num_ref_frames specifies a constraint on required memory capacity.

A decoder uses the max_dec_frame_buffering (or MaxDpbFrames) and num_reorder_frames syntax elements to determine when a buffering capacity limit has been exceeded. This happens, for example, when a newly decoded frame needs to be stored in the DPB but no space remains available in the DPB. In this situation, the decoder uses picture order count information to identify, among the pictures that have been decoded, which one is earliest in output order. The picture that is earliest in output order is then output. Such processing is sometimes called "bumping" because the arrival of a new picture that needs to be stored "bumps" a picture out of the DPB.
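The bumping behavior described above can be sketched roughly as follows. This is a simplified illustration, not the normative H.264 process: the DPB is modeled as a plain list of picture order count values, a picture is assumed to leave the buffer as soon as it is output (reference marking is ignored), and the function name, the capacity of 3, and the input order are all hypothetical.

```python
def store_with_bumping(dpb, new_poc, capacity):
    """Store a newly decoded picture, bumping out earlier pictures if the DPB is full.

    dpb      -- list of picture order count (POC) values currently held (modified in place)
    new_poc  -- POC of the newly decoded picture
    capacity -- DPB size in frames (stands in for max_dec_frame_buffering)
    Returns the POCs that were output ("bumped") to make room.
    """
    bumped = []
    while len(dpb) >= capacity:
        earliest = min(dpb)        # picture that is earliest in output order
        dpb.remove(earliest)       # simplification: output also frees the buffer
        bumped.append(earliest)
    dpb.append(new_poc)
    return bumped


dpb = []
outputs = []
# Pictures arrive in coding order; the values are their output-order positions.
for poc in [0, 8, 1, 2, 3]:
    outputs += store_with_bumping(dpb, poc, capacity=3)

print(outputs)       # [0, 1]
print(sorted(dpb))   # [2, 3, 8]
```

In this toy run nothing is output until the buffer is full, which is the source of the delay discussed in the following paragraph.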

The information indicated with the max_dec_frame_buffering (or MaxDpbFrames) and num_reorder_frames syntax elements suffices for determining the memory capacity needed in a decoder. When used to control the "bumping" process for picture output, however, use of such information can introduce unnecessary latency. As defined in the H.264 standard, the max_dec_frame_buffering and num_reorder_frames syntax elements do not establish a limit on the amount of reordering that can be applied to any particular picture and, hence, do not establish a limit on end-to-end latency. Regardless of the values of these syntax elements, a particular picture can be kept in the DPB for an arbitrarily long time before it is output, which corresponds to substantial latency added by pre-buffering of the source pictures by an encoder.

B. Syntax Elements That Indicate Constraints on Frame Reordering Latency

The techniques and tools described herein reduce latency in video communication systems. An encoding tool, real-time communication tool, or other tool sets a limit on the extent of reordering that can be applied to any frame in a coded video sequence. For example, the limit can be expressed as the number of frames that can precede any given frame in a coded video sequence in output order and follow that frame in coding order. The limit constrains the reordering latency allowed for any particular frame in the sequence. Stated differently, the limit constrains the temporal extent of reordering (in terms of frames) between coding order and output order that can be applied to any particular frame. Limiting the extent of reordering helps reduce end-to-end delay. Also, establishing such a limit can be useful in real-time system negotiation protocols or in application specifications for use scenarios in which reducing latency is important.

One or more syntax elements indicate the constraint on frame reordering latency. Signaling a constraint on frame reordering latency facilitates system-level negotiation for interactive real-time communication or other use scenarios. It provides a way to directly express constraints on frame reordering latency and to characterize properties of a media stream or session.

A video decoder can use an indicated constraint on frame reordering latency to enable reduced-latency output of decoded video frames. In particular, compared to the frame "bumping" process, signaling a constraint on frame reordering latency enables a decoder to identify frames in the DPB that are ready to be output more simply and quickly. For example, a decoder can determine the latency status of a frame in the DPB by computing the difference between the coding order and the output order for the frame. By comparing the latency status of the frame to the constraint on frame reordering latency, the decoder can determine when the constraint has been reached. The decoder can immediately output any frame that has reached this limit. This can help the decoder identify frames that are ready for output more rapidly than a "bumping" process that relies on a variety of syntax elements and tracking structures. In this way, the decoder can rapidly (and earlier) determine when a decoded frame can be output. The more quickly (and earlier) the decoder can identify when frames can be output, the more quickly (and earlier) the decoder can output video to a display or a subsequent processing stage.

Thus, using a constraint on frame reordering latency, a decoder can begin outputting frames from the decoded frame storage area before that storage area is full, while still providing conformant decoding (i.e., decoding all frames such that they are bit-exact matches of frames decoded using another, conventional scheme). This significantly reduces delay when the delay (in frames) indicated by the latency syntax element is much smaller than the size (in frames) of the decoded frame storage area.

FIGS. 5a-5e illustrate series (501-505) of frames having different inter-frame dependencies. The series are characterized by different values for (1) the constraint on the memory capacity necessary for picture reordering (that is, the number of frame buffers used to store reference frames for purposes of reordering, e.g., as indicated with the syntax element num_reorder_frames), and (2) the constraint on frame reordering latency, e.g., as specified by a variable MaxLatencyFrames. In FIGS. 5a-5e, for a given frame Fⱼᵏ, the subscript j indicates the position of the frame in output order and the superscript k indicates the position of the frame in coding order. The frames are shown in output order: the output order subscript value increases from left to right. Arrows illustrate inter-frame dependencies for motion compensation, according to which preceding frames in coding order are used for prediction of subsequent frames in coding order. For simplicity, FIGS. 5a-5e show inter-frame dependencies at the frame level (and not at the level of macroblocks, blocks, etc., at which reference frames can change), and FIGS. 5a-5e show at most two frames as reference frames for a given frame. In practice, in some implementations, different macroblocks, blocks, etc. in a given frame can use different reference frames, and more than two reference frames can be used for the given frame.

In FIG. 5a, the series (501) includes nine frames. The last frame in output order, F₈¹, uses the first frame F₀⁰ as a reference frame. The other frames in the series (501) use both the last frame F₈¹ and the first frame F₀⁰ as reference frames. This means that frame F₀⁰ is decoded first, followed by frame F₈¹, then frame F₁², and so on. In the series (501) shown in FIG. 5a, the value of num_reorder_frames is 1. At any point in the processing of the decoder system, among the frames shown in FIG. 5a, only one frame (F₈¹) is stored in the decoded frame storage area for reordering purposes. (The first frame F₀⁰ is also used as a reference frame and is stored, but it is not stored for reordering purposes. Because the output order of the first frame F₀⁰ is earlier than the output order of the intermediate frames, the first frame F₀⁰ is not counted for purposes of num_reorder_frames.) Despite the low value of num_reorder_frames, the series (501) has a relatively high latency: the value of MaxLatencyFrames is 7. After encoding the first frame F₀⁰, the encoder waits until it has buffered eight more source frames before encoding the next frame in output order, F₁², because that frame depends on the last frame F₈¹ in the series (501). The value of MaxLatencyFrames is effectively the maximum allowed difference between the subscript value and the superscript value for any particular coded frame.
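As a rough check of these two values, the sketch below computes both quantities for the series (501) from (output position, coding position) pairs. It assumes that "and so on" means the intermediate frames are coded as F₁², F₂³, ..., F₇⁸; the function name and data layout are illustrative only.

```python
def reorder_metrics(frames):
    """frames: list of (output_pos, coding_pos) pairs, one pair per frame.

    Returns (num_reorder_frames, max_latency_frames) for the series.
    """
    num_reorder = 0
    max_latency = 0
    for j, k in frames:
        # Frames that precede this frame in coding order but follow it in output order.
        held_for_reordering = sum(1 for j2, k2 in frames if k2 < k and j2 > j)
        num_reorder = max(num_reorder, held_for_reordering)
        # Difference between the subscript (output) and superscript (coding) values.
        max_latency = max(max_latency, j - k)
    return num_reorder, max_latency


# Series (501) of FIG. 5a: F0^0, F8^1, then (assumed) F1^2 through F7^8.
series_501 = [(0, 0), (8, 1)] + [(j, j + 1) for j in range(1, 8)]
print(reorder_metrics(series_501))   # (1, 7)
```

The result shows how a series can need only one reordering buffer and still carry a latency of seven frames.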

In FIG. 5b, the series (502) includes nine frames, like the series (501) in FIG. 5a, but the inter-frame dependencies are different. Temporal reordering of frames occurs over a short extent. As a result, the series (502) has a much lower latency: the value of MaxLatencyFrames is 1. The value of num_reorder_frames is still 1.

In FIG. 5c, the series (503) includes ten frames. The longest inter-frame dependency is shorter (in temporal extent) than the longest inter-frame dependency in FIG. 5a, but longer than the longest inter-frame dependency in FIG. 5b. The series (503) has the same low value of 1 for num_reorder_frames, and it has a relatively low value of 2 for MaxLatencyFrames. The series (503) therefore allows a lower end-to-end latency than the series (501) of FIG. 5a, although not as low as the latency allowed by the series (502) of FIG. 5b.

In FIG. 5d, the series (504) includes frames organized in a temporal hierarchy with three temporal layers according to inter-frame dependencies. The lowest temporal resolution layer includes the first frame F₀⁰ and the last frame F₈¹. The next temporal resolution layer adds the frame F₄², which depends on the first frame F₀⁰ and the last frame F₈¹. The highest temporal resolution layer adds the remaining frames. The series (504) shown in FIG. 5d has a relatively low value of 2 for num_reorder_frames but a relatively high value of 7 for MaxLatencyFrames, at least for the highest temporal resolution layer, due to the difference between coding order and output order for the last frame F₈¹. If only the intermediate temporal resolution layer or the lowest temporal resolution layer is decoded, the constraint on frame reordering latency can be reduced to 1 (for the intermediate layer) or 0 (for the lowest layer). To facilitate reduced-latency decoding at various temporal resolutions, syntax elements can indicate constraints on frame reordering latency for different layers in a temporal hierarchy.

In FIG. 5e, the series (505) includes frames organized in a temporal hierarchy with three temporal layers according to different inter-frame dependencies. The lowest temporal resolution layer includes the first frame F₀⁰, the middle frame F₄¹, and the last frame F₈⁵. The next temporal resolution layer adds the frames F₂² (which depends on the first frame F₀⁰ and the middle frame F₄¹) and F₆⁶ (which depends on the middle frame F₄¹ and the last frame F₈⁵). The highest temporal resolution layer adds the remaining frames. Compared to the series (504) of FIG. 5d, the series (505) of FIG. 5e still has a relatively low value of 2 for num_reorder_frames but has a lower value of 3 for MaxLatencyFrames, at least for the highest temporal resolution layer, due to the difference between coding order and output order for the middle frame F₄¹ and the last frame F₈⁵. If only the intermediate temporal resolution layer or the lowest temporal resolution layer is decoded, the constraint on frame reordering latency can be reduced to 1 (for the intermediate layer) or 0 (for the lowest layer).

In the examples shown in FIGS. 5a-5e, if the value of MaxLatencyFrames is known, a decoder can identify certain frames as being ready for immediate output upon receipt of the preceding frame in output order. For a given frame, the frame's output order value minus the frame's coding order value may be equal to the value of MaxLatencyFrames. In this case, the given frame is ready for output as soon as the frame preceding it in output order has been received. (In contrast, using num_reorder_frames alone, such a frame could not be identified as ready for output until an additional frame was received or the end of the sequence was reached.) In particular, a decoder can use the value of MaxLatencyFrames to enable earlier output of the following frames:

• In the series (501) of FIG. 5a, the frame F₈¹.

• In the series (502) of FIG. 5b, the frames F₂¹, F₄³, F₆⁵, and F₈⁷.

• In the series (503) of FIG. 5c, the frames F₃¹, F₆⁴, and F₉⁷.

• In the series (504) of FIG. 5d, the frame F₈¹.

• In the series (505) of FIG. 5e, the frames F₄¹ and F₈⁵.
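The selection rule behind this list (output position minus coding position equals MaxLatencyFrames) can be reproduced with a short sketch for the first two series. The full coding orders are not spelled out in the text, so the positions used for the intermediate frames of the series (501) and (502) are assumptions inferred from the frames that are named; the function name is hypothetical.

```python
def early_output_frames(frames, max_latency_frames):
    """Return the (output_pos, coding_pos) pairs whose difference equals the limit."""
    return [(j, k) for j, k in frames if j - k == max_latency_frames]


# FIG. 5a: F0^0, F8^1, then (assumed) F1^2 through F7^8.
series_501 = [(0, 0), (8, 1)] + [(j, j + 1) for j in range(1, 8)]

# FIG. 5b: neighbouring frames swapped pairwise (assumed from F2^1, F4^3, F6^5, F8^7).
series_502 = [(0, 0), (2, 1), (1, 2), (4, 3), (3, 4),
              (6, 5), (5, 6), (8, 7), (7, 8)]

print(early_output_frames(series_501, 7))   # [(8, 1)]
print(early_output_frames(series_502, 1))   # [(2, 1), (4, 3), (6, 5), (8, 7)]
```

Each returned pair corresponds to one frame in the first two bullets above.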

In addition, declaring or negotiating the value of MaxLatencyFrames at the system level can provide a summary expression of the latency characteristics of a bitstream or session, in a way that measuring reordering storage capacity and indicating that capacity with num_reorder_frames cannot.

C. Example Implementations

Syntax elements that indicate a constraint on frame reordering latency can be signaled in various ways, depending on implementation. The syntax elements can be signaled as part of a sequence parameter set ("SPS"), a picture parameter set ("PPS"), or another element of the bitstream, signaled as part of SEI messages, VUI messages, or other metadata, or signaled in some other way. In any of the implementations, a syntax element indicating a constraint value can be encoded using unsigned exponential-Golomb coding, some other form of entropy coding, or fixed-length coding, and then signaled. A decoder performs the corresponding decoding after receiving the syntax element.
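For reference, unsigned exponential-Golomb coding can be sketched as follows. This is a minimal illustration that represents the bitstream as a Python string of '0' and '1' characters; the function names are hypothetical, and a real codec would operate on packed bits.

```python
def ue_encode(value):
    """Encode a non-negative integer as an unsigned Exp-Golomb bit string."""
    if value < 0:
        raise ValueError("unsigned Exp-Golomb encodes non-negative integers only")
    code = bin(value + 1)[2:]              # binary representation of value + 1
    return "0" * (len(code) - 1) + code    # prefix of zeros, then the code itself


def ue_decode(bits, pos=0):
    """Decode one value starting at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":        # count the leading zeros
        zeros += 1
    end = pos + 2 * zeros + 1              # zeros + 1 bits follow the prefix
    return int(bits[pos + zeros:end], 2) - 1, end


print(ue_encode(0), ue_encode(1), ue_encode(7))   # 1 010 0001000
print(ue_decode("0001000"))                        # (7, 7)
```

Small values take few bits, which suits latency limits that are typically only a handful of frames.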

In a first implementation, a flag max_latency_limitation_flag is signaled. If the flag has a first binary value (e.g., 0), no constraint on frame reordering latency is imposed. In this case, the value of a max_latency_frames syntax element is not signaled or is ignored. Otherwise (the flag having a second binary value such as 1), the value of the max_latency_frames syntax element is signaled to indicate the constraint on frame reordering latency. For example, in this case, the value signaled for the max_latency_frames syntax element can be any non-negative integer value.
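A parser for this first implementation might look like the sketch below. The bit layout is an assumption made for illustration: a one-bit flag followed directly by max_latency_frames in unsigned exponential-Golomb form, with the bitstream given as a string of '0' and '1' characters. The function names are hypothetical.

```python
def read_ue(bits, pos):
    """Read one unsigned Exp-Golomb value at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def parse_latency_limit(bits, pos=0):
    """Return (max_latency_frames, next_pos); the limit is None when no constraint applies."""
    flag = bits[pos] == "1"        # max_latency_limitation_flag
    pos += 1
    if not flag:
        return None, pos           # no constraint; max_latency_frames is not read
    return read_ue(bits, pos)      # max_latency_frames, any non-negative integer


print(parse_latency_limit("0"))      # (None, 1)
print(parse_latency_limit("1011"))   # (2, 4)
```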

In a second implementation, a syntax element max_latency_frames_plus1 is signaled to indicate the constraint on frame reordering latency. If max_latency_frames_plus1 has a first value (e.g., 0), no constraint on frame reordering latency is imposed. For other values (e.g., non-zero values), the value of the constraint on frame reordering latency is set to max_latency_frames_plus1 - 1. For example, the value of max_latency_frames_plus1 is in the range of 0 to 2³² - 2, inclusive.
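The mapping in this second implementation is small enough to state directly. The sketch below uses None to stand for "no constraint", which is a choice made here for illustration; the function name is hypothetical.

```python
MAX_PLUS1 = 2**32 - 2   # upper end of the example range given in the text


def latency_limit_from_plus1(max_latency_frames_plus1):
    """Map max_latency_frames_plus1 to a latency limit in frames, or None if unconstrained."""
    if not 0 <= max_latency_frames_plus1 <= MAX_PLUS1:
        raise ValueError("max_latency_frames_plus1 out of range")
    if max_latency_frames_plus1 == 0:
        return None                          # first value: no constraint imposed
    return max_latency_frames_plus1 - 1      # other values: limit is the value minus 1


print(latency_limit_from_plus1(0))   # None
print(latency_limit_from_plus1(1))   # 0
print(latency_limit_from_plus1(8))   # 7
```

Reserving 0 for "no constraint" lets a single syntax element carry both the on/off decision and the limit, including a limit of zero frames.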

Similarly, in a third implementation, a syntax element, max_latency_frames, is signaled to indicate the constraint on the frame reordering delay. If max_latency_frames has a first value (e.g., a maximum value), no constraint is imposed on the frame reordering delay. For other values (e.g., values less than the maximum value), the value of the constraint on the frame reordering delay is set to max_latency_frames.
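The second and third implementations both reserve one value of the syntax element to mean "no constraint". A short sketch of the two mappings follows; the function names are hypothetical, `None` stands in for "no constraint", and `max_value` is an assumed parameter because the text does not fix the maximum value used in the third implementation.

```python
from typing import Optional


def constraint_from_plus1(max_latency_frames_plus1: int) -> Optional[int]:
    """Second implementation: 0 means unconstrained, otherwise value - 1."""
    if not 0 <= max_latency_frames_plus1 <= 2**32 - 2:
        raise ValueError("max_latency_frames_plus1 out of range")
    if max_latency_frames_plus1 == 0:
        return None
    return max_latency_frames_plus1 - 1


def constraint_from_max_sentinel(max_latency_frames: int, max_value: int) -> Optional[int]:
    """Third implementation: the maximum value means unconstrained."""
    if not 0 <= max_latency_frames <= max_value:
        raise ValueError("max_latency_frames out of range")
    if max_latency_frames == max_value:
        return None
    return max_latency_frames
```

Reserving 0 in the "plus1" form keeps the unconstrained case as cheap as possible under a variable-length code, at the cost of one extra subtraction on the decoder side.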

In a fourth implementation, the constraint on the frame reordering delay is indicated relative to the maximum size of the frame memory. For example, the delay constraint is signaled as an increase relative to the num_reorder_frames syntax element. Typically, the constraint on the frame reordering delay (in frames) is greater than or equal to num_reorder_frames. To save bits when signaling the delay constraint, the difference between the delay constraint and num_reorder_frames is encoded (e.g., using unsigned exponential-Golomb coding or some other form of entropy coding) and then signaled. The syntax element max_latency_increase_plus1 is signaled to indicate the constraint on the frame reordering delay. If max_latency_increase_plus1 has a first value (e.g., 0), no constraint is imposed on the frame reordering delay. For other values (e.g., non-zero values), the value of the constraint on the frame reordering delay is set to num_reorder_frames + max_latency_increase_plus1 - 1. For example, the value of max_latency_increase_plus1 is in the range of 0 to 2³² - 2, inclusive.
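The arithmetic of the fourth implementation can be sketched from both sides. The function names below are hypothetical, and `None` is again an assumed representation of "no constraint"; only the formula num_reorder_frames + max_latency_increase_plus1 - 1 comes from the text.

```python
from typing import Optional


def constraint_from_increase(num_reorder_frames: int,
                             max_latency_increase_plus1: int) -> Optional[int]:
    """Decoder side: recover the delay constraint from the signaled increase."""
    if not 0 <= max_latency_increase_plus1 <= 2**32 - 2:
        raise ValueError("max_latency_increase_plus1 out of range")
    if max_latency_increase_plus1 == 0:
        return None  # no constraint on the frame reordering delay
    return num_reorder_frames + max_latency_increase_plus1 - 1


def increase_plus1_for(num_reorder_frames: int, constraint: Optional[int]) -> int:
    """Encoder side: compute the value to signal for a chosen constraint."""
    if constraint is None:
        return 0
    if constraint < num_reorder_frames:
        raise ValueError("constraint is expected to be >= num_reorder_frames")
    return constraint - num_reorder_frames + 1
```

With num_reorder_frames equal to 2, a signaled value of 4 gives a constraint of 5 frames, and a constraint equal to num_reorder_frames is signaled as 1.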

Alternatively, the one or more syntax elements indicating the constraint on the frame reordering delay are signaled in some other way.

D. Other Ways to Indicate Constraints on Latency

In many of the preceding examples, the constraint on latency is a constraint on frame reordering delay expressed as a frame count. More generally, the constraint on latency is a constraint on delay that can be expressed as a frame count or in seconds, milliseconds, or another time measure. For example, the constraint on latency can be expressed as an absolute time measure such as 1 second or 0.5 seconds. An encoder can convert such a time measure into a frame count (taking into account the frame rate of the video) and then encode the video so that the inter-frame dependencies between the frames of the video sequence are consistent with that frame count. Alternatively, regardless of frame reordering and inter-frame dependencies, the encoder can use the time measure to limit the extent to which delay is used to smooth out short-term fluctuations in the bit rate of the encoded video, encoding complexity, network bandwidth, and so on. A decoder can use the time measure to determine when a frame can be output from the decoded picture buffer.
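The conversion from a time measure to a frame count might look like the sketch below. The function name is hypothetical, and rounding down is an assumption made here so that the resulting frame count never exceeds the time budget; the text does not say how rounding should be handled.

```python
import math
from fractions import Fraction


def latency_frames_from_seconds(latency_seconds, frame_rate) -> int:
    """Convert a latency budget in seconds to a whole number of frames.

    Both arguments may be int, Fraction, or a decimal string such as "0.5";
    exact rational arithmetic avoids floating-point rounding surprises.
    """
    budget = Fraction(latency_seconds) * Fraction(frame_rate)
    if budget < 0:
        raise ValueError("latency and frame rate must be non-negative")
    return math.floor(budget)  # assumed policy: never exceed the time budget
```

For instance, a 0.5 second budget at 30 frames per second corresponds to 15 frames, while a 1 second budget at 30000/1001 frames per second corresponds to 29 frames.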

The constraint on latency can be negotiated between the transmitter side and the receiver side in order to trade off responsiveness (lack of delay) against the ability to smooth short-term fluctuations in the bit rate of the encoded video, the ability to smooth short-term fluctuations in encoding complexity, the ability to smooth short-term fluctuations in network bandwidth, and/or other factors that benefit from increased delay. In such negotiations, it may be helpful to establish and characterize the constraint on latency in a way that is independent of frame rate. The constraint can then be applied during encoding and decoding with the frame rate of the video taken into account. Alternatively, the constraint can be applied during encoding and decoding regardless of the frame rate of the video.

E. Generalized Techniques for Setting and Outputting Syntax Elements

FIG. 6 shows an example technique (600) for setting and outputting syntax elements that facilitate reduced-latency decoding. For example, the real-time communication tool or encoding tool described with reference to FIGS. 2a and 2b performs the technique (600). Alternatively, another tool performs the technique (600).

First, the tool sets (610) one or more syntax elements that indicate a constraint on latency (e.g., frame reordering delay, or delay expressed as a time measure) consistent with the inter-frame dependencies between multiple frames of a video sequence. When the tool includes a video encoder, the same tool can also receive the frames, encode the frames to produce encoded data (using inter-frame dependencies that are consistent with the constraint on frame reordering delay), and output the encoded data for storage or transmission.

Typically, the constraint on frame reordering delay is a reordering delay that is allowed for any frame in the video sequence. The constraint can be expressed in various ways, however, and the different ways carry different meanings. For example, the constraint can be expressed as a maximum count of frames that can precede a given frame in output order but follow the given frame in coding order. Or, the constraint can be expressed as a maximum difference between the coding order and the output order of any frame in the video sequence. Or, focusing on an individual frame, the constraint can be expressed as the reordering delay associated with a given, specific frame in the video sequence. Or, focusing on a group of frames, the constraint can be expressed as the reordering delay associated with that group of frames in the video sequence. Or, the constraint can be expressed in some other way.
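To make the first way of expressing the constraint concrete, the sketch below measures it for a given frame series: for each frame, it counts the frames that follow it in coding order yet precede it in output order, and takes the maximum. The function name and the input convention (a list of output-order indices, listed in coding order) are assumptions made for this example.

```python
from typing import Sequence


def max_frames_preceding_in_output(output_index_in_coding_order: Sequence[int]) -> int:
    """Largest count of frames that follow a frame in coding order
    but precede it in output order, over all frames in the series."""
    series = list(output_index_in_coding_order)
    worst = 0
    for position, output_index in enumerate(series):
        # Frames coded later than this one that must be output before it.
        count = sum(1 for later in series[position + 1:] if later < output_index)
        worst = max(worst, count)
    return worst
```

A series coded in output order, such as `[0, 1, 2, 3]`, yields 0. A series coded as `[0, 3, 1, 2]` yields 2, because the frame with output index 3 is followed in coding order by two frames that are output before it.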

Next, the tool outputs (620) the syntax element(s). This facilitates determining when a reconstructed frame is ready for output in terms of the output order of the multiple frames. The syntax element(s) can be output as part of a sequence parameter set or picture parameter set in an elementary coded video bitstream, as part of the syntax of a media storage file or media transport stream that also contains the encoded data for the frames, as part of a media feature negotiation protocol (e.g., during the exchange of stream or session parameter values in a system-level negotiation), as part of media system information multiplexed with the encoded data for the frames, or as part of media metadata related to the encoded data for the frames (e.g., in an SEI message or a VUI message). Different syntax elements can be output to indicate memory capacity requirements. For example, a buffer size syntax element (such as max_dec_frame_buffering) can indicate the maximum size of the DPB, and a frame memory syntax element (such as num_reorder_frames) can indicate the maximum size of the frame memory used for reordering.

The value of the constraint on latency can be represented in various ways, as described in Section V.C. For example, the tool outputs a flag that indicates the presence or absence of the syntax element(s). If the flag indicates that the syntax element(s) are absent, the constraint on latency is undefined or has a default value. Otherwise, the syntax element(s) follow and indicate the constraint on latency. Or, one value of the syntax element(s) indicates that the constraint on latency is undefined or has a default value, and the other possible values of the syntax element(s) indicate an integer count for the constraint on latency. Or, for the case in which the constraint on latency is a constraint on frame reordering delay, a given value of the syntax element(s) indicates an integer count for the constraint on frame reordering delay relative to the maximum size of the frame memory used for reordering, where that maximum size is indicated by a different syntax element such as num_reorder_frames. Alternatively, the constraint on latency is represented in some other way.

In some implementations, the frames of the video sequence are organized according to a temporal hierarchy. In this case, different syntax elements can indicate different constraints on frame reordering delay for different temporal layers of the temporal hierarchy.

F. Generalized Techniques for Receiving and Using Syntax Elements

FIG. 7 shows an example technique (700) for receiving and using syntax elements that facilitate reduced-latency decoding. For example, the real-time communication tool or playback tool described with reference to FIGS. 2a and 2b performs the technique (700). Alternatively, another tool performs the technique (700).

First, the tool receives and parses (710) one or more syntax elements that indicate a constraint on latency (e.g., frame reordering delay, or delay expressed as a time measure). For example, the parsing includes reading the one or more syntax elements that indicate the constraint on latency from the bitstream. The tool also receives (720) encoded data for multiple frames of a video sequence. The tool can parse the syntax element(s) and, based on the syntax element(s), determine the constraint on latency. Typically, the constraint on frame reordering delay is a reordering delay that is allowed for any frame in the video sequence. The constraint can be expressed in various ways, however, and the different ways carry different meanings, as described in the previous section. The syntax element(s) can be signaled as part of a sequence parameter set or picture parameter set in an elementary coded video bitstream, as part of the syntax of a media storage file or media transport stream, as part of a media feature negotiation protocol, as part of media system information multiplexed with the encoded data, or as part of media metadata related to the encoded data. The tool can also receive and parse different syntax elements that indicate memory capacity requirements, for example, a buffer size syntax element such as max_dec_frame_buffering and a frame memory syntax element such as num_reorder_frames.

The value of the constraint on latency can be represented in various ways, as described in Section V.C. For example, the tool receives a flag that indicates the presence or absence of the syntax element(s). If the flag indicates that the syntax element(s) are absent, the constraint on latency is undefined or has a default value. Otherwise, the syntax element(s) follow and indicate the constraint on latency. Or, one value of the syntax element(s) indicates that the constraint on latency is undefined or has a default value, and the other possible values of the syntax element(s) indicate an integer count for the constraint on latency. Or, for the case in which the constraint on latency is a constraint on frame reordering delay, a given value of the syntax element(s) indicates an integer count for the constraint on frame reordering delay relative to the maximum size of the frame memory used for reordering, where that maximum size is indicated by a different syntax element such as num_reorder_frames. Alternatively, the constraint on latency is signaled in some other way.

Returning to FIG. 7, the tool decodes (730) at least some of the encoded data to reconstruct one of the frames. The tool outputs (740) the reconstructed frame. In doing so, the tool can use the constraint on latency to determine when the reconstructed frame is ready for output, for example, in terms of the output order of the frames of the video sequence.
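The text does not spell out the output rule itself, so the following is only a hypothetical sketch of how a decoder might combine num_reorder_frames with a latency constraint. It assumes a well-formed stream in which those two values are sufficient to preserve output order, identifies each frame by its output-order index, and releases the earliest waiting frame whenever more frames are waiting than num_reorder_frames allows or some waiting frame has been followed by `max_latency_frames` decoded frames. All function and variable names are illustrative.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def decode_and_output(coding_order: Sequence[int],
                      num_reorder_frames: int,
                      max_latency_frames: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return (decode_step, output_index) pairs in the order frames are output.

    coding_order lists each frame's output-order index in coding order.
    max_latency_frames of None means no latency constraint was signaled.
    """
    waiting: Dict[int, int] = {}   # output index -> frames decoded since this one
    emitted: List[Tuple[int, int]] = []
    for step, output_index in enumerate(coding_order):
        for key in waiting:
            waiting[key] += 1      # one more frame has been decoded after each waiting frame
        waiting[output_index] = 0
        while waiting and (
            len(waiting) > num_reorder_frames
            or (max_latency_frames is not None
                and max(waiting.values()) >= max_latency_frames)
        ):
            first = min(waiting)   # earliest waiting frame in output order
            del waiting[first]
            emitted.append((step, first))
    # End of sequence: flush whatever is still waiting, in output order.
    for first in sorted(waiting):
        emitted.append((len(coding_order), first))
    return emitted
```

In this sketch, a stream coded in output order with num_reorder_frames signaled as 2 and no latency constraint holds frame 0 back until the third frame has been decoded. With a latency constraint of 0 each frame is output as soon as it is decoded, which is the kind of delay reduction the syntax elements are meant to enable.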

In some implementations, the frames of the video sequence are organized according to a temporal hierarchy. In this case, different syntax elements can indicate different constraints on frame reordering delay for different temporal layers of the temporal hierarchy. The tool can select one of the different constraints on frame reordering delay depending on the temporal resolution of the output.
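A minimal sketch of that selection follows, assuming the per-layer constraints have already been parsed into a list indexed by temporal layer (layer 0 being the base layer) and that the temporal resolution of the output is identified by the highest temporal layer that is output. Both conventions, and the function name, are assumptions for illustration.

```python
from typing import Optional, Sequence


def select_constraint(constraints_by_layer: Sequence[Optional[int]],
                      highest_output_layer: int) -> Optional[int]:
    """Pick the frame reordering delay constraint for the chosen temporal resolution."""
    if not 0 <= highest_output_layer < len(constraints_by_layer):
        raise ValueError("no constraint was signaled for that temporal layer")
    return constraints_by_layer[highest_output_layer]
```

With constraints of `[0, 1, 3]` for three temporal layers, a tool that outputs only the base layer would use 0, while a tool that outputs all three layers would use 3.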

In view of the many possible embodiments to which the principles of the disclosed invention may be applied, it should be recognized that the illustrated embodiments are only preferred examples of the invention and should not be taken as limiting the scope of the invention. Rather, the scope of the invention is defined by the appended claims. I therefore claim as my invention all that comes within the scope and spirit of these claims.

Claims

1. A method in a computing system that implements a video decoder, comprising:

receiving and parsing one or more syntax elements that indicate a constraint on frame reordering delay, wherein the constraint on frame reordering delay is expressed as a maximum count of frames that can precede any frame of a video sequence in output order but follow that frame in coding order;

receiving encoded data for a plurality of frames of the video sequence;

with the video decoder, decoding at least some of the encoded data to reconstruct one of the plurality of frames; and

outputting the reconstructed frame.

2. The method of claim 1, further comprising:

determining the constraint on frame reordering delay based on the one or more syntax elements; and

using the constraint on frame reordering delay to determine when the reconstructed frame is ready for output in terms of the output order of the plurality of frames of the video sequence.

3. The method of claim 2, wherein the plurality of frames of the video sequence are organized according to a temporal hierarchy, wherein different syntax elements indicate different constraints on frame reordering delay for different temporal layers of the temporal hierarchy, the method further comprising selecting one of the different constraints on frame reordering delay according to the temporal resolution of the output.

4. The method of claim 1, wherein the constraint on frame reordering delay is defined as a maximum difference between the coding order and the output order of any frame in the video sequence.

5. The method of claim 1, wherein the one or more syntax elements and the encoded data are signaled as part of the syntax of a coded video bitstream, the method further comprising:

receiving and parsing a buffer size syntax element that indicates a maximum size of a decoded picture buffer, wherein the buffer size syntax element is different from the one or more syntax elements that indicate the constraint on frame reordering delay.

6. The method of claim 1, wherein the one or more syntax elements are signaled as part of a sequence parameter set, a picture parameter set, a syntax for a media storage file that also contains the encoded data, a syntax for a media transport stream that also contains the encoded data, a media feature negotiation protocol, media system information multiplexed with the encoded data, or media metadata related to the encoded data.

7. The method of claim 1, further comprising:

receiving a flag that indicates the presence or absence of the one or more syntax elements, wherein if the flag indicates that the one or more syntax elements are absent, then the constraint on frame reordering delay is undefined or has a default value.

8. The method of claim 1, wherein:

one possible value of the one or more syntax elements indicates that the constraint on frame reordering delay is undefined or has a default value, and other possible values of the one or more syntax elements indicate an integer count for the constraint on frame reordering delay.

9. The method of claim 1, wherein:

a value of the one or more syntax elements indicates an integer count for the constraint on frame reordering delay relative to a maximum size of frame memory used for reordering, the maximum size being indicated by a different syntax element.

10. The method of claim 9, wherein:

the constraint on frame reordering delay is determinable as the maximum size of the frame memory used for reordering, expressed as a maximum count, plus the integer count for the constraint on frame reordering delay, minus one.

11. A method in a computing system, comprising:

setting one or more syntax elements that indicate a constraint on frame reordering delay, the constraint on frame reordering delay being consistent with inter-frame dependencies between a plurality of frames of a video sequence, wherein the constraint on frame reordering delay is expressed as a maximum count of frames that can precede any frame of the video sequence in output order but follow that frame in coding order; and

outputting the one or more syntax elements to facilitate determination of when a reconstructed frame is ready for output in terms of the output order of the plurality of frames.

12. The method of claim 11, wherein the computing system implements a video encoder, the method further comprising:

receiving the plurality of frames of the video sequence;

with the video encoder, encoding the plurality of frames to produce encoded data, wherein the encoding uses inter-frame dependencies that are consistent with the constraint on frame reordering delay; and

outputting the encoded data for storage or transmission.

13. The method of claim 11, wherein the one or more syntax elements and the encoded data are output as part of the syntax of a coded video bitstream, the method further comprising:

outputting a buffer size syntax element that indicates a maximum size of a decoded picture buffer, wherein the buffer size syntax element is different from the one or more syntax elements that indicate the constraint on frame reordering delay.

14. The method of claim 11, wherein the one or more syntax elements are output as part of a sequence parameter set, a picture parameter set, a syntax for a media storage file that also contains encoded data of the plurality of frames, a syntax for a media transport stream that also contains encoded data of the plurality of frames, a media feature negotiation protocol, media system information multiplexed with the encoded data of the plurality of frames, or media metadata relating to the encoded data of the plurality of frames.

15. The method of claim 11, further comprising:

outputting a flag that indicates the presence or absence of the one or more syntax elements, wherein if the flag indicates that the one or more syntax elements are absent, then the constraint on frame reordering delay is undefined or has a default value.

16. The method of claim 11, wherein:

17. The method of claim 11, wherein:

18. The method of claim 17, wherein:

19. A video decoding system, comprising:

a memory configured to receive one or more syntax elements that indicate a constraint on frame reordering delay and to receive encoded data for a plurality of frames of a video sequence;

a decoder configured to:

determine the constraint on frame reordering delay based on the one or more syntax elements, wherein the constraint on frame reordering delay is expressed as a maximum count of frames that can precede a given frame in output order but follow the given frame in coding order; and

decode at least some of the encoded data to reconstruct one of the plurality of frames; and

a frame buffer configured to store the reconstructed frame, wherein the video decoding system is configured to use the constraint on frame reordering delay to determine when the reconstructed frame is ready for output in terms of the output order of the plurality of frames of the video sequence.

20. The video decoding system of claim 19, wherein the one or more syntax elements are signaled as part of a sequence parameter set or media metadata relating to the encoded data of the plurality of frames.

21. The video decoding system of claim 19, wherein:

a value of the one or more syntax elements indicates an integer count for the constraint on frame reordering delay relative to a maximum size of frame memory used for reordering.

22. The video decoding system according to claim 21, wherein:

23. A video encoding system, comprising:

a frame buffer configured to receive a plurality of frames of a video sequence;

an encoder configured to:

set one or more syntax elements that indicate a constraint on frame reordering delay, the constraint on frame reordering delay being consistent with inter-frame dependencies between the plurality of frames of the video sequence, wherein the constraint on frame reordering delay is expressed as a maximum count of frames that can precede any frame of the video sequence in output order but follow that frame in coding order; and

encode the plurality of frames to produce encoded data, wherein the encoding uses inter-frame dependencies that are consistent with the constraint on frame reordering delay; and

a memory configured to buffer the encoded data and the one or more syntax elements for output, thereby facilitating determination of when a reconstructed frame is ready for output in terms of the output order of the plurality of frames.

24. The video encoding system of claim 23, wherein the one or more syntax elements are signaled as part of a sequence parameter set or media metadata relating to the encoded data of the plurality of frames.

25. The video encoding system according to claim 23, wherein:

26. The video encoding system according to claim 25, wherein: