US20130235928A1 - Advanced coding techniques - Google Patents
- Publication number
- US20130235928A1 (application US13/652,311)
- Authority
- US
- United States
- Prior art keywords
- coding
- frames
- video sequence
- frame
- current frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals, including:
- H04N19/577—Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
- H04N19/107—Selection of coding mode or of prediction mode between spatial and temporal predictive coding, e.g. picture refresh
- H04N19/117—Filters, e.g. for pre-processing or post-processing
- H04N19/124—Quantisation
- H04N19/139—Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
- H04N19/149—Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
- H04N19/15—Data rate or code amount at the encoder output by monitoring actual compressed data size at the memory before deciding storage at the transmission buffer
- H04N19/154—Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
- H04N19/156—Availability of hardware or computational resources, e.g. encoding based on power-saving criteria
- H04N19/172—Adaptive coding characterised by the coding unit, the unit being an image region such as a picture, frame or field
- H04N19/192—Adaptive coding in which the adaptation method, adaptation tool or adaptation type is iterative or recursive
- H04N19/58—Motion compensation with long-term prediction, i.e. the reference frame for a current frame not being the temporally closest one
- H04N19/46—Embedding additional information in the video signal during the compression process
Definitions
- a video coder may code a source video sequence into a coded representation that has a smaller bit rate than the source video and thereby may achieve data compression.
- the video coder may code processed video data according to any of a variety of different coding techniques to achieve compression.
- One common technique for data compression uses predictive coding techniques (e.g., temporal/motion predictive coding). For example, some frames in a video stream may be coded independently (I-frames) and some other frames (e.g., P-frames or B-frames) may be coded using other frames as reference frames.
- P-frames may be coded with reference to a single previously coded frame (called a “reference frame”) and B-frames may be coded with reference to a pair of previously coded reference frames, typically one reference frame that occurs prior to the B-frame in display order and another that occurs subsequent to the B-frame in display order.
- the resulting compressed sequence (bit stream) may be transmitted to a decoder via a channel.
- the bit stream may be decompressed at the decoder by inverting the coding processes performed by the coder, yielding a recovered video sequence.
- a video coder may need to achieve a particular target compression ratio based on factors such as network bandwidth.
- certain frames of a video sequence may be coded at higher compression than other frames in the video sequence.
- the higher the compression, the lower the resulting image quality. Consequently, frames coded with relatively high compression may have a lower visual quality than adjacent frames, leading to sudden changes in visual quality in the video sequence. Therefore, designers of video coding systems endeavor to provide coding systems that maintain smooth transitions in the visual quality of video.
- FIG. 1 is a simplified block diagram of a video coding system according to an embodiment of the present invention.
- FIG. 2 is a functional block diagram of a video coding system according to an embodiment of the present invention.
- FIG. 3 is a simplified block diagram of a video coding system of another embodiment of the present invention.
- FIG. 4 illustrates a method to iteratively code a video sequence to achieve a target bitrate according to an embodiment of the present invention.
- FIG. 5 illustrates a method to code a video sequence based on information generated on a previous coding pass according to an embodiment.
- FIG. 6 illustrates a method to estimate a file size of a video sequence according to an embodiment.
- FIG. 7 illustrates a video sequence including a selected random access picture (RAP) frame, according to an embodiment.
- FIG. 8 illustrates a method to select a RAP frame according to an embodiment.
- Embodiments of the present invention provide techniques for efficiently coding/decoding video data during circumstances when constraints are imposed on the video data.
- coding parameters of a video sequence may be selected based on a target bit rate.
- the video sequence may be predictively coded based on the parameters. If the target bit rate is not achieved, regions of the video sequence with high bit rates may be identified, a filtering strength applied to the identified regions may be increased, and the video sequence may be predictively coded with the increased filtering strength.
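The select-code-check loop described above can be sketched in outline. This is a hypothetical illustration: the function names and the simple model in which a frame's bit cost shrinks as its pre-filter strength grows are assumptions, not the patent's implementation.

```python
def code_sequence(frames, filter_strength):
    # Stand-in for a real coding pass: per-frame bit cost falls as the
    # pre-filter strength applied to that frame rises (assumed model).
    return [max(1, int(c / (1 + filter_strength[i])))
            for i, c in enumerate(frames)]

def iterative_code(frames, target_bits, max_passes=10):
    strength = [0.0] * len(frames)            # initial filtering parameters
    sizes = code_sequence(frames, strength)
    for _ in range(max_passes):
        sizes = code_sequence(frames, strength)
        if sum(sizes) <= target_bits:         # target bit rate achieved
            break
        avg = sum(sizes) / len(sizes)
        for i, s in enumerate(sizes):         # regions with high bit rates
            if s > avg:
                strength[i] += 0.5            # increase filtering strength there
    return sizes, strength

sizes, strength = iterative_code([100, 400, 120], target_bits=400)
```

With these toy inputs, only the middle frame is repeatedly flagged as a high-bit-rate region, so only its filtering strength is raised until the total meets the target.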
- a video sequence on a first coding pass, may be coded based on a first set of coding parameters. Values of a characteristic of frames from the video sequence may be stored during the first coding pass. Frames which violate a constraint imposed on the characteristic based on the stored values may be identified. Target characteristic values for the frames from the video sequence may be determined. The target characteristic values may be lower than the constraint. A second set of coding parameters to achieve the target characteristic values may be computed. On a second pass, the video sequence may be coded based on the second set of coding parameters.
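The two-pass scheme above, taking bits per frame as the stored characteristic and a per-frame bit cap as the constraint, might look like the following sketch. The inverse-linear bits-versus-QP model and all names are illustrative assumptions, not the patent's method.

```python
def first_pass(complexities, qp):
    # Pass 1: code with one QP, storing per-frame bit counts
    # (bits ~ complexity / QP is an assumed model).
    return [c // qp for c in complexities]

def second_pass_qps(bits, cap, base_qp, margin=0.9):
    # Identify frames whose bit count violates the cap, then compute QPs
    # targeting a characteristic value below the constraint.
    qps = []
    for b in bits:
        if b > cap:
            target = margin * cap             # target below the constraint
            qps.append(base_qp * b / target)  # invert bits ~ k / QP
        else:
            qps.append(base_qp)
    return qps

complexities = [1000, 5000, 1200]
bits1 = first_pass(complexities, qp=10)            # stored on pass 1
qps = second_pass_qps(bits1, cap=300, base_qp=10)  # second parameter set
bits2 = [int(c / q) for c, q in zip(complexities, qps)]
```

Only the middle frame violates the cap on the first pass, so only its QP is raised for the second pass.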
- perceptual model values may be determined from a video sequence.
- An index into a matrix may be computed based on the perceptual model values.
- the matrix may store associations between parameter range(s) and file sizes.
- a file size may be retrieved from the matrix corresponding to the computed index.
- the video sequence may be predictively coded with parameters associated with the file size.
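One way to read the matrix-lookup idea above is as a quantized table keyed by an average perceptual (masking) score. The bounds, file sizes, and names below are invented placeholder values for illustration only.

```python
import bisect

# Assumed table: upper bounds on an average masking score, and the expected
# coded file size (kB) for parameters tuned to that masking level.
MASKING_BOUNDS = [0.2, 0.5, 0.8, 1.0]
FILE_SIZES_KB = [900, 600, 400, 250]   # stronger masking -> smaller file

def estimate_file_size(perceptual_values):
    avg = sum(perceptual_values) / len(perceptual_values)
    idx = bisect.bisect_left(MASKING_BOUNDS, avg)   # index into the matrix
    idx = min(idx, len(FILE_SIZES_KB) - 1)
    return FILE_SIZES_KB[idx]
```

For example, frames with masking scores around 0.7 land in the third bucket of this toy table.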
- a motion compensated error energy of a current frame of a video sequence may be computed.
- a weighted average motion compensated error energy of frames successive to the current frame in coding order may be computed. If the motion compensated error energy of the current frame is higher than the weighted average motion compensated error energy of the successive frames by a first factor, a weighted average motion compensated error energy of the current frame and frames adjacent to the current frame may be computed. If the weighted average motion compensated error energy of the current frame and the adjacent frames is higher than the weighted average motion compensated error energy of the successive frames by a second factor, the current frame may be marked as a random access picture (RAP) frame. The current frame may be predictively coded.
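The RAP-selection test above can be sketched as a small predicate. The lookahead window, the weighting scheme, and the two factors are assumed values, and `energies` stands in for per-frame motion compensated error energy.

```python
def weighted_avg(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def is_rap(energies, i, factor1=2.0, factor2=1.5, lookahead=4):
    succ = energies[i + 1 : i + 1 + lookahead]   # successors in coding order
    if not succ:
        return False
    w = list(range(len(succ), 0, -1))            # nearer frames weigh more (assumed)
    succ_avg = weighted_avg(succ, w)
    if energies[i] <= factor1 * succ_avg:        # current frame is not a spike
        return False
    neigh = energies[max(0, i - 1) : i + 2]      # current frame and its neighbors
    neigh_avg = weighted_avg(neigh, [1.0] * len(neigh))
    return neigh_avg > factor2 * succ_avg        # sustained change -> mark as RAP
```

In the toy sequence below, the spike at index 2 (e.g., a scene change) is flagged while its quiet predecessor is not.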
- a frame from a video sequence may be marked as a delayed decoder refresh frame.
- Frames successive to the delayed decoder refresh frame in coding order may be predictively coded without reference to frames preceding the delayed decoder refresh frame in coding order.
- the distance between the delayed decoder refresh frame and the successive frames may exceed a distance threshold.
- frames successive to a current frame in decoding order may be decoded without reference to frames preceding the current frame in decoding order.
- the distance between the current frame and the successive frames may exceed a distance threshold.
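The reference restriction around a delayed decoder refresh (DDR) frame described above can be expressed as a small check. The function and its argument names are hypothetical, introduced only for illustration.

```python
def allowed_references(frame_idx, ddr_idx, distance):
    # Candidate references: any previously coded frame.
    refs = set(range(frame_idx))
    # Once the frame is at least `distance` frames past the DDR frame, it
    # may no longer reference anything preceding the DDR frame.
    if frame_idx - ddr_idx >= distance:
        refs = {r for r in refs if r >= ddr_idx}
    return refs
```

With the DDR frame at index 5 and a distance threshold of 2, frame 6 may still reach back before the refresh point, but frame 7 may not.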
- FIG. 1 is a simplified block diagram of a video coding system 100 according to an embodiment of the present invention.
- the system 100 may include at least two terminals 110 - 120 interconnected via a network 150 .
- a first terminal 110 may code video data at a local location for transmission to the other terminal 120 via the network 150 .
- the second terminal 120 may receive the coded video data of the other terminal from the network 150 , decode the coded data and display the recovered video data.
- Unidirectional data transmission is common in media serving applications and the like.
- FIG. 1 illustrates a second pair of terminals 130 , 140 provided to support bidirectional transmission of coded video that may occur, for example, during videoconferencing.
- each terminal 130 , 140 may code video data captured at a local location for transmission to the other terminal via the network 150 .
- Each terminal 130 , 140 also may receive the coded video data transmitted by the other terminal, may decode the coded data and may display the recovered video data at a local display device.
- the terminals 110 - 140 are illustrated as servers, personal computers and smart phones but the principles of the present invention are not so limited. Embodiments of the present invention find application with laptop computers, tablet computers, media players and/or dedicated video conferencing equipment.
- the network 150 represents any number of networks that convey coded video data among the terminals 110 - 140 , including, for example, wireline and/or wireless communication networks.
- the communication network 150 may exchange data in circuit-switched and/or packet-switched channels.
- Representative networks include telecommunications networks, local area networks, wide area networks and/or the Internet. For the purposes of the present discussion, the architecture and topology of the network 150 are immaterial to the operation of the present invention unless explained hereinbelow.
- FIG. 2 is a functional block diagram of a video coding system 200 according to an embodiment of the present invention.
- the system 200 may include a video source 210 that provides video data to be coded by the system 200 , a pre-processor 220 , a video coder 230 , a transmitter 240 and a controller 250 to manage operation of the system 200 .
- the video source 210 may provide video to be coded by the rest of the system 200 .
- the video source 210 may be a storage device storing previously prepared video.
- the video source 210 may be a camera that captures local image information as a video sequence.
- Video data typically is provided as a plurality of individual frames that impart motion when viewed in sequence. The frames themselves typically are organized as a spatial array of pixels.
- the pre-processor 220 may perform various analytical and signal conditioning operations on video data.
- the pre-processor 220 may parse input frames into color components (for example, luminance and chrominance components) and also may parse the frames into pixel blocks, spatial arrays of pixel data, which may form the basis of further coding.
- the pre-processor 220 also may apply various filtering operations to the frame data to improve efficiency of coding operations applied by a video coder 230 .
- the video coder 230 may perform coding operations on the video sequence to reduce the video sequence's bit rate.
- the video coder 230 may include a coding engine 232 , a local decoder 233 , a reference picture cache 234 , a predictor 235 and a controller 236 .
- the coding engine 232 may code the input video data by exploiting temporal and spatial redundancies in the video data and may generate a datastream of coded video data, which typically has a reduced bit rate as compared to the datastream of source video data.
- the video coder 230 may perform motion compensated predictive coding, which codes an input frame predictively with reference to one or more previously-coded frames from the video sequence that were designated as “reference frames.” In this manner, the coding engine 232 codes differences between pixel blocks of an input frame and pixel blocks of reference frame(s) that are selected as prediction reference(s) to the input frame.
- a video coder 230 may be provided as a multi-pass coder in which portions of the video sequence may be coded iteratively in a trial-and-error manner using different coding parameters to achieve a target bit rate.
- the target bit rate represents a number of bits per unit time. Results of one coding pass may be exchanged with the pre-processor 220 to improve results of a subsequent coding pass.
- the local decoder 233 may decode coded video data of frames that are designated as reference frames. Operations of the coding engine 232 typically are lossy processes. When the coded video data is decoded at a video decoder (not shown in FIG. 2 ), the recovered video sequence typically is a replica of the source video sequence with some errors.
- the local decoder 233 replicates decoding processes that will be performed by the video decoder on reference frames and may cause reconstructed reference frames to be stored in the reference picture cache 234 . In this manner, the system 200 may store local copies of reconstructed reference frames whose content matches (absent transmission errors) the reconstructed reference frames that will be obtained by a far-end video decoder.
- the predictor 235 may perform prediction searches for the coding engine 232 . That is, for a new frame to be coded, the predictor 235 may search the reference picture cache 234 for image data that may serve as an appropriate prediction reference for the new frame. The predictor 235 may operate on a pixel block-by-pixel block basis to find appropriate prediction references. In some cases, as determined by search results obtained by the predictor 235 , an input frame may have prediction references drawn from multiple frames stored in the reference picture cache 234 .
- the controller 236 may manage coding operations of the video coder 230 , including, for example, selection of coding parameters to meet a target bit rate of coded video.
- video coders operate according to constraints imposed by bit rate requirements, quality requirements and/or error resiliency policies; the controller 236 may select coding parameters for frames of the video sequence in order to meet these constraints.
- the controller 236 may assign coding modes and/or quantization parameters to frames and/or pixel blocks within frames.
- the transmitter 240 may buffer coded video data to prepare it for transmission to the far-end terminal (not shown).
- the transmitter 240 may merge coded video data from the video coder 230 with other data to be transmitted to the terminal, for example, coded audio data and/or ancillary data streams (sources not shown).
- the controller 250 may manage operation of the system 200 .
- the controller 250 may assign to each frame a certain frame type (either of its own accord or in cooperation with the controller 236 ), which can affect the coding techniques that are applied to the respective frame. For example, frames often are assigned as one of the following frame types: an intra frame (I frame), coded without reference to any other frame; a predictive frame (P frame), coded with reference to at most one previously coded frame; and a bidirectionally predictive frame (B frame), coded with reference to up to two previously coded frames.
- Frames commonly are parsed spatially into a plurality of pixel blocks (for example, blocks of 4×4, 8×8 or 16×16 pixels each) and coded on a pixel block-by-pixel block basis.
- Pixel blocks may be coded predictively with reference to other coded pixel blocks as determined by the coding assignment applied to the pixel blocks' respective frames.
- pixel blocks of I frames can be coded non-predictively or they may be coded predictively with reference to pixel blocks of the same frame (spatial prediction).
- Pixel blocks of P frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one previously coded reference frame.
- Pixel blocks of B frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one or two previously coded reference frames.
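The per-frame-type prediction rules in the three bullets above can be captured in a small table; the mode labels are illustrative shorthand, not codec syntax.

```python
# Which prediction modes each frame type permits for its pixel blocks
# (per the description above; labels are invented for this sketch).
ALLOWED_MODES = {
    "I": {"non-predictive", "spatial"},
    "P": {"non-predictive", "spatial", "temporal-1ref"},
    "B": {"non-predictive", "spatial", "temporal-1ref", "temporal-2ref"},
}

def mode_is_valid(frame_type, mode):
    return mode in ALLOWED_MODES[frame_type]
```

For instance, bidirectional (two-reference) temporal prediction is valid only within B frames.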
- FIG. 3 is a simplified block diagram of a video coding system 300 of another embodiment of the present invention, illustrating the operation of pixel-block coding operations.
- the system 300 may include a pre-processor 310 , a block-based coder 320 , a reference frame decoder 330 , a reference picture cache 340 , a predictor 350 , a transmit buffer 360 and a controller 370 .
- the block-based coder 320 may include a subtractor 321 , a transform unit 322 , a quantizer 323 and an entropy coder 324 .
- the subtractor 321 may generate data representing a difference between the source pixel block and a reference pixel block developed for prediction.
- the subtractor 321 may operate on a pixel-by-pixel basis, developing residuals at each pixel position over the pixel block.
- Non-predictively coded blocks may be coded without comparison to reference pixel blocks, in which case the pixel residuals are the same as the source pixel data.
- the coder 320 may be provided as a multi-pass coder in which portions of the video sequence may be coded iteratively in a trial-and-error manner using different coding parameters to achieve a target bit rate. Results of one coding pass may be exchanged with the pre-processor 310 to improve results of a subsequent coding pass.
- the transform unit 322 may convert the source pixel block data to an array of transform coefficients, such as by a discrete cosine transform (DCT) process or a wavelet transform.
- the quantizer unit 323 may quantize (divide) the transform coefficients obtained from the transform unit 322 by a quantization parameter QP.
- the entropy coder 324 may code quantized coefficient data by run-value coding, run-length coding or the like. Data from the entropy coder may be output to the channel as coded video data of the pixel block.
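The subtract-transform-quantize-entropy path of the block-based coder 320 can be illustrated end to end on toy data. A real coder applies a 2-D transform to pixel blocks; here a tiny unnormalized 1-D DCT-II keeps the sketch self-contained, and the run-length scheme is a simplification of the entropy coder.

```python
import math

def dct_1d(x):
    # Unnormalized 1-D DCT-II (a real coder uses a 2-D block transform).
    n = len(x)
    return [sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, v in enumerate(x)) for k in range(n)]

def quantize(coeffs, qp):
    return [round(c / qp) for c in coeffs]     # divide by quantization parameter

def run_length(levels):
    out, run = [], 0
    for v in levels:
        if v == 0:
            run += 1
        else:
            out.append((run, v))               # (preceding zero run, value)
            run = 0
    return out

source = [52, 55, 61, 66]
reference = [50, 54, 60, 65]                   # predictor's reference data
residual = [s - r for s, r in zip(source, reference)]  # subtractor 321
coded = run_length(quantize(dct_1d(residual), qp=2))   # units 322-324
```

The small residual concentrates its energy in the lowest-frequency coefficient, so after quantization only one nonzero value survives to be entropy coded.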
- the reference frame decoder 330 may decode pixel blocks of reference frames and assemble decoded data for such reference frames. Decoded reference frames may be stored in the reference picture cache 340 .
- the predictor 350 may generate and output prediction blocks to the subtractor 321 .
- the predictor 350 also may output metadata identifying type(s) of predictions performed.
- the predictor 350 may search among the reference picture cache for pixel block data of previously coded and decoded frames that exhibits strong correlation with the source pixel block.
- the predictor 350 finds an appropriate prediction reference for the source pixel block, it may generate motion vector data that is output to the decoder as part of the coded video data stream.
- the predictor 350 may retrieve a reference pixel block from the reference cache that corresponds to the motion vector and may output it to the subtractor 321 .
- the predictor 350 may search among the previously coded and decoded pixel blocks of the same frame being coded for pixel block data that exhibits strong correlation with the source pixel block. Operation of the predictor 350 may be constrained by a mode selection provided by the controller 370 . For example, if the controller selects an intra-coding mode for application to a frame, the predictor 350 will be constrained to use intra-coding techniques. If the controller selects an inter-prediction mode for the frame, the predictor may select among inter-coding modes and intra-coding modes depending upon results of its searches.
- a transmit buffer 360 may accumulate metadata representing pixel block coding order, coded pixel block data and metadata representing coding parameters applied to the coded pixel blocks.
- the metadata can include prediction modes, motion vectors and quantization parameters applied during coding. Accumulated data may be formatted and transmitted to the channel.
- a controller 370 may manage coding of the source video, including selection of a coding mode for use by the predictor 350 and selection of quantization parameters to be applied to pixel blocks.
- FIG. 4 illustrates a method 400 to iteratively code a video sequence to achieve a target bitrate according to an embodiment of the present invention.
- a pre-processor may filter a portion of a source video sequence according to an initial set of coding parameters (box 410 ).
- a video coder may select coding parameters for the sequence portion based on an estimate of the target bit rate (box 420 ).
- the video coder may code the sequence portion according to the coding parameters (box 430 ).
- the video coder may then determine whether the coded video data obtained from coding satisfies the target bit rate (box 440 ).
- If the target bit rate is satisfied, coding of the current portion of the sequence may conclude (box 490 ), and the coding operation may advance to another portion of the sequence, if available. Otherwise, the video coder may identify regions of the coded sequence that generate high bit rates (box 450 ). In response to the identified regions, the pre-processor may increase the filtering strengths applied to those regions in the source data (box 460 ) and operation may return to box 420 for another coding pass.
- the initial set of coding parameters utilized by the pre-processor may be predetermined, may be a default set of parameters, or they may be derived from an analysis of the source video performed by the pre-processor and related controls.
- the coding parameters selected by the video coder may involve selections of prediction mode for frames within the sequence portion and quantization parameters applied to pixel blocks within the frames.
- Filtering and coding operations performed by method 400 may vary among different regions of the video sequence. Typically, due to correlation among frames, the identified regions will persist across a common spatial region of multiple frames in a portion of the video sequence being coded. Accordingly, recursive operations of the video coder and pre-processor may be performed on a single frame of the video or they may be performed on a set of several frames (say, 10 frames) of the video sequence.
- the video coder may adjust target bit rates of frames that contain such regions at the expense of frames that do not (box 470 ).
- the video coder may select coding parameters corresponding to each frame's target bit rate. For example, the video coder may select relatively higher quantization parameters for frames that had high bit rate regions in prior passes, which tends to reduce the bit rates of such frames at the expense of lower coding quality.
- the sequence may be governed by a coding policy that imposes constraints on certain characteristics of the video sequence. For example, a constraint may limit a bit rate over the video sequence. Another constraint may limit the bit rate over a fixed window of frames. In another example, a constraint may define a minimum threshold on the visual quality of a window of frames.
- the video coder may adjust coding parameters such as a quantization parameter (QP) and mode selection. Specifically, the video coder may react to constraint breakages as they occur and manage the parameter through models which map QPs to characteristic levels.
- For constraints on data rate, a QP-to-bits mapping model may be utilized; for constraints on visual quality, a QP-to-peak-signal-to-noise-ratio (QP-to-PSNR) mapping model may be utilized.
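A QP-to-bits model of this kind can be very simple. The sketch below assumes the common H.264-style behavior in which the quantizer step doubles every 6 QP, so coded size roughly halves; the exact relationship in a real coder is content-dependent and would be fit adaptively:

```python
import math

def qp_to_bits(qp, ref_qp, ref_bits):
    """Predicted coded size at `qp`, given an observed (ref_qp, ref_bits)
    point from a previous pass.  Hypothetical halve-every-6-QP model."""
    return ref_bits * 2.0 ** ((ref_qp - qp) / 6.0)

def qp_for_target_bits(target_bits, ref_qp, ref_bits):
    """Invert the model to pick a QP expected to hit `target_bits`."""
    return ref_qp + 6.0 * math.log2(ref_bits / target_bits)
```

Reacting to a constraint breakage then reduces to one model inversion instead of a blind re-encode at a guessed QP.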
- FIG. 5 illustrates a method 500 to code a video sequence based on information generated on a previous coding pass according to an embodiment.
- a video coder may code, on a first coding pass, a video sequence based on an initial set of coding parameters (box 510 ).
- a controller controlling the operations of the video coder may compute revised parameters for the video coder based on the first coding pass (box 520 ).
- the video coder may then re-code the video sequence again utilizing the revised parameters (box 530 ).
- attributes of frames affecting the coding parameters may be stored. Based on the stored attributes, frames which violate constraints may be identified. Then, the coding parameters for a subsequent coding iteration may be adjusted so that parameter changes from one frame to another are gradual, while simultaneously ensuring that constraints are not violated for any of the frames in the video sequence. For example, when a portion of the video sequence that violates one or more constraints is identified, a window of support may be developed to the beginning of a scene in which the constraint is violated when shaping a characteristic curve of the frames in that window. Thus, coding parameters may be adjusted smoothly to avoid sudden changes in visual quality within a scene.
- the characteristic values resulting from the first coding may be stored and transformed to form a list (called, for example, “targetCharacteristicCurveArray”) of desired characteristic values that satisfy the constraint.
- a second coding of the sequence of frames may generate (within a tolerance) the desired characteristic values as stored in targetCharacteristicCurveArray.
- the values in targetCharacteristicCurveArray for coded frames may be updated with the actual characteristic values, and values in targetCharacteristicCurveArray for future frames may be adjusted to reflect the actual values that are generated.
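The second-pass bookkeeping described above may be sketched as follows. The name targetCharacteristicCurveArray comes from the text; the redistribution policy (spread any surplus or deficit evenly over future frames) is an assumption, since the text does not fix one:

```python
def update_target_curve(curve, i, actual):
    """Record the characteristic value actually produced for frame i
    and adjust the entries for future frames so the sequence-level
    target is preserved (`curve` plays the role of
    targetCharacteristicCurveArray)."""
    surplus = curve[i] - actual          # positive: frame came in under target
    curve[i] = actual
    remaining = len(curve) - (i + 1)
    if remaining:
        share = surplus / remaining
        for j in range(i + 1, len(curve)):
            curve[j] += share            # hand the surplus to future frames
    return curve
```

Because the total across the curve is invariant under the update, the sequence-level constraint checked against the curve stays satisfied as actual values replace targets.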
- a data rate constraint may be expressed as a pair of values: maximum data rate and fixed length window of presentation times over which to compute the data rate.
- a data rate constraint may be inferred from a model such as the Hypothetical Reference Decoder.
- a visual quality constraint may impose a minimum visual quality (given some metric, such as PSNR) over a specified set of frames.
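The two constraint types above can be checked mechanically. This sketch (names hypothetical) flags frames involved in a windowed data-rate violation or falling below a PSNR floor:

```python
def constraint_violations(frame_bits, frame_psnr,
                          max_window_bits, window, min_psnr):
    """Return indices of frames that participate in a violation of
    either constraint: a data-rate cap over every fixed-length window
    of frames, or a per-frame minimum visual quality (PSNR)."""
    bad = set()
    for i in range(len(frame_bits) - window + 1):
        if sum(frame_bits[i:i + window]) > max_window_bits:
            bad.update(range(i, i + window))        # whole window implicated
    bad.update(i for i, q in enumerate(frame_psnr) if q < min_psnr)
    return sorted(bad)
```

The output is exactly the set of frames whose coding parameters a subsequent pass would revisit.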
- a constraint may be the decoder complexity required to decode a video sequence. Another constraint may be the amount of heat dissipated by a decoder while decoding a video sequence (decoder thermal generation). A constraint may be the amount of energy utilized by a decoder to decode a video sequence (decoder power usage or battery drainage). Another constraint may be the visual quality of a video sequence in dark scenes. Still another constraint may be the quality degradation through visual masking.
- the set of frames over which a constraint is imposed need not be successive frames and can be imposed over all frames with a perceptual model score in a particular range, all frames with average luma in some range, etc.
- the perceptual model score may be based on the visual quality and/or video complexity of a video sequence.
- the perceptual model score may be computed from spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor.
- the transformation applied to characteristic values resulting from the first coding pass may include scaling the values by a fixed constant. For example, for a data rate constraint, if a set of frames totals A bits and is greater than a specified maximum of B bits, individual frame sizes can be scaled by B/A. Further, the scaling factor applied to each frame may be modulated by something as simple as the relative size of the frame; it may also be modulated by the perceptual significance of the frame.
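The B/A scaling above can be sketched directly; the perceptually modulated variant shown (less significant frames absorb more of the excess) is one possible scheme, not the patent's:

```python
def scale_frame_sizes(frame_bits, max_bits, significance=None):
    """Uniform B/A scaling of per-frame sizes, optionally modulated by
    hypothetical per-frame perceptual significance weights in [0, 1]."""
    A = float(sum(frame_bits))
    if A <= max_bits:
        return list(frame_bits)                     # constraint already met
    if significance is None:
        return [b * max_bits / A for b in frame_bits]   # scale by B/A
    # Remove the excess A - B in proportion to (1 - significance) * size,
    # so perceptually important frames give up fewer of their bits.
    weights = [(1.0 - s) * b for s, b in zip(significance, frame_bits)]
    total_w = sum(weights)
    excess = A - max_bits
    return [b - excess * w / total_w
            for b, w in zip(frame_bits, weights)]
```

Both variants return targets that total exactly B, so the transformed curve satisfies the data rate constraint by construction.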
- the transformation may be computed jointly or sequentially. For example, a set of frames may violate both data rate and visual quality constraints.
- a sequential transformation may update the targetCharacteristicCurveArray based on the data rate constraint and then on the visual quality constraint.
- a joint transformation may update the targetCharacteristicCurveArray on both constraints simultaneously, for example, by targeting a weighted average of both measures.
- the method 500 may be performed on portions of a video sequence rather than the entirety of a video sequence and may operate at various granularities such as a sequence of frames belonging to a common scene, a group of dependent frames, short clips, single frames, slices, and coding units (pixel blocks).
- a number of coding strategies can be employed to minimize the number of encodes for any frame, including the use of an adaptive QP-to-characteristic model.
- QP-to-characteristic models may include QP-to-bits and QP-to-PSNR as explained above.
- analytical operations performed for curve shaping may be performed by a pre-processor.
- Management of the targetCharacteristicCurveArray and selection of coding parameters may be performed by controller(s) within the system.
- a video coder may receive a data rate value or a quality level as a control input along with a video sequence to determine the size of the output coded bitstream.
- the data rate may be specified without regard to the content of the video; likewise, often, the quality level may be specified without regard to the resulting data rate.
- FIG. 6 illustrates a method 600 to estimate a file size of a video sequence according to an embodiment.
- the method 600 may scan the video sequence and compute perceptual model values therefrom (box 610 ), based on, for example, spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor from content of the video sequence. Based on the computed values, the method 600 may develop an index into a Data Rate/Quality matrix, taking into account the resolution and the frame rate of the video (box 620 ). The method 600 may then retrieve a file size estimate based on the information in the matrix (box 630 ).
- Perceptual model values may be developed in a variety of ways. For example, a single number may be distilled from a number of values. In an embodiment, a single weighted value may be computed from, for example, spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor from content of the video sequence.
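Boxes 610-630 of method 600 can be sketched as a weighted distillation followed by a table lookup. All bin edges and matrix contents here are hypothetical stand-ins for training-derived data:

```python
import numpy as np

def estimate_file_size(masking_vals, weights, score_bins,
                       quality, quality_bins, matrix):
    """Distill spatial/temporal/spatiotemporal masking values into one
    weighted perceptual score (box 610), map the score and requested
    quality level to indices into a precomputed Data Rate/Quality
    matrix (box 620), and read out the stored file-size estimate
    (box 630)."""
    score = float(np.dot(weights, masking_vals))    # box 610
    row = int(np.digitize(score, score_bins))       # box 620
    col = int(np.digitize(quality, quality_bins))
    return matrix[row][col]                         # box 630
```

A real deployment would carry one matrix per (resolution, frame rate) pair, per the text; the lookup itself is unchanged.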
- the method 600 may be performed by a coder controller, which also may store the Data Rate/Quality matrix.
- the Data Rate/Quality matrix may store values representing optimum file sizes for video along a multi-dimensional parameter range.
- An example of a multi-dimensional parameter range may be a data rate range and a quality range, where the Data Rate/Quality matrix is dependent on the resolution, duration and frame rate of the video.
- the Data Rate/Quality matrix may also store file size values based on thermal output range, power utilization range, and decoder complexity range.
- file size values stored in the matrix may be derived from operation of similar coders on other training sequences having their own perceptual model values, resolution, duration and frame rate and the file sizes generated by those coders.
- the optimum file size may minimize the weighted sum of a quality degradation value and the resulting bit rate.
- the quality degradation value could include metrics for visual quality through a perceptual model over a noiseless channel and for a number of noisy channels.
- the optimum file size may minimize the weighted sum of the resulting decoder complexity and the resulting decoder power/thermal output.
- the coder may search for the optimum coding by sampling the specified Data Rate/Quality space.
- the method 600 may be performed on portions of a video sequence rather than the entirety of a video sequence and may operate at various granularities such as short clips, single frames, slices, and coding units (pixel blocks). For example, in the case of video capture, buffered frames can be processed to calculate the optimum file size.
- a quality level that is predefined or calculated based on the history can be used to avoid spending excessive bits in frames that already reach acceptable quality.
- the bits saved can be used in later frames that require more bits to reach the quality level, or not used at all such that the bit rate of the entire stream can be reduced.
- Random access pictures (RAPs) represent another coding constraint of a system.
- RAPs facilitate random access within a bitstream.
- When RAPs are placed where no natural scene change exists, they incur bit rate spikes and often cause sudden changes in the visual quality of a scene (visual flashes). Inserting RAPs in the middle of scenes may be inefficient because RAPs are often coded as Instantaneous Decoder Refresh (IDR) frames.
- An IDR is a type of I-frame that forces the decoder to refresh its state immediately, guaranteeing that no state prior (in decode order) to the IDR is necessary for decoding the IDR or any frame subsequent to it. This break in coding dependency causes the aforementioned bit rate spike in coding and visual flashes when the coded video is decoded and displayed.
- a frame may be identified as a RAP frame based on relative motion masking.
- FIG. 7 illustrates a video sequence 700 including a selected RAP frame 710 , according to an embodiment.
- Frame 710 from the video sequence 700 may be selected as a RAP frame if the frames 720 before frame 710 in coding order have a relatively high motion masking and the frames 730 after frame 710 in coding order have a relatively low motion masking.
- Motion masking may be computed as a weighted average over the video segment of motion compensated error energy (MCEE).
- the MCEE represents the amount of pixel changes between frames, along motion trajectories (motion vectors).
- the MCEE may be computed, for example, as a sum of absolute differences (SAD), sum of squared differences (SSD), etc., between successive frames in that video segment.
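The MCEE measures named above reduce to simple array arithmetic. In this sketch the motion-compensated prediction is taken as given (produced elsewhere by motion estimation), and the weighting scheme of the average is an assumption:

```python
import numpy as np

def mcee(frame, mc_prediction, metric="sad"):
    """Motion compensated error energy between a frame and its
    motion-compensated prediction, as SAD or SSD."""
    diff = frame.astype(np.int64) - mc_prediction.astype(np.int64)
    return float(np.abs(diff).sum() if metric == "sad"
                 else (diff * diff).sum())

def wmcee(mcees, weights):
    """Weighted average MCEE over the frames of a video segment."""
    return float(np.dot(weights, mcees) / np.sum(weights))
```

SAD is cheaper; SSD penalizes large localized changes more heavily, which can matter when deciding whether a segment truly masks motion.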
- the motion masking levels may be determined by a pre-processor and/or a controller.
- FIG. 8 illustrates a method 800 to select a RAP frame according to an embodiment.
- the method 800 may compute the MCEE of the current frame and determine whether it exceeds a threshold (box 810 ). If it does, the method 800 may compute a weighted average of motion compensated error energy (WMCEE) of the frames successive to the current frame (box 820 ). If the MCEE of the current frame exceeds the WMCEE of the successive frames by a first factor (box 830 ), the method 800 may compute a WMCEE of the current frame and frames adjacent to the current frame (box 840 ). If the WMCEE of the current frame and adjacent frames exceeds the WMCEE of the successive frames by a second factor (box 850 ), the current frame may be selected as a RAP frame. Otherwise, the method 800 may determine whether the next frame qualifies as a RAP frame.
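The decision chain of boxes 810-850 can be sketched as below. The threshold and the two factors are tunable values the text leaves open, so the numbers in the test are purely illustrative:

```python
def is_rap(mcee_cur, wmcee_around_cur, wmcee_after,
           threshold, factor1, factor2):
    """RAP test in the spirit of method 800: the candidate frame must
    itself show large motion-compensated change (box 810), must exceed
    the following frames' weighted average by factor1 (box 830), and
    the segment around it must exceed the following segment by
    factor2 (box 850)."""
    if mcee_cur <= threshold:                       # box 810
        return False
    if mcee_cur <= factor1 * wmcee_after:           # box 830
        return False
    return wmcee_around_cur > factor2 * wmcee_after # box 850
```

The pattern being detected is exactly the one in FIG. 7: high motion masking before the candidate, low motion masking after it, so the refresh cost is perceptually hidden.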
- If multiple frames within a video sequence are identified as potential RAP frames by method 800 and the potential RAP frames are within a proximity threshold of each other, not all of them need to be selected as RAP frames.
- the last potential RAP frame (in decoding order) can be selected as a RAP frame.
- a subsampling of potential RAP frames may be selected as RAP frames.
- the highest motion masking video segment within a video sequence may be identified and a selected RAP frame may be inserted into that video segment.
- RAPs within a scene may be re-used to reduce the bit rate overhead.
- a re-usable RAP frame may be defined as a “Delayed Decoder Refresh” (DDR) frame.
- DDR Delayed Decoder Refresh
- a DDR frame may not force an immediate state refresh at a decoder but guarantees that state information from frames prior to the DDR frame (in decode order) is not necessary to decode the DDR frame itself or to decode frames subsequent to the DDR frame (in decode order) that are more than a specified number N_delay of frames from the DDR frame.
- the DDR frame may be used as a reference frame for the frames immediately after it (in decode order) and as a RAP for the frames N_delay+1 frames after it (in decode order).
- multiple delays (say, X of them) may be associated with a single DDR frame to indicate that the DDR frame may be used as a RAP for X+1 video segments.
- the frame at the beginning (in decode order) of each such segment may include the appropriate delay value, N_delay.
- setting N_delay to 0 may indicate that the DDR is to be inserted immediately as an IDR frame in the associated video segment.
- information pertaining to the DDR frame, such as the number N_delay, may be signaled in the channel data 260 ( FIG. 2 ) as part of the syntax defining a DDR frame.
- a frame may specify the DDR frame which it needs as a reference frame. This may be done by giving each DDR frame an identifier which may be signaled in the bitstream using a specified number of bits.
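The DDR guarantee can be expressed as a simple reference-validity check (indices in decode order; the function and parameter names are hypothetical):

```python
def may_reference(frame_idx, ref_idx, ddr_idx, n_delay):
    """True if a frame may use ref_idx as a prediction reference
    without breaking the DDR guarantee: frames more than n_delay
    frames past the DDR must not depend on any state prior to the
    DDR, so a decoder that tunes in at the DDR can decode them.
    With n_delay == 0 the DDR behaves as an IDR."""
    if frame_idx > ddr_idx + n_delay:
        return ref_idx >= ddr_idx        # pre-DDR state forbidden
    return True                          # inside the delay window: allowed
```

An encoder would enforce this check when building reference lists; a decoder joining at the DDR simply skips output until the delay window has elapsed.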
- video coders are provided as electronic devices. They can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on personal computers, notebook computers or computer servers.
- decoders can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors, or they can be embodied in computer programs that execute on personal computers, notebook computers or computer servers. Decoders commonly are packaged in consumer electronics devices, such as gaming systems, DVD players, portable media players and the like and they also can be packaged in consumer software applications such as video games, browser-based media players and the like.
Abstract
Embodiments of the present invention provide techniques for efficiently coding/decoding video data during circumstances when constraints are imposed on the video data. A frame from a video sequence may be marked as a delayed decoder refresh frame. Frames successive to the delayed decoder refresh frame in coding order may be predictively coded without reference to frames preceding the delayed decoder refresh frame in coding order. The distance between the delayed decoder refresh frame and the successive frames may exceed a distance threshold. Frames successive to a current frame in decoding order may be decoded without reference to frames preceding the current frame in decoding order. The distance between the current frame and the successive frames may exceed a distance threshold.
Description
- This application claims the benefit of priority afforded by provisional application Ser. No. 61/607,484, filed Mar. 6, 2012, entitled “Improvements in Video Preprocessors and Video Coders.”
- In video coder/decoder systems, a video coder may code a source video sequence into a coded representation that has a smaller bit rate than does the source video and, thereby may achieve data compression. The video coder may code processed video data according to any of a variety of different coding techniques to achieve compression. One common technique for data compression uses predictive coding techniques (e.g., temporal/motion predictive coding). For example, some frames in a video stream may be coded independently (I-frames) and some other frames (e.g., P-frames or B-frames) may be coded using other frames as reference frames. P-frames may be coded with reference to a single previously coded frame (called, a “reference frame”) and B-frames may be coded with reference to a pair of previously-coded reference frames, typically a reference frame that occurs prior to the B-frame in display order and another reference frame that occurs subsequently to the B-frame in display order. The resulting compressed sequence (bit stream) may be transmitted to a decoder via a channel. To recover the video data, the bit stream may be decompressed at the decoder by inverting the coding processes performed by the coder, yielding a recovered video sequence.
- A video coder may need to achieve a particular target compression ratio based on factors such as network bandwidth. Thus, certain frames of a video sequence may be coded with a higher compression than other frames in the video sequence. Typically, the higher the compression, the lower the resulting image quality. Consequently, the frames with relatively high compression may have a lower visual quality than adjacent frames, leading to sudden changes in visual quality in the video sequence. Therefore, designers of video coding systems endeavor to provide coding systems that maintain smooth transitions in the visual quality of video.
-
FIG. 1 is a simplified block diagram of a video coding system according to an embodiment of the present invention. -
FIG. 2 is a functional block diagram of a video coding system according to an embodiment of the present invention. -
FIG. 3 is a simplified block diagram of a video coding system of another embodiment of the present invention. -
FIG. 4 illustrates a method to iteratively code a video sequence to achieve a target bitrate according to an embodiment of the present invention. -
FIG. 5 illustrates a method to code a video sequence based on information generated on a previous coding pass according to an embodiment. -
FIG. 6 illustrates a method to estimate a file size of a video sequence according to an embodiment. -
FIG. 7 illustrates a video sequence including a selected random access picture (RAP) frame, according to an embodiment. -
FIG. 8 illustrates a method to select a RAP frame according to an embodiment. - Embodiments of the present invention provide techniques for efficiently coding/decoding video data during circumstances when constraints are imposed on the video data. According to the embodiments, coding parameters of a video sequence may be selected based on a target bit rate. The video sequence may be predictively coded based on the parameters. If the target bit rate is not achieved, regions of the video sequence with high bit rates may be identified, a filtering strength applied to the identified regions may be increased, and the video sequence may be predictively coded with the increased filtering strength.
- In an embodiment, on a first coding pass, a video sequence may be coded based on a first set of coding parameters. Values of a characteristic of frames from the video sequence may be stored during the first coding pass. Frames which violate a constraint imposed on the characteristic based on the stored values may be identified. Target characteristic values for the frames from the video sequence may be determined. The target characteristic values may be lower than the constraint. A second set of coding parameters to achieve the target characteristic values may be computed. On a second pass, the video sequence may be coded based on the second set of coding parameters.
- In an embodiment, perceptual model values may be determined from a video sequence. An index into a matrix may be computed based on the perceptual model values. The matrix may store associations between parameter range(s) and file sizes. A file size may be retrieved from the matrix corresponding to the computed index. The video sequence may be predictively coded with parameters associated with the file size.
- In an embodiment, a motion compensated error energy of a current frame of a video sequence may be computed. A weighted average motion compensated error energy of frames successive to the current frame in coding order may be computed. If the motion compensated error energy of the current frame is higher than the weighted average motion compensated error energy of the successive frames by a first factor, a weighted average motion compensated error energy of the current frame and frames adjacent to the current frame may be computed. If the weighted average motion compensated error energy of the current frame and the adjacent frames is higher than the weighted average motion compensated error energy of the successive frames by a second factor, the current frame may be marked as a random access picture (RAP) frame. The current frame may be predictively coded.
- In an embodiment, a frame from a video sequence may be marked as a delayed decoder refresh frame. Frames successive to the delayed decoder refresh frame in coding order may be predictively coded without reference to frames preceding the delayed decoder refresh frame in coding order. The distance between the delayed decoder refresh frame and the successive frames may exceed a distance threshold.
- In an embodiment, frames successive to a current frame in decoding order may be decoded without reference to frames preceding the current frame in decoding order. The distance between the current frame and the successive frames may exceed a distance threshold.
-
FIG. 1 is a simplified block diagram of a video coding system 100 according to an embodiment of the present invention. The system 100 may include at least two terminals 110-120 interconnected via a network 150. For unidirectional transmission of data, a first terminal 110 may code video data at a local location for transmission to the other terminal 120 via the network 150. The second terminal 120 may receive the coded video data of the other terminal from the network 150, decode the coded data and display the recovered video data. Unidirectional data transmission is common in media serving applications and the like. -
FIG. 1 illustrates a second pair of terminals 130, 140 provided to support bidirectional transmission of coded video that may occur, for example, during videoconferencing. For bidirectional transmission of data, each terminal 130, 140 may code video data captured at a local location for transmission to the other terminal via the network 150. Each terminal 130, 140 also may receive the coded video data transmitted by the other terminal, may decode the coded data and may display the recovered video data at a local display device. - In
FIG. 1, the terminals 110-140 are illustrated as servers, personal computers and smart phones but the principles of the present invention are not so limited. Embodiments of the present invention find application with laptop computers, tablet computers, media players and/or dedicated video conferencing equipment. The network 150 represents any number of networks that convey coded video data among the terminals 110-140, including, for example, wireline and/or wireless communication networks. The communication network 150 may exchange data in circuit-switched and/or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks and/or the Internet. For the purposes of the present discussion, the architecture and topology of the network 150 are immaterial to the operation of the present invention unless explained hereinbelow. -
FIG. 2 is a functional block diagram of a video coding system 200 according to an embodiment of the present invention. The system 200 may include a video source 210 that provides video data to be coded by the system 200, a pre-processor 220, a video coder 230, a transmitter 240 and a controller 250 to manage operation of the system 200. -
The video source 210 may provide video to be coded by the rest of the system 200. In a media serving system, the video source 210 may be a storage device storing previously prepared video. In a videoconferencing system, the video source 210 may be a camera that captures local image information as a video sequence. Video data typically is provided as a plurality of individual frames that impart motion when viewed in sequence. The frames themselves typically are organized as a spatial array of pixels. -
The pre-processor 220 may perform various analytical and signal conditioning operations on video data. The pre-processor 220 may parse input frames into color components (for example, luminance and chrominance components) and also may parse the frames into pixel blocks, spatial arrays of pixel data, which may form the basis of further coding. The pre-processor 220 also may apply various filtering operations to the frame data to improve efficiency of coding operations applied by a video coder 230. -
The video coder 230 may perform coding operations on the video sequence to reduce the video sequence's bit rate. The video coder 230 may include a coding engine 232, a local decoder 233, a reference picture cache 234, a predictor 235 and a controller 236. The coding engine 232 may code the input video data by exploiting temporal and spatial redundancies in the video data and may generate a datastream of coded video data, which typically has a reduced bit rate as compared to the datastream of source video data. As part of its operation, the video coder 230 may perform motion compensated predictive coding, which codes an input frame predictively with reference to one or more previously-coded frames from the video sequence that were designated as "reference frames." In this manner, the coding engine 232 codes differences between pixel blocks of an input frame and pixel blocks of reference frame(s) that are selected as prediction reference(s) to the input frame. - In an embodiment, a
video coder 230 may be provided as a multi-pass coder in which portions of the video sequence may be coded iteratively in a trial-and-error manner using different coding parameters to achieve a target bit rate. Typically, the target bit rate represents a number of bits per unit time. Results of one coding pass may be exchanged with the pre-processor 220 to improve results of a subsequent coding pass. -
The local decoder 233 may decode coded video data of frames that are designated as reference frames. Operations of the coding engine 232 typically are lossy processes. When the coded video data is decoded at a video decoder (not shown in FIG. 2), the recovered video sequence typically is a replica of the source video sequence with some errors. The local decoder 233 replicates decoding processes that will be performed by the video decoder on reference frames and may cause reconstructed reference frames to be stored in the reference picture cache 234. In this manner, the system 200 may store local copies of reconstructed reference frames that have common content with the reconstructed reference frames that will be obtained by a far-end video decoder (absent transmission errors). -
The predictor 235 may perform prediction searches for the coding engine 232. That is, for a new frame to be coded, the predictor 235 may search the reference picture cache 234 for image data that may serve as an appropriate prediction reference for the new frame. The predictor 235 may operate on a pixel block-by-pixel block basis to find appropriate prediction references. In some cases, as determined by search results obtained by the predictor 235, an input frame may have prediction references drawn from multiple frames stored in the reference picture cache 234. -
The controller 236 may manage coding operations of the video coder 230, including, for example, selection of coding parameters to meet a target bit rate of coded video. Typically, video coders operate according to constraints imposed by bit rate requirements, quality requirements and/or error resiliency policies; the controller 236 may select coding parameters for frames of the video sequence in order to meet these constraints. For example, the controller 236 may assign coding modes and/or quantization parameters to frames and/or pixel blocks within frames. -
The transmitter 240 may buffer coded video data to prepare it for transmission to the far-end terminal (not shown). The transmitter 240 may merge coded video data from the video coder 230 with other data to be transmitted to the terminal, for example, coded audio data and/or ancillary data streams (sources not shown). - The
controller 250 may manage operation of the system 200. During coding, the controller 250 may assign to each frame a certain frame type (either of its own accord or in cooperation with the controller 236), which can affect the coding techniques that are applied to the respective frame. For example, frames often are assigned as one of the following frame types: -
- An Intra Frame (I frame) is one that is coded and decoded without using any other frame in the sequence as a source of prediction,
- A Predictive Frame (P frame) is one that is coded and decoded using earlier frames in the sequence as a source of prediction.
- A Bidirectionally Predictive Frame (B frame) is one that is coded and decoded using both earlier and future frames in the sequence as sources of prediction.
- Frames commonly are parsed spatially into a plurality of pixel blocks (for example, blocks of 4×4, 8×8 or 16×16 pixels each) and coded on a pixel block-by-pixel block basis. Pixel blocks may be coded predictively with reference to other coded pixel blocks as determined by the coding assignment applied to the pixel blocks' respective frames. For example, pixel blocks of I frames can be coded non-predictively or they may be coded predictively with reference to pixel blocks of the same frame (spatial prediction). Pixel blocks of P frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one previously coded reference frame. Pixel blocks of B frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one or two previously coded reference frames.
-
FIG. 3 is a simplified block diagram of a video coding system 300 of another embodiment of the present invention, illustrating pixel-block coding operations. The system 300 may include a pre-processor 310, a block-based coder 320, a reference frame decoder 330, a reference picture cache 340, a predictor 350, a transmit buffer 360 and a controller 370. - The block-based
coder 320 may include a subtractor 321, a transform unit 322, a quantizer 323 and an entropy coder 324. The subtractor 321 may generate data representing a difference between the source pixel block and a reference pixel block developed for prediction. The subtractor 321 may operate on a pixel-by-pixel basis, developing residuals at each pixel position over the pixel block. Non-predictively coded blocks may be coded without comparison to reference pixel blocks, in which case the pixel residuals are the same as the source pixel data. - The
coder 320 may be provided as a multi-pass coder in which portions of the video sequence may be coded iteratively in a trial-and-error manner using different coding parameters to achieve a target bit rate. Results of one coding pass may be exchanged with the pre-processor 310 to improve results of a subsequent coding pass. - The
transform unit 322 may convert the source pixel block data to an array of transform coefficients, such as by a discrete cosine transform (DCT) process or a wavelet transform. The quantizer unit 323 may quantize (divide) the transform coefficients obtained from the transform unit 322 by a quantization parameter QP. The entropy coder 324 may code quantized coefficient data by run-value coding, run-length coding or the like. Data from the entropy coder may be output to the channel as coded video data of the pixel block. The reference frame decoder 330 may decode pixel blocks of reference frames and assemble decoded data for such reference frames. Decoded reference frames may be stored in the reference picture cache 340. - The
predictor 350 may generate and output prediction blocks to the subtractor 321. The predictor 350 also may output metadata identifying type(s) of predictions performed. For inter-prediction coding, the predictor 350 may search among the reference picture cache for pixel block data of previously coded and decoded frames that exhibits strong correlation with the source pixel block. When the predictor 350 finds an appropriate prediction reference for the source pixel block, it may generate motion vector data that is output to the decoder as part of the coded video data stream. The predictor 350 may retrieve a reference pixel block from the reference cache that corresponds to the motion vector and may output it to the subtractor 321. For intra-prediction coding, the predictor 350 may search among the previously coded and decoded pixel blocks of the same frame being coded for pixel block data that exhibits strong correlation with the source pixel block. Operation of the predictor 350 may be constrained by a mode selection provided by the controller 370. For example, if the controller selects an intra-coding mode for application to a frame, the predictor 350 will be constrained to use intra-coding techniques. If the controller selects an inter-prediction mode for the frame, the predictor may select among inter-coding modes and intra-coding modes depending upon results of its searches. - A transmit
buffer 360 may accumulate metadata representing pixel block coding order, coded pixel block data and metadata representing coding parameters applied to the coded pixel blocks. The metadata can include prediction modes, motion vectors and quantization parameters applied during coding. Accumulated data may be formatted and transmitted to the channel. - A
controller 370 may manage coding of the source video, including selection of a coding mode for use by the predictor 350 and selection of quantization parameters to be applied to pixel blocks. -
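The subtractor/transform/quantizer chain of FIG. 3 can be illustrated with a toy Python sketch. This is not the patent's implementation: the 1-D DCT stands in for whatever separable transform the transform unit 322 applies, and the uniform divide-and-round stands in for the quantizer unit 323.

```python
import math

def pixel_residuals(source_block, prediction_block=None):
    """Subtractor 321: per-pixel difference between source and prediction.
    With no prediction (non-predictive coding), residuals equal the source."""
    if prediction_block is None:
        return [row[:] for row in source_block]
    return [[s - p for s, p in zip(srow, prow)]
            for srow, prow in zip(source_block, prediction_block)]

def dct_1d(x):
    """Orthonormal type-II DCT of one row of residuals; the transform
    unit 322 would apply such a transform over rows and columns."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append((math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)) * s)
    return out

def quantize(coeffs, qp):
    """Quantizer 323: divide coefficients by QP and round; a larger QP
    discards more detail and yields fewer bits after entropy coding."""
    return [round(c / qp) for c in coeffs]
```

Note how a constant (flat) block concentrates all its energy in the DC coefficient, which is what makes the subsequent quantization and entropy coding effective.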
FIG. 4 illustrates a method 400 to iteratively code a video sequence to achieve a target bit rate according to an embodiment of the present invention. A pre-processor may filter a portion of a source video sequence according to an initial set of coding parameters (box 410). A video coder may select coding parameters for the sequence portion based on an estimate of the target bit rate (box 420). The video coder may code the sequence portion according to the coding parameters (box 430). The video coder may then determine whether the coded video data obtained from coding satisfies the target bit rate (box 440). If so, the coding operation may be terminated for the current portion of the sequence (box 490), and coding may advance to another portion of the sequence, if available. Otherwise, the video coder may identify regions of the coded sequence that generate high bit rates (box 450). In response to the identified regions, the pre-processor may increase filtering strengths as applied to the regions in the source data (box 460) and operation may return to box 420 for another coding pass. - The initial set of coding parameters utilized by the pre-processor (box 410) may be predetermined, may be a default set of parameters, or may be derived from an analysis of the source video performed by the pre-processor and related controls. The coding parameters selected by the video coder may involve selections of prediction mode for frames within the sequence portion and quantization parameters applied to pixel blocks within the frames.
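The control flow of FIG. 4 can be sketched as a loop; `code_fn` and `strengthen_fn` are hypothetical stand-ins for the video coder and pre-processor, not names from the patent, and the pass limit is an assumption.

```python
def code_to_target(portion, target_bits, code_fn, strengthen_fn, max_passes=5):
    """Iteratively code a portion of a sequence until it meets a target
    bit rate. code_fn(portion) -> (coded_bits, high_rate_regions);
    strengthen_fn(portion, regions) returns a more strongly filtered portion."""
    for _ in range(max_passes):
        coded_bits, regions = code_fn(portion)        # boxes 420-430
        if coded_bits <= target_bits:                 # box 440
            return portion, coded_bits                # box 490: done
        portion = strengthen_fn(portion, regions)     # boxes 450-460
    return portion, coded_bits  # give up after max_passes
```

Each failed pass strengthens the pre-filter on the expensive regions, trading detail for bit rate on the next attempt.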
- Filtering and coding operations performed by
method 400 may vary among different regions of the video sequence. Typically, due to correlation among frames, the identified regions will persist across a common spatial region of multiple frames in a portion of the video sequence being coded. Accordingly, recursive operations of the video coder and pre-processor may be performed on a single frame of the video or they may be performed on a set of several frames (say, 10 frames) of the video sequence. - In an embodiment, when a video coder identifies a region with high bit rates (box 450), the video coder may adjust target bit rates of frames containing such regions at the expense of frames that do not (box 470). When operation returns to box 420 for another coding pass, the video coder may select coding parameters corresponding to each frame's target bit rate. For example, the video coder may select relatively higher quantization parameters for frames that had high bit rate regions in prior passes, which tends to reduce the bit rates of such frames at the expense of lower coding quality.
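The budget shift of box 470 might look like the following sketch; the 20% fraction and the proportional take-from-everyone policy are illustrative assumptions, not values from the patent.

```python
def reallocate_targets(targets, hot_frames, fraction=0.2):
    """Move a fraction of the bit budget of frames without high-bit-rate
    regions to the frames that contain them; the total budget is preserved."""
    out = list(targets)
    taken = 0.0
    for i in range(len(out)):
        if i not in hot_frames:
            cut = out[i] * fraction
            out[i] -= cut
            taken += cut
    for i in hot_frames:
        out[i] += taken / len(hot_frames)
    return out
```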
- When a video sequence is coded, the sequence may be governed by a coding policy that imposes constraints on certain characteristics of the video sequence. For example, a constraint may limit the bit rate over the video sequence. Another constraint may limit the bit rate over a fixed window of frames. In another example, a constraint may define a minimum threshold on the visual quality of a window of frames. To adhere to the constraints, the video coder may adjust coding parameters such as a quantization parameter (QP) and mode selection. Specifically, the video coder may react to constraint violations as they occur and manage the parameters through models that map QPs to characteristic levels. For example, for constraints on data rate, a QP-to-bits mapping model may be utilized and, for constraints on visual quality, a QP-to-peak signal-to-noise ratio (QP-to-PSNR) mapping model may be utilized. However, this may lead to isolated frames which have noticeably worse characteristics than neighboring frames. In the case of constraints on data rate, one frame can be coded much smaller and have much poorer visual quality than surrounding frames; in the case of constraints on visual quality, one frame can have much higher visual quality and be coded much bigger than surrounding frames, resulting in a poor viewing experience for the end-viewer.
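As one deliberately simple example of a QP-to-characteristic model, assume bits ≈ k/QP: a single observed (QP, bits) pair calibrates k, which can then be inverted to pick a QP for a target frame size. Real QP-to-bits and QP-to-PSNR models are richer; this one-parameter form is an assumption for illustration only.

```python
def fit_inverse_model(qp, bits):
    """Calibrate the constant k of a one-parameter model bits = k / QP
    from one observed coding result."""
    return qp * bits

def qp_for_target(k, target_bits):
    """Invert the model: the QP expected to produce target_bits."""
    return k / target_bits
```

For a QP-to-PSNR model, an analogous fit-and-invert pair would map QP to a quality level rather than a size.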
-
FIG. 5 illustrates a method 500 to code a video sequence based on information generated on a previous coding pass according to an embodiment. A video coder may code, on a first coding pass, a video sequence based on an initial set of coding parameters (box 510). A controller controlling the operations of the video coder may compute revised parameters for the video coder based on the first coding pass (box 520). The video coder may then re-code the video sequence utilizing the revised parameters (box 530). - In an embodiment, to compute the revised parameters, during the first coding pass, attributes of frames affecting the coding parameters may be stored. Based on the stored attributes, frames which violate constraints may be identified. Then, the coding parameters for a subsequent coding iteration may be adjusted so that parameter changes from one frame to another are gradual, while simultaneously ensuring that constraints are not violated for any of the frames in the video sequence. For example, when a portion of the video sequence that violates one or more constraints is identified, a window of support may be extended back to the beginning of the scene in which the constraint is violated when shaping a characteristic curve of the frames in that window. Thus, coding parameters may be adjusted smoothly to avoid sudden changes in visual quality within a scene.
- Given a first coding pass of a sequence of frames and a constraint, the characteristic values resulting from the first coding may be stored and transformed to form a list (called, for example, “targetCharacteristicCurveArray”) of desired characteristic values that satisfy the constraint. A second coding of the sequence of frames may generate (within a tolerance) the desired characteristic values as stored in targetCharacteristicCurveArray.
- In an embodiment, as the second coding progresses, the values in targetCharacteristicCurveArray for coded frames may be updated with the actual characteristic values, and values in targetCharacteristicCurveArray for future frames may be adjusted to reflect the actual values that are generated.
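The running update described above might be sketched as follows; spreading the surplus or deficit evenly over the remaining frames is an illustrative policy, as the patent does not fix a redistribution strategy.

```python
def update_target_curve(curve, i, actual):
    """After frame i is coded on the second pass, record its actual
    characteristic value in the target curve and distribute the
    difference from the target evenly across the not-yet-coded frames."""
    surplus = curve[i] - actual
    curve[i] = actual
    remaining = len(curve) - i - 1
    if remaining:
        share = surplus / remaining
        for j in range(i + 1, len(curve)):
            curve[j] += share
    return curve
```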
- A data rate constraint may be expressed as a pair of values: a maximum data rate and a fixed-length window of presentation times over which to compute the data rate. In another example, a data rate constraint may be inferred from a model such as the Hypothetical Reference Decoder. A visual quality constraint may impose a minimum visual quality (given some metric, such as PSNR) over a specified set of frames.
- A constraint may be the decoder complexity required to decode a video sequence. Another constraint may be the amount of heat dissipated by a decoder while decoding a video sequence (decoder thermal generation). A constraint may be the amount of energy utilized by a decoder to decode a video sequence (decoder power usage or battery drainage). Another constraint may be the visual quality of a video sequence in dark scenes. Still another constraint may be the quality degradation through visual masking.
- The set of frames over which a constraint is imposed need not be successive frames and can be imposed over all frames with a perceptual model score in a particular range, all frames with average luma in some range, etc. The perceptual model score may be based on the visual quality and/or video complexity of a video sequence. The perceptual model score may be computed from spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor.
- The transformation applied to characteristic values resulting from the first coding pass may include scaling the values by a fixed constant. For example, for a data rate constraint, if a set of frames totals A bits and A is greater than a specified maximum of B bits, individual frame sizes can be scaled by B/A. Further, the scaling factor applied to each frame may be modulated by something as simple as the relative size of the frame; it may also be modulated by the perceptual significance of the frame.
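The B/A scaling above, in sketch form (uniform scaling only; per-frame modulation by relative size or perceptual significance is omitted for brevity):

```python
def scale_to_budget(frame_bits, max_total):
    """If the frames total A bits and A exceeds the maximum B, scale every
    frame size by B/A so the set satisfies the data rate constraint."""
    total = sum(frame_bits)          # A
    if total <= max_total:
        return list(frame_bits)
    factor = max_total / total       # B / A
    return [b * factor for b in frame_bits]
```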
- In the case of multiple constraints, the transformation may be computed jointly or sequentially. For example, a set of frames may violate both data rate and visual quality constraints. A sequential transformation may update the targetCharacteristicCurveArray based on the data rate constraint and then on the visual quality constraint. A joint transformation may update the targetCharacteristicCurveArray based on both constraints simultaneously, for example, by targeting a weighted average of both measures.
- In an embodiment, the
method 500 may be performed on portions of a video sequence rather than the entirety of a video sequence and may operate at various granularities such as a sequence of frames belonging to a common scene, a group of dependent frames, short clips, single frames, slices, and coding units (pixel blocks). - In an embodiment, a number of coding strategies can be employed to minimize the number of encodes for any frame, including the use of an adaptive QP-to-characteristic model. QP-to-characteristic models may include QP-to-bits and QP-to-PSNR as explained above.
- In an embodiment, analytical operations performed for curve shaping may be performed by a pre-processor. Management of the targetCharacteristicCurveArray and selection of coding parameters may be performed by controller(s) within the system.
- A video coder may receive a data rate value or a quality level as a control input along with a video sequence to determine the size of the output coded bitstream. Often, the data rate may be specified without regard to the content of the video; likewise, often, the quality level may be specified without regard to the resulting data rate.
-
FIG. 6 illustrates a method 600 to estimate a file size of a video sequence according to an embodiment. The method 600 may scan the video sequence and compute perceptual model values therefrom (box 610), based on, for example, spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor from content of the video sequence. Based on the computed values, the method 600 may develop an index into a Data Rate/Quality matrix, taking into account the resolution and the frame rate of the video (box 620). The method 600 may then retrieve a file size estimate based on the information in the matrix (box 630). - Perceptual model values may be developed in a variety of ways. For example, a single number may be distilled from a number of values. In an embodiment, a single weighted value may be computed from, for example, spatial, temporal, and spatiotemporal visual masking values derived by a pre-processor from content of the video sequence. The
method 600 may be performed by a coder controller, which also may store the Data Rate/Quality matrix. - In an embodiment, the Data Rate/Quality matrix may store values representing optimum file sizes for video along a multi-dimensional parameter range. An example of a multi-dimensional parameter range may be a data rate range and a quality range, where the Data Rate/Quality matrix is dependent on the resolution, duration and frame rate of the video. The Data Rate/Quality matrix may also store file size values based on thermal output range, power utilization range, and decoder complexity range. In an embodiment, file size values stored in the matrix may be derived from operation of similar coders on other training sequences having their own perceptual model values, resolution, duration and frame rate and the file sizes generated by those coders.
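A file size lookup per boxes 620-630 might be sketched as below. The bucketing of the perceptual score and the matrix key layout are assumptions, since the patent leaves the indexing scheme open; the matrix values would come from training runs as described above.

```python
def estimate_file_size(perceptual_score, resolution, frame_rate, matrix):
    """Quantize the perceptual model value into a coarse bucket and use it,
    together with resolution and frame rate, as an index into the
    Data Rate/Quality matrix of pre-trained file size estimates."""
    key = (round(perceptual_score, 1), resolution, frame_rate)
    return matrix[key]
```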
- In an embodiment, the optimum file size may minimize the weighted sum of a quality degradation value and the resulting bit rate. The quality degradation value could include metrics for visual quality through a perceptual model over a noiseless channel and for a number of noisy channels. In other embodiments, the optimum file size may minimize the weighted sum of the resulting decoder complexity and the resulting decoder power/thermal output. In an embodiment, the coder may search for the optimum coding by sampling the specified Data Rate/Quality space.
- In an embodiment, the
method 600 may be performed on portions of a video sequence rather than the entirety of a video sequence and may operate at various granularities such as short clips, single frames, slices, and coding units (pixel blocks). For example, in the case of video capture, buffered frames can be processed to calculate the optimum file size. - When frames cannot be buffered (e.g. in a real-time application scenario) or only a small number of frames can be buffered, an accurate global optimum file size may be difficult to calculate. Therefore, a quality level that is predefined or calculated based on the history can be used to avoid spending excessive bits in frames that already reach acceptable quality. The bits saved can be used in later frames that require more bits to reach the quality level, or not used at all such that the bit rate of the entire stream can be reduced.
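The low-latency fallback described above can be sketched as a per-frame budget decision; the halving policy and function names are illustrative assumptions.

```python
def frame_budget(predicted_quality, nominal_bits, quality_target, cut=0.5):
    """If a frame is predicted to reach the target quality anyway, spend
    fewer bits on it and bank the savings for later frames (or drop them
    entirely to lower the stream's overall bit rate)."""
    if predicted_quality >= quality_target:
        spend = nominal_bits * cut
    else:
        spend = nominal_bits
    return spend, nominal_bits - spend  # (bits to spend, bits saved)
```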
- Random access pictures (RAPs) represent another coding constraint of a system. RAPs facilitate random access within a bitstream. When RAPs are placed where no natural scene change exists, they incur bit rate spikes and often cause sudden changes in the visual quality of a scene (visual flashes). Inserting RAPs in the middle of scenes may be inefficient because RAPs are often coded as Instantaneous Decoder Refresh (IDR) frames. An IDR is a type of I-frame which forces the decoder to refresh its state immediately, guaranteeing that no state prior (in decode order) to the IDR is necessary for decoding the IDR or any subsequent frame. This break in coding dependency causes the aforementioned bit rate spikes during coding and visual flashes when the coded video is decoded and displayed.
- To minimize bit rate spikes and visual flashes, a frame may be identified as a RAP frame based on relative motion masking.
FIG. 7 illustrates a video sequence 700 including a selected RAP frame 710, according to an embodiment. Frame 710 from the video sequence 700 may be selected as a RAP frame if the frames 720 before frame 710 in coding order have relatively high motion masking and the frames 730 after frame 710 in coding order have relatively low motion masking. - Motion masking may be computed as a weighted average, over a video segment, of motion compensated error energy (MCEE). The MCEE represents the amount of pixel change between frames along motion trajectories (motion vectors). The MCEE may be computed, for example, as a sum of absolute differences (SAD), sum of squared differences (SSD), etc., between successive frames in that video segment.
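A minimal MCEE sketch using SAD with a zero motion field (a real coder differences pixels along motion trajectories), plus the weighted average used for motion masking:

```python
def mcee_sad(frame_a, frame_b):
    """Motion compensated error energy between successive frames, computed
    here as a plain sum of absolute differences for brevity."""
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b))

def motion_masking(mcees, weights):
    """Weighted average of per-frame MCEE values over a video segment."""
    return sum(m * w for m, w in zip(mcees, weights)) / sum(weights)
```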
- In an embodiment, the motion masking levels may be determined by a pre-processor and/or a controller.
-
FIG. 8 illustrates a method 800 to select a RAP frame according to an embodiment. The method 800 may compute the MCEE of the current frame and determine whether it exceeds a threshold (box 810). If it does, the method 800 may compute a weighted average of motion compensated error energy (WMCEE) of the frames successive to the current frame (box 820). If the MCEE of the current frame exceeds the WMCEE of the successive frames by a first factor (box 830), the method 800 may compute a WMCEE of the current frame and frames adjacent to the current frame (box 840). If the WMCEE of the current frame and adjacent frames exceeds the WMCEE of the successive frames by a second factor (box 850), the current frame may be selected as a RAP frame. Otherwise, the method 800 may determine whether the next frame qualifies as a RAP frame. - In an embodiment, if multiple frames within a video sequence are identified as potential RAP frames by
method 800 and the potential RAP frames are within a proximity threshold to each other, not all potential RAP frames need to be selected as RAP frames. In an embodiment, the last potential RAP frame (in decoding order) can be selected as a RAP frame. In another embodiment, a subsampling of potential RAP frames may be selected as RAP frames. - In an embodiment, the highest motion masking video segment within a video sequence may be identified and a selected RAP frame may be inserted into that video segment.
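The decision chain of method 800 above, collapsed into a single predicate over precomputed energies (the threshold and factors are tuning inputs; the patent does not fix their values):

```python
def is_rap_candidate(mcee_cur, wmcee_adjacent, wmcee_successive,
                     threshold, factor1, factor2):
    """Return True when the current frame qualifies as a RAP frame: its
    MCEE exceeds a threshold (box 810) and exceeds the successive frames'
    WMCEE by factor1 (box 830), and the WMCEE around the current frame
    exceeds the successive frames' WMCEE by factor2 (box 850)."""
    if mcee_cur <= threshold:
        return False
    if mcee_cur <= factor1 * wmcee_successive:
        return False
    return wmcee_adjacent > factor2 * wmcee_successive
```

A high-motion frame followed by low-motion frames passes all three tests, matching the FIG. 7 placement criterion.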
- In an embodiment, RAPs within a scene may be re-used to reduce the bit rate overhead. A re-usable RAP frame may be defined as a “Delayed Decoder Refresh” (DDR) frame. A DDR frame may not force an immediate state refresh at a decoder but guarantees that state information from frames prior to the DDR frame (in decode order) is not necessary to decode the DDR frame itself or to decode frames subsequent to the DDR frame (in decode order) that are more than a specified number Ndelay of frames from the DDR frame. Thus, the DDR frame may be used as a reference frame for the frames immediately after it (in decode order) and as a RAP for the frames Ndelay+1 frames after it (in decode order).
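The DDR referencing guarantee can be expressed as a legality check over decode-order frame indices; the function name and the 0-based index convention are illustrative, not from the patent.

```python
def may_reference(frame_idx, ref_idx, ddr_idx, n_delay):
    """Whether the frame at decode-order position frame_idx may use the
    frame at ref_idx as a reference, given a DDR frame at ddr_idx with
    delay n_delay. The DDR itself and frames more than n_delay frames
    past it must not depend on anything before the DDR; frames in the
    intervening window may still reference pre-DDR state."""
    if frame_idx == ddr_idx or frame_idx > ddr_idx + n_delay:
        return ddr_idx <= ref_idx < frame_idx
    return ref_idx < frame_idx  # ordinary causal referencing
```

With a DDR at position 10 and n_delay = 3, frame 12 may still reference frame 9, but frame 14 may not; this is exactly what lets the DDR serve as a RAP for frames Ndelay+1 positions after it.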
- In an embodiment, multiple delays (delay0, delay1, . . . , delayX) may be associated with a single DDR frame to indicate that the DDR frame may be used as a RAP for X+1 video segments. The frame at the beginning (in decode order) of each such segment may include the appropriate delay value, Ndelay. In an embodiment, setting Ndelay to 0 may indicate that the DDR is to be treated immediately as an IDR frame in the associated video segment.
- In an embodiment, information pertaining to the DDR frame such as the number Ndelay may be signaled in the channel data 260 (
FIG. 2 ) as part of syntax defining a DDR frame. - In an embodiment, a frame may specify the DDR frame which it needs as a reference frame. This may be done by giving each DDR frame an identifier which may be signaled in the bitstream using a specified number of bits.
- The foregoing discussion has described operation of the embodiments of the present invention in the context of coders and decoders. Commonly, video coders are provided as electronic devices. They can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on personal computers, notebook computers or computer servers. Similarly, decoders can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors, or they can be embodied in computer programs that execute on personal computers, notebook computers or computer servers. Decoders commonly are packaged in consumer electronics devices, such as gaming systems, DVD players, portable media players and the like and they also can be packaged in consumer software applications such as video games, browser-based media players and the like.
- Several embodiments of the invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.
Claims (36)
1. A video coding method, comprising:
selecting coding parameters of a video sequence based on a target bit rate;
predictively coding the video sequence based on the parameters; and
if the target bit rate is not achieved:
identifying regions of the video sequence with high bit rates,
increasing a filtering strength applied to the identified regions, and
predictively coding the video sequence with the increased filtering strength.
2. A video coding method, comprising:
predictively coding, on a first coding pass, a video sequence based on a first set of coding parameters;
storing values of a characteristic of frames from the video sequence during the first coding pass;
identifying frames which violate a constraint imposed on the characteristic based on the stored values;
determining target characteristic values for the frames from the video sequence, wherein the target characteristic values are lower than the constraint;
computing a second set of coding parameters to achieve the target characteristic values; and
predictively coding, on a second pass, the video sequence based on the second set of coding parameters.
3. The method of claim 2 , wherein the characteristic is at least one of data rate, visual quality, decoder complexity, decoder thermal generation, and decoder power usage.
4. The method of claim 2 , wherein the second set of coding parameters includes a quantization parameter.
5. A video coding method, comprising:
determining perceptual model values from a video sequence;
computing an index based on the perceptual model values into a matrix, wherein the matrix stores associations between at least one parameter range and file sizes;
retrieving a file size from the matrix corresponding to the computed index; and
predictively coding the video sequence with parameters associated with the file size.
6. The method of claim 5 , wherein the perceptual model values are determined from at least one of spatial and temporal visual masking values.
7. The method of claim 5 , wherein the at least one parameter range includes at least one of data rate range, resolution range, thermal output range, power utilization range, and decoder complexity range.
8. A video coding method, comprising:
determining a number of buffered frames from a video sequence available to calculate an optimum file size; and
if the number of available buffered frames is below a threshold, predictively coding the video sequence to achieve one of a predetermined video quality level or a video quality level determined from previous coding history.
9. A video coding method, comprising:
computing a motion compensated error energy of a current frame of a video sequence;
computing a weighted average motion compensated error energy of frames successive to the current frame in coding order;
if the motion compensated error energy of the current frame is higher than the weighted average motion compensated error energy of the successive frames by a first factor, computing a weighted average motion compensated error energy of the current frame and frames adjacent to the current frame;
if the weighted average motion compensated error energy of the current frame and the adjacent frames is higher than the weighted average motion compensated error energy of the successive frames by a second factor, marking the current frame as a random access picture frame; and
predictively coding the current frame.
10. A video coding method, comprising:
marking a frame from a video sequence as a delayed decoder refresh frame; and
predictively coding frames successive to the delayed decoder refresh frame in coding order without reference to frames preceding the delayed decoder refresh frame in coding order, wherein the distance between the delayed decoder refresh frame and the successive frames exceeds a distance threshold.
11. The method of claim 10 , further comprising:
predictively coding the delayed decoder refresh frame without reference to the frames preceding the delayed decoder refresh frame in coding order.
12. The method of claim 10 , further comprising:
communicating the distance threshold to a decoder.
13. A decoding method, comprising:
decoding frames successive to a current frame in decoding order without reference to frames preceding the current frame in decoding order, wherein a distance between the current frame and the successive frames exceeds a distance threshold.
14. The method of claim 13 , wherein the current frame is marked by a coder as a delayed decoder refresh frame.
15. The method of claim 13 , further comprising:
decoding the current frame without reference to the frames preceding the current frame in decoding order.
16. A coding apparatus, comprising:
a controller to select coding parameters of a video sequence based on a target bit rate; and
a coding engine to:
predictively code the video sequence based on the parameters, and
if the target bit rate is not achieved:
identify regions of the video sequence with high bit rates, and
predictively code the video sequence with an increased filtering strength applied to the identified regions by a pre-processor.
17. A coding apparatus, comprising:
a coding engine to:
predictively code, on a first coding pass, a video sequence based on a first set of coding parameters, and
predictively code, on a second coding pass, the video sequence based on a second set of coding parameters;
a storage device to store values of a characteristic of frames from the video sequence during the first coding pass; and
a controller to:
identify frames which violate a constraint imposed on the characteristic based on the stored values,
determine target characteristic values for the frames from the video sequence, wherein the target characteristic values are lower than the constraint, and
compute the second set of coding parameters to achieve the target characteristic values.
18. The apparatus of claim 17 , wherein the characteristic is at least one of data rate, visual quality, decoder complexity, decoder thermal generation, and decoder power usage.
19. The apparatus of claim 17 , wherein the second set of coding parameters includes a quantization parameter.
20. A coding apparatus, comprising:
a pre-processor to determine perceptual model values from a video sequence;
a controller to:
compute an index based on the perceptual model values into a matrix, wherein the matrix stores associations between at least one parameter range and file sizes, and
retrieve a file size from the matrix corresponding to the computed index; and
a coding engine to predictively code the video sequence with parameters associated with the file size.
21. The apparatus of claim 20 , wherein the perceptual model values are determined from at least one of spatial and temporal visual masking values.
22. The apparatus of claim 20 , wherein the at least one parameter range includes at least one of data rate range, resolution range, thermal output range, power utilization range, and decoder complexity range.
23. A coding apparatus, comprising:
a controller to:
determine a number of buffered frames from a video sequence available to calculate an optimum file size, and
determine if the number of available buffered frames is below a threshold; and
a coding engine to:
if the number of available buffered frames is below the threshold, predictively code the video sequence to achieve one of a predetermined video quality level or a video quality level determined from previous coding history.
24. A coding apparatus, comprising:
a controller to:
compute a motion compensated error energy of a current frame of a video sequence,
compute a weighted average motion compensated error energy of frames successive to the current frame in coding order,
if the motion compensated error energy of the current frame is higher than the weighted average motion compensated error energy of the successive frames by a first factor, compute a weighted average motion compensated error energy of the current frame and frames adjacent to the current frame, and
if the weighted average motion compensated error energy of the current frame and the adjacent frames is higher than the weighted average motion compensated error energy of the successive frames by a second factor, mark the current frame as a random access picture frame; and
a coding engine to predictively code the current frame.
25. A coding apparatus, comprising:
a controller to mark a frame from a video sequence as a delayed decoder refresh frame; and
a coding engine to predictively code frames successive to the delayed decoder refresh frame in coding order without reference to frames preceding the delayed decoder refresh frame in coding order, wherein the distance between the delayed decoder refresh frame and the successive frames exceeds a distance threshold.
26. The coding apparatus of claim 25 , wherein the coding engine is further configured to predictively code the delayed decoder refresh frame without reference to the frames preceding the delayed decoder refresh frame in coding order.
27. The coding apparatus of claim 25 , further comprising:
a channel to communicate the distance threshold to a decoder.
28. A decoding apparatus, comprising:
a decoding engine to decode frames successive to a current frame in decoding order without reference to frames preceding the current frame in decoding order, wherein a distance between the current frame and the successive frames exceeds a distance threshold.
29. The decoding apparatus of claim 28 , wherein the current frame is marked by a coder as a delayed decoder refresh frame.
30. The decoding apparatus of claim 28 , wherein the decoding engine is further configured to decode the current frame without reference to the frames preceding the current frame in decoding order.
31. A storage device storing program instructions that, when executed by a processor, cause the processor to:
select coding parameters of a video sequence based on a target bit rate;
predictively code the video sequence based on the parameters; and
if the target bit rate is not achieved:
identify regions of the video sequence with high bit rates,
increase a filtering strength applied to the identified regions, and
predictively code the video sequence with the increased filtering strength.
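The re-coding loop of claim 31 can be sketched as below. The `code_fn` interface (mapping per-region filter strengths to per-region bit rates), the step size, and the pass limit are all assumptions introduced for illustration.

```python
def rate_control_pass(code_fn, regions, target_bit_rate,
                      base_strength=1.0, strength_step=0.5, max_passes=4):
    """Hypothetical loop: re-code with stronger pre-filtering on the
    highest-bit-rate region until the target bit rate is achieved."""
    strengths = {r: base_strength for r in regions}
    for _ in range(max_passes):
        rates = code_fn(strengths)  # assumed coder interface
        if sum(rates.values()) <= target_bit_rate:
            return strengths, rates
        # Identify the region with the highest bit rate and
        # increase the filtering strength applied to it.
        worst = max(rates, key=rates.get)
        strengths[worst] += strength_step
    return strengths, rates
```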
32. A storage device storing program instructions that, when executed by a processor, cause the processor to:
predictively code, on a first coding pass, a video sequence based on a first set of coding parameters;
store values of a characteristic of frames from the video sequence during the first coding pass;
identify frames which violate a constraint imposed on the characteristic based on the stored values;
determine target characteristic values for the frames from the video sequence, wherein the target characteristic values are lower than the constraint;
compute a second set of coding parameters to achieve the target characteristic values; and
predictively code, on a second pass, the video sequence based on the second set of coding parameters.
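The target-selection step of the two-pass scheme in claim 32 can be sketched as follows; the 0.9 safety margin is an illustrative assumption, since the claim only requires targets lower than the constraint.

```python
def two_pass_targets(first_pass_values, constraint, margin=0.9):
    """Derive per-frame second-pass targets from first-pass statistics:
    frames that violated the constraint get a target safely below it."""
    violators = [i for i, v in enumerate(first_pass_values)
                 if v > constraint]
    targets = list(first_pass_values)
    for i in violators:
        # Aim under the constraint by an assumed safety margin.
        targets[i] = constraint * margin
    return violators, targets
```

A second set of coding parameters (e.g. quantization levels) would then be computed to hit these targets on the second pass.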
33. A storage device storing program instructions that, when executed by a processor, cause the processor to:
determine perceptual model values from a video sequence;
compute an index based on the perceptual model values into a matrix, wherein the matrix stores associations between at least one parameter range and file sizes;
retrieve a file size from the matrix corresponding to the computed index; and
predictively code the video sequence with parameters associated with the file size.
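The matrix lookup of claim 33 can be illustrated as below. The bucket layout, the tuple-keyed dictionary standing in for the matrix, and the clamping behavior are assumptions made for the sketch.

```python
def lookup_file_size(perceptual_values, param_ranges, size_matrix):
    """Quantize each perceptual-model value into its parameter range
    to form an index, then read the file size stored at that index."""
    index = []
    for value, (low, high, buckets) in zip(perceptual_values,
                                           param_ranges):
        # Clamp, then map the value to one of `buckets` sub-ranges.
        clamped = min(max(value, low), high)
        step = (high - low) / buckets
        index.append(min(int((clamped - low) / step), buckets - 1))
    return size_matrix[tuple(index)]
```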
34. A storage device storing program instructions that, when executed by a processor, cause the processor to:
compute a motion compensated error energy of a current frame of a video sequence;
compute a weighted average motion compensated error energy of frames successive to the current frame in coding order;
if the motion compensated error energy of the current frame is higher than the weighted average motion compensated error energy of the successive frames by a first factor, compute a weighted average motion compensated error energy of the current frame and frames adjacent to the current frame;
if the weighted average motion compensated error energy of the current frame and the adjacent frames is higher than the weighted average motion compensated error energy of the successive frames by a second factor, mark the current frame as a random access picture frame; and
predictively code the current frame.
35. A storage device storing program instructions that, when executed by a processor, cause the processor to:
mark a frame from a video sequence as a delayed decoder refresh frame; and
predictively code frames successive to the delayed decoder refresh frame in coding order without reference to frames preceding the delayed decoder refresh frame in coding order, wherein the distance between the delayed decoder refresh frame and the successive frames exceeds a distance threshold.
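The reference-picture restriction implied by claim 35 (and, on the decoder side, claim 36) can be sketched as follows; the frame indexing and return convention are illustrative assumptions.

```python
def allowed_references(frame_index, ddr_index, distance_threshold):
    """Hypothetical reference filter: frames farther than the distance
    threshold past the delayed decoder refresh (DDR) frame may not
    reference anything preceding the DDR frame in coding order."""
    if frame_index - ddr_index > distance_threshold:
        # Refresh has taken effect: only the DDR frame and later
        # frames remain usable as prediction references.
        return list(range(ddr_index, frame_index))
    # Within the threshold window, earlier frames are still usable,
    # which is what distinguishes a DDR frame from an instantaneous
    # decoder refresh.
    return list(range(frame_index))
```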
36. A storage device storing program instructions that, when executed by a processor, cause the processor to:
decode frames successive to a current frame in decoding order without reference to frames preceding the current frame in decoding order, wherein a distance between the current frame and the successive frames exceeds a distance threshold.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US13/652,311 US20130235928A1 (en) | 2012-03-06 | 2012-10-15 | Advanced coding techniques |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261607484P | 2012-03-06 | 2012-03-06 | |
| US13/652,311 US20130235928A1 (en) | 2012-03-06 | 2012-10-15 | Advanced coding techniques |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20130235928A1 true US20130235928A1 (en) | 2013-09-12 |
Family
ID=49114112
Family Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/631,428 Expired - Fee Related US9432694B2 (en) | 2012-03-06 | 2012-09-28 | Signal shaping techniques for video data that is susceptible to banding artifacts |
| US13/652,311 Abandoned US20130235928A1 (en) | 2012-03-06 | 2012-10-15 | Advanced coding techniques |
Family Applications Before (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US13/631,428 Expired - Fee Related US9432694B2 (en) | 2012-03-06 | 2012-09-28 | Signal shaping techniques for video data that is susceptible to banding artifacts |
Country Status (1)
| Country | Link |
|---|---|
| US (2) | US9432694B2 (en) |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20150181208A1 (en) * | 2013-12-20 | 2015-06-25 | Qualcomm Incorporated | Thermal and power management with video coding |
| CN109040765A (en) * | 2018-07-25 | 2018-12-18 | 成都鼎桥通信技术有限公司 | Video data playback method and device |
| US20190020810A1 (en) * | 2017-07-11 | 2019-01-17 | Hanwha Techwin Co., Ltd. | Apparatus for processing image and method of processing image |
| CN109479147A (en) * | 2016-07-14 | 2019-03-15 | 诺基亚技术有限公司 | Method and technical equipment for temporal inter-view prediction |
| US20250077532A1 (en) * | 2013-09-27 | 2025-03-06 | Lucas J. Myslinski | Apparatus, systems and methods for scoring and distributing the reliability of online information |
| CN120416497A (en) * | 2025-07-01 | 2025-08-01 | 瀚博半导体(上海)有限公司 | Parallel decoding method, device and computer equipment for hardware decoder |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2014171001A1 (en) * | 2013-04-19 | 2014-10-23 | 日立マクセル株式会社 | Encoding method and encoding device |
| US11477351B2 (en) | 2020-04-10 | 2022-10-18 | Ssimwave, Inc. | Image and video banding assessment |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20090148058A1 (en) * | 2007-12-10 | 2009-06-11 | Qualcomm Incorporated | Reference selection for video interpolation or extrapolation |
| WO2011115045A1 (en) * | 2010-03-17 | 2011-09-22 | 株式会社エヌ・ティ・ティ・ドコモ | Moving image prediction encoding device, moving image prediction encoding method, moving image prediction encoding program, moving image prediction decoding device, moving image prediction decoding method, and moving image prediction decoding program |
| US20140146885A1 (en) * | 2011-07-02 | 2014-05-29 | Samsung Electronics Co., Ltd. | Method and apparatus for multiplexing and demultiplexing video data to identify reproducing state of video data |
Family Cites Families (13)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP1603338A4 (en) | 2003-03-10 | 2007-12-05 | Mitsubishi Electric Corp | VIDEO SIGNAL ENCODING DEVICE AND VIDEO SIGNAL ENCODING METHOD |
| US7394856B2 (en) | 2003-09-19 | 2008-07-01 | Seiko Epson Corporation | Adaptive video prefilter |
| US7697759B2 (en) * | 2004-05-11 | 2010-04-13 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration | Split-remerge method for eliminating processing window artifacts in recursive hierarchical segmentation |
| SG130962A1 (en) | 2005-09-16 | 2007-04-26 | St Microelectronics Asia | A method and system for adaptive pre-filtering for digital video signals |
| US8126283B1 (en) | 2005-10-13 | 2012-02-28 | Maxim Integrated Products, Inc. | Video encoding statistics extraction using non-exclusive content categories |
| US8009963B2 (en) | 2006-01-26 | 2011-08-30 | Qualcomm Incorporated | Adaptive filtering to enhance video bit-rate control performance |
| WO2008085377A2 (en) * | 2006-12-28 | 2008-07-17 | Thomson Licensing | Banding artifact detection in digital video content |
| US8107571B2 (en) | 2007-03-20 | 2012-01-31 | Microsoft Corporation | Parameterized filters and signaling techniques |
| US7973977B2 (en) * | 2007-05-18 | 2011-07-05 | Reliance Media Works | System and method for removing semi-transparent artifacts from digital images caused by contaminants in the camera's optical path |
| JP2010532628A (en) * | 2007-06-29 | 2010-10-07 | トムソン ライセンシング | Apparatus and method for reducing artifacts in images |
| CN101855910B (en) | 2007-09-28 | 2014-10-29 | 杜比实验室特许公司 | Video compression and transmission techniques |
| JP5276170B2 (en) * | 2008-08-08 | 2013-08-28 | トムソン ライセンシング | Method and apparatus for detecting banding artifacts |
| WO2011081637A1 (en) | 2009-12-31 | 2011-07-07 | Thomson Licensing | Methods and apparatus for adaptive coupled pre-processing and post-processing filters for video encoding and decoding |
- 2012-09-28: US 13/631,428 → US9432694B2 (not active: Expired - Fee Related)
- 2012-10-15: US 13/652,311 → US20130235928A1 (not active: Abandoned)
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20090148058A1 (en) * | 2007-12-10 | 2009-06-11 | Qualcomm Incorporated | Reference selection for video interpolation or extrapolation |
| WO2011115045A1 (en) * | 2010-03-17 | 2011-09-22 | 株式会社エヌ・ティ・ティ・ドコモ | Moving image prediction encoding device, moving image prediction encoding method, moving image prediction encoding program, moving image prediction decoding device, moving image prediction decoding method, and moving image prediction decoding program |
| US20130044813A1 (en) * | 2010-03-17 | 2013-02-21 | Ntt Docomo, Inc. | Moving image prediction encoding/decoding system |
| US20140146885A1 (en) * | 2011-07-02 | 2014-05-29 | Samsung Electronics Co., Ltd. | Method and apparatus for multiplexing and demultiplexing video data to identify reproducing state of video data |
Non-Patent Citations (3)
| Title |
|---|
| Boon et al. (WO 2011/115045 A1) translation from Espacenet. * |
| Chen et al., "Comments on Clean Decoding Refresh Pictures," JCTVC-E400, March 2011. * |
| Fujibayashi et al., "Random access support for HEVC," JCTVC-D234, January 2011. * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20250077532A1 (en) * | 2013-09-27 | 2025-03-06 | Lucas J. Myslinski | Apparatus, systems and methods for scoring and distributing the reliability of online information |
| US20150181208A1 (en) * | 2013-12-20 | 2015-06-25 | Qualcomm Incorporated | Thermal and power management with video coding |
| WO2015094776A3 (en) * | 2013-12-20 | 2015-09-03 | Qualcomm Incorporated | Thermal and power management with video coding |
| US20160007024A1 (en) * | 2013-12-20 | 2016-01-07 | Qualcomm Incorporated | Thermal and power management with video coding |
| CN109479147A (en) * | 2016-07-14 | 2019-03-15 | 诺基亚技术有限公司 | Method and technical equipment for temporal inter-view prediction |
| US20190020810A1 (en) * | 2017-07-11 | 2019-01-17 | Hanwha Techwin Co., Ltd. | Apparatus for processing image and method of processing image |
| US10778878B2 (en) * | 2017-07-11 | 2020-09-15 | Hanwha Techwin Co., Ltd. | Apparatus for processing image and method of processing image |
| CN109040765A (en) * | 2018-07-25 | 2018-12-18 | 成都鼎桥通信技术有限公司 | Video data playback method and device |
| CN120416497A (en) * | 2025-07-01 | 2025-08-01 | 瀚博半导体(上海)有限公司 | Parallel decoding method, device and computer equipment for hardware decoder |
Also Published As
| Publication number | Publication date |
|---|---|
| US20130235942A1 (en) | 2013-09-12 |
| US9432694B2 (en) | 2016-08-30 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20130235928A1 (en) | Advanced coding techniques | |
| US9215466B2 (en) | Joint frame rate and resolution adaptation | |
| AU2014275405B2 (en) | Tuning video compression for high frame rate and variable frame rate capture | |
| JP5351040B2 (en) | Improved video rate control for video coding standards | |
| EP3207701B1 (en) | Metadata hints to support best effort decoding | |
| USRE44457E1 (en) | Method and apparatus for adaptive encoding framed data sequences | |
| US9584832B2 (en) | High quality seamless playback for video decoder clients | |
| US9025664B2 (en) | Moving image encoding apparatus, moving image encoding method, and moving image encoding computer program | |
| US10574997B2 (en) | Noise level control in video coding | |
| US9888240B2 (en) | Video processors for preserving detail in low-light scenes | |
| US10721476B2 (en) | Rate control for video splicing applications | |
| US20180184089A1 (en) | Target bit allocation for video coding | |
| US20090074075A1 (en) | Efficient real-time rate control for video compression processes | |
| US9565404B2 (en) | Encoding techniques for banding reduction | |
| US12413738B2 (en) | Video encoding method and apparatus and electronic device | |
| US9451288B2 (en) | Inferred key frames for fast initiation of video coding sessions | |
| WO2012027892A1 (en) | Rho-domain metrics | |
| EP1204279A2 (en) | A method and apparatus for adaptive encoding framed data sequences | |
| US20150350688A1 (en) | I-frame flashing fix in video encoding and decoding | |
| CN117812268A (en) | Video transcoding method, device, equipment and medium | |
| CN120303925A (en) | Method and apparatus for video codec performance measurement and evaluation | |
| US8345992B2 (en) | Method and device of image encoding and image processing apparatus |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: APPLE INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHUNG, CHRIS Y.;PAN, HAO;ZHAI, JIEFU;AND OTHERS;SIGNING DATES FROM 20121005 TO 20121015;REEL/FRAME:029131/0878 |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |