HK1163989B - Improved interpolation of video compression frames

Description
The present application is a divisional application of Chinese patent application 03814629.0, entitled "Video image compression method", filed on June 27, 2003.
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation-in-part of U.S. application Serial No. 09/904,203, filed July 11, 2001, and claims priority to U.S. C.I.P. application Serial No. 10/187,395, filed June 28, 2002.
Technical Field
The present invention relates to video compression and more particularly to improved interpolation of video compression frames in MPEG-like encoding and decoding systems.
Background
MPEG video compression
MPEG-2 and MPEG-4 are international video compression standards, each defining a video syntax that provides an efficient way to represent image sequences as more compact coded data. The language of the coded bits is the "syntax": for example, a few tokens can represent an entire block of samples (e.g., 64 samples for MPEG-2). Both MPEG standards also describe a decoding (reconstruction) process whereby the coded bits are mapped from the compact representation back into an approximation of the original format of the image sequence. For example, a flag in the coded bitstream may signal whether the following bits are to be processed with a prediction algorithm before being decoded with a Discrete Cosine Transform (DCT) algorithm. The algorithms comprising the decoding process are regulated by the semantics defined by these MPEG standards. This syntax can be used to exploit common video characteristics such as spatial redundancy, temporal redundancy, uniform motion, spatial masking, and so forth. An MPEG decoder must be able to parse and decode an incoming data stream, but so long as the data stream conforms to the corresponding MPEG syntax, a wide variety of possible data structures and compression techniques can be used (although technically this deviates from the standard, since the semantics are not conformant). It is also possible to carry the needed semantics within an alternative syntax.
These MPEG standards use a variety of compression methods, including intra-frame and inter-frame methods. In most video scenes, the background remains relatively stable while action takes place in the foreground. The background may move, but a great deal of the scene is often redundant. These MPEG standards start compression by creating reference frames called "intra" frames or "I frames". I frames are compressed without reference to other frames and thus contain an entire frame of video information. I frames provide entry points into a data bitstream for random access, but can only be moderately compressed. Typically, the data representing I frames is placed in the bitstream every 12 to 15 frames (although it is also useful in some circumstances to use much wider spacing between I frames). Thereafter, since only a small portion of the frames that fall between the reference I frames differs from the bracketing I frames, only the image differences are captured, compressed, and stored. Two types of frames are used for such differences: predicted frames (P frames) and bi-directionally predicted (or interpolated) frames (B frames).
P frames are generally encoded with reference to a past frame (an I frame or a previous P frame) and, in general, are used as a reference for subsequent P frames. P frames receive a fairly high amount of compression. B frames provide the highest amount of compression, but require both a past and a future reference frame in order to be encoded. P frames and I frames are "referenceable frames" because they can be referenced by P frames or B frames.
A macroblock is a region of image pixels. For MPEG-2, a macroblock is a 16x16 pixel grouping of four 8x8 DCT blocks, together with one motion vector for P frames and one or two motion vectors for B frames. Macroblocks within P frames may be individually encoded using either intra-frame or inter-frame (predicted) coding. Macroblocks within B frames may be individually encoded using intra-frame coding, forward predicted coding, backward predicted coding, or both forward and backward (i.e., bi-directionally interpolated) predicted coding. A slightly different but similar structure is used in MPEG-4 video coding.
After encoding, an MPEG data bitstream comprises a sequence of I frames, P frames, and B frames. A sequence may consist of almost any pattern of I, P, and B frames (there are a few minor semantic restrictions on their placement). However, it is common industry practice to use a fixed frame pattern (e.g., IBBPBBPBBPBBPBB).
Motion vector prediction
In MPEG-2 and MPEG-4 (and similar standards, such as H.263), the use of B-type (bi-directionally predicted) frames has proven to be beneficial for compression efficiency. The motion vector of each macroblock of such a frame can be predicted by any one of three methods:
Mode 1: forward prediction from previous I or P frames (i.e., non-bidirectionally predicted frames).
Mode 2: backward prediction from subsequent I or P frames.
Mode 3: bi-directional prediction from subsequent and previous I or P frames.
Mode 1 is the same as the forward prediction method for P frames. Mode 2 is the same concept except that it works backwards from the subsequent frame. Mode 3 is an interpolation mode that combines information from previous and subsequent frames.
In addition to these three modes, MPEG-4 also supports a fourth interpolative motion vector prediction mode for B frames: direct mode prediction, which uses the motion vector from the subsequent P frame plus a delta value (if the motion vector of the co-located P macroblock is split into 8x8 mode, yielding four motion vectors for the 16x16 macroblock, then the delta is applied to all four independent motion vectors in the B frame). The subsequent P frame's motion vector points at the previous P or I frame. A proportion is used to weight the motion vector from the subsequent P frame. The proportion is the relative temporal position of the current B frame with respect to the subsequent P frame and the previous P (or I) frame.
Fig. 1 is a timeline of frames and MPEG-4 direct mode motion vectors according to the prior art. The concept behind MPEG-4 direct mode (mode 4) is that the motion of a macroblock in each intervening B frame is likely to be close to the motion used to code the same location in the subsequent P frame. A delta is used to make minor corrections to the proportional motion vector derived from the corresponding motion vector (MV) 103 of the subsequent P frame. Fig. 1 shows the proportional weighting given to the motion vectors 101 and 102 of each intermediate B frame 104a, 104b as a function of the "temporal distance" between the previous P or I frame 105 and the next P frame 106. The motion vector 101 or 102 assigned to a respective intermediate B frame 104a, 104b equals the assigned weight (1/3 and 2/3, respectively) times the motion vector 103 of the next P frame, plus the delta value.
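As an illustrative sketch (not part of the MPEG-4 standard text or of this disclosure), the direct mode calculation for the M = 3 case of Fig. 1 can be expressed as follows; the function name, the tuple representation of vectors, and the sample values are assumptions:

```python
# Hypothetical sketch of MPEG-4 direct mode scaling for M = 3 (Fig. 1).

def direct_mode_vector(p_mv, b_position, m, delta):
    """Scale the co-located subsequent P frame motion vector for an
    intervening B frame, then apply the small delta correction.

    p_mv:       (x, y) motion vector of the subsequent P frame macroblock
                (pointing back at the previous P or I frame).
    b_position: 1 for the first B frame after the previous P/I frame, etc.
    m:          the MPEG "M" parameter (number of B frames plus 1).
    delta:      (dx, dy) correction carried in the B frame macroblock.
    """
    weight = b_position / m                  # 1/3 and 2/3 when M = 3
    return (weight * p_mv[0] + delta[0],
            weight * p_mv[1] + delta[1])

mv103 = (9.0, -3.0)                          # hypothetical MV 103 of Fig. 1
print(direct_mode_vector(mv103, 1, 3, (0.5, 0.0)))   # MV 101: ~1/3 of MV 103
print(direct_mode_vector(mv103, 2, 3, (0.5, 0.0)))   # MV 102: ~2/3 of MV 103
```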
With MPEG-2, all prediction modes for B frames are tested during encoding and compared to find the best prediction for each macroblock. If no prediction is good, the macroblock is coded stand-alone as an "I" ("intra") macroblock. The coding mode is selected as the best of the forward (mode 1), backward (mode 2), and bi-directional (mode 3) modes, or as intra coding. With MPEG-4, selecting intra coding is not allowed; instead, direct mode becomes the fourth choice. Again, the best coding mode is selected based upon some best-match criterion. In MPEG-2 and MPEG-4 software encoders, a DC match (Sum of Absolute Differences, or "SAD") is used to determine the best match.
The number of consecutive B frames in a coded data bitstream is determined by the "M" parameter value in MPEG. M minus one is the number of B frames between each P frame and the next P (or I) frame. Thus, if M = 3, there are two B frames between each P (or I) frame, as illustrated in Fig. 1. The main limitation on the value of M (and thus on the number of consecutive B frames) is that the amount of motion change between P (or I) frames becomes large as M grows. More B frames mean a longer time between P (or I) frames. Thus, the efficiency and coding-range limits of motion vectors set a limit on the number of intermediate B frames.
It is also important to note that P frames carry "change energy" forward with the moving picture stream, since each decoded P frame is used as the starting point for predicting the next subsequent P frame. B frames, however, are discarded after use. Thus, any bits used to create B frames are used only for that frame; unlike with P frames, they do not provide corrections that aid the decoding of subsequent frames.
Disclosure of Invention
The present invention relates to a method, system and computer program for improving the image quality of one or more predicted frames in a video image compression system, wherein each frame comprises a plurality of pixels.
In one aspect, the invention includes determining the value of each pixel of a bi-directionally predicted frame as a weighted proportion of the corresponding pixel values in the non-bidirectionally predicted frames bracketing a sequence of bi-directionally predicted frames. In one embodiment, the weighted proportion is a function of the relative distance between the bracketing non-bidirectionally predicted frames. In another embodiment, the weighted proportion is a blend of that distance function and an equal average of the bracketing non-bidirectionally predicted frames.
In another aspect of the invention, the interpolation of pixel values is performed on a representation in a linear space, or in another optimized non-linear space differing from the original non-linear representation.
Other aspects of the invention include systems, computer programs, and methods encompassing:
● A video image compression system having a sequence of referenceable frames comprising picture regions, wherein at least one picture region of at least one predicted frame is encoded with reference to two or more referenceable frames.
● A video image compression system having a sequence of referenceable frames comprising picture regions, wherein at least one picture region of at least one predicted frame is encoded with reference to one or more referenceable frames, wherein at least one such referenceable frame is not, in display order, the nearest preceding referenceable frame to the at least one predicted frame.
● A video image compression system having a sequence of referenceable frames comprising macroblocks, wherein at least one macroblock within at least one predicted frame is encoded by interpolation from two or more referenceable frames.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames comprising picture regions, wherein at least one picture region of at least one bi-directionally predicted frame is encoded to include two or more motion vectors, each such motion vector referencing a corresponding picture region in at least one referenceable frame.
● A video image compression system having a sequence of referenceable frames comprising image regions, wherein at least one image region of at least one predicted frame is encoded to include at least two motion vectors, each such motion vector referencing a corresponding image region in a referenceable frame, and wherein each such image region of such at least one predicted frame is encoded by interpolation from two or more referenceable frames.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames comprising picture regions, wherein at least one picture region of at least one bi-directionally predicted frame is encoded as an unequal weighting of selected picture regions from two or more referenceable frames.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames comprising picture regions, wherein at least one picture region of at least one bi-directionally predicted frame is encoded by interpolation from two or more referenceable frames, wherein at least one of the two or more referenceable frames is separated, in display order, from the bi-directionally predicted frame by at least one intervening referenceable frame, and wherein such at least one picture region is encoded as an unequal weighting of selected picture regions of such two or more referenceable frames.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames comprising picture regions, wherein at least one picture region of at least one bi-directionally predicted frame is encoded by interpolation from two or more referenceable frames, wherein at least one of the two or more referenceable frames is separated, in display order, from the bi-directionally predicted frame by at least one intervening subsequent referenceable frame.
● A video image compression system having a sequence of predicted frames and bi-directionally predicted frames, each frame comprising pixel values arranged in macroblocks, wherein at least one macroblock within a bi-directionally predicted frame is determined using direct mode prediction based on motion vectors from two or more predicted frames.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames, each frame comprising pixel values arranged in macroblocks, wherein at least one macroblock within a bi-directionally predicted frame is determined using direct mode prediction based on motion vectors from one or more predicted frames, wherein at least one of such one or more predicted frames precedes, in display order, the bi-directionally predicted frame.
● A video image compression system having a sequence of referenceable frames and bi-directionally predicted frames, each frame comprising pixel values arranged in macroblocks, wherein at least one macroblock within a bi-directionally predicted frame is determined using direct mode prediction based on motion vectors from one or more predicted frames, wherein at least one of such one or more predicted frames follows, in display order, the bi-directionally predicted frame and is separated from it by at least one intervening referenceable frame.
● A video image compression system having a sequence of frames, each frame comprising a plurality of image regions having a DC value, each such image region comprising pixels each having an AC pixel value, wherein at least one of the DC value and the AC pixel values of at least one image region of at least one frame is determined as a weighted interpolation of the corresponding respective DC value and AC pixel values of at least one other frame.
● A video image compression system having a sequence of referenceable frames, each frame comprising a plurality of image regions having a DC value, each such image region comprising pixels each having an AC pixel value, wherein at least one of the DC value and the AC pixel values of at least one image region of at least one predicted frame is interpolated from the corresponding respective DC values and AC pixel values of two or more referenceable frames.
● Improving the image quality of a sequence of two or more bi-directionally predicted intermediate frames in a video image compression system, each frame comprising a plurality of image regions having a DC value, each such image region comprising pixels each having an AC pixel value, including at least one of: determining the AC pixel values of each image region of each such bi-directionally predicted intermediate frame as a first weighted proportion of the corresponding AC pixel values in the referenceable frames bracketing the sequence of bi-directionally predicted intermediate frames; and determining the DC value of each image region of each such bi-directionally predicted intermediate frame as a second weighted proportion of the corresponding DC values in the referenceable frames bracketing the sequence of bi-directionally predicted intermediate frames.
● A video image compression system having a sequence of frames comprising a plurality of pixels having an initial representation, wherein the pixels of at least one frame are interpolated from corresponding pixels of at least two other frames, wherein such corresponding pixels of the at least two other frames are transformed to a different representation before interpolation, and the resulting interpolated pixels are transformed back to the initial representation.
● In a video image compression system having a sequence of referenceable frames and bi-directionally predicted frames, dynamically determining a coding pattern of such frames having a variable number of bi-directionally predicted frames, including: selecting an initial sequence that begins with a referenceable frame, has at least one immediately subsequent bi-directionally predicted frame, and ends with a referenceable frame; adding a referenceable frame to the end of the initial sequence to create a test sequence; evaluating the test sequence against a selected evaluation criterion; for each test sequence that satisfies the evaluation, inserting a bi-directionally predicted frame before the added referenceable frame and repeating the evaluating step; and when a test sequence fails the evaluation, accepting the previous test sequence as the current coding pattern.
● A video image compression system having a sequence of referenceable frames separated by at least one bi-directionally predicted frame, wherein the number of such bi-directionally predicted frames in the sequence varies, and wherein at least one picture region of at least one such bi-directionally predicted frame is determined using an unequal weighting of the corresponding pixel values of at least two referenceable frames.
● A video image compression system having a sequence of frames encoded by an encoder and decoded by a decoder, wherein at least one image region of at least one frame is based on a weighted interpolation of two or more other frames, the weighted interpolation being based on at least one set of weights available to the encoder and the decoder, wherein a designation of a selected one of such at least one set of weights is communicated from the encoder to the decoder to select one or more currently effective weights.
● A video image compression system having a sequence of frames encoded by an encoder and decoded by a decoder, wherein at least one image region of at least one frame is based on a weighted interpolation of two or more other frames, the weighted interpolation being based on at least one set of weights, wherein the at least one set of weights is downloaded to the decoder, and a designation of a selected one of the at least one set of weights is thereafter communicated from the encoder to the decoder to select one or more currently effective weights.
● A video image compression system having a sequence of referenceable frames encoded by an encoder and decoded by a decoder, wherein predicted frames in the sequence of referenceable frames are transmitted from the encoder to the decoder in a delivery order that differs from the display order of such predicted frames after decoding.
● A video image compression system having a sequence of referenceable frames comprising pixels arranged in picture regions, wherein at least one picture region of at least one predicted frame is encoded with reference to two or more referenceable frames, and wherein each such picture region is determined using an unequal weighting of the corresponding pixel values of such two or more referenceable frames.
● A video image compression system having a sequence of predicted frames, bi-directionally predicted frames, and intra frames, each frame comprising image regions, wherein at least one filter selected from a set of sharpening and softening filters is applied to at least one image region of a predicted or bi-directionally predicted frame during motion-vector-compensated prediction of such frame.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a timeline of frames and MPEG-4 direct mode motion vectors according to the prior art.
Fig. 2 is a timeline of frames and proportional pixel weighting values in accordance with this aspect of the invention.
Fig. 3 is a timeline of frames and blended proportional and equal pixel weighting values in accordance with this aspect of the invention.
Fig. 4 is a flowchart showing an illustrative embodiment of the invention as a method that may be computer implemented.
Fig. 5 is a diagram showing an example of multiple previous references by a current P frame to two previous P frames and one previous I frame.
Fig. 6A is a diagram of a typical prior art MPEG-2 coding pattern, showing a constant number of B frames between bracketing I frames and/or P frames.
Fig. 6B is a diagram of a theoretically possible prior art MPEG-4 video coding pattern, showing a varying number of B frames between bracketing I frames and/or P frames, as well as a varying distance between I frames.
Fig. 7 is a diagram of coding patterns.
Fig. 8 is a flowchart showing one embodiment of an interpolation method with DC interpolation differing from AC interpolation.
Fig. 9 is a flowchart showing one embodiment of a method for interpolation of luminance pixels using an alternative representation.
Fig. 10 is a flowchart showing one embodiment of a method for interpolation of chroma pixels using an alternative representation.
Fig. 11 is a diagram showing unique motion vector region sizes for each of two P frames.
Fig. 12 is a diagram showing a sequence of P and B frames with interpolation weights for the B frames determined as a function of distance from a 2-away subsequent P frame that references a 1-away subsequent P frame.
Fig. 13 is a diagram showing a sequence of P and B frames with interpolation weights for the B frames determined as a function of distance from a 1-away subsequent P frame that references a 1-away previous P frame.
Fig. 14 is a diagram showing a sequence of P and B frames in which a subsequent P frame has multiple motion vectors referencing previous P frames.
Fig. 15 is a diagram showing a sequence of P and B frames in which the nearest subsequent P frame has a motion vector referencing a prior P frame, and the next nearest subsequent P frame has multiple motion vectors referencing prior P frames.
Fig. 16 is a diagram showing a sequence of P and B frames in which the nearest previous P frame has a motion vector referencing a prior P frame.
Fig. 17 is a diagram showing a sequence of P and B frames in which the nearest previous P frame has two motion vectors referencing prior P frames.
Fig. 18 is a diagram showing a sequence of P and B frames in which the nearest previous P frame has a motion vector referencing a prior P frame.
Fig. 19 is a diagram showing a frame sequence of three P frames P1, P2, and P3, where P3 uses an interpolated reference with two motion vectors, one for each of P1 and P2.
Fig. 20 is a diagram showing a frame sequence of four P frames P1, P2, P3, and P4, where P4 uses an interpolated reference with three motion vectors, one for each of P1, P2, and P3.
Fig. 21 is a diagram showing a sequence of P and B frames in which various P frames have one or more motion vectors referencing various previous P frames, and showing different weights assigned to respective forward and backward references by a particular B frame.
Fig. 22 is a diagram showing a sequence of P and B frames in which the bitstream delivery order of the P frames differs from the display order.
Fig. 23 is a diagram showing sequences of P and B frames with assigned weightings.
Fig. 24 is a position-versus-time diagram of an object within a frame.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
SUMMARY
One aspect of the present invention is based upon the recognition that it is common practice to use an M value of 3, which provides two B frames between each P (or I) frame. However, M = 2 and M = 4 or higher are all useful. It is of particular significance to note that the value of M (the number of B frames plus one) also bears a natural relationship to the frame rate. At 24 frames per second (fps), the playback rate of film movies, the time distance of 1/24 second between frames can result in substantial image changes from frame to frame. At frame rates of 60 fps, 72 fps, or higher, however, the time distance between adjacent frames is correspondingly reduced. The result is that higher numbers of B frames (i.e., higher values of M) become useful and beneficial to compression efficiency as the frame rate increases.
Another aspect of the invention is based upon the recognition that both MPEG-2 and MPEG-4 video compression utilize an oversimplified method of interpolation. For example, for mode 3, the bi-directional prediction for each macroblock of a frame is an equal average of the subsequent and previous frame macroblocks, as displaced by the two corresponding motion vectors. This equal average is appropriate for M = 2 (i.e., a single intermediate B frame), since the B frame is equidistant in time from the previous and subsequent P (or I) frames. However, for larger values of M, only the symmetrically centered B frame (i.e., the middle frame when M = 4, 6, 8, etc.) is optimally predicted using an equal weighting. Similarly, in MPEG-4 direct mode 4, even though the motion vectors are proportionally weighted, the predicted pixel values of each intermediate B frame are an equal proportion of the corresponding pixels of the previous P (or I) frame and the subsequent P frame.
Thus, for M > 2, it is an improvement to apply appropriate proportional weightings to the predicted pixel values of each B frame. The proportional weighting for each pixel of a current B frame corresponds to the relative position of the current B frame with respect to the previous and subsequent P (or I) frames. Thus, if M = 3, the first B frame would use 2/3 of the corresponding pixel value (motion vector adjusted) from the previous frame, and 1/3 of the corresponding pixel value (motion vector adjusted) from the subsequent frame.
Fig. 2 is a timeline of frames and proportional pixel weighting values in accordance with this aspect of the invention. The pixel values within each macroblock of each intermediate B frame 201a, 201b are weighted as a function of the "time distance" between the previous P or I frame A and the next P or I frame B, with greater weight given to the closer P or I frame. That is, each pixel value of a bi-directionally predicted B frame is a weighted combination of the corresponding pixel values of the bracketing non-bidirectionally predicted frames A and B. In this example, with M = 3, the first B frame 201a is weighted as 2/3A + 1/3B, and the second B frame 201b is weighted as 1/3A + 2/3B. Also shown is the equal average weighting that would be assigned under a conventional MPEG system; the MPEG-1, 2, and MPEG-4 weighting for each B frame 201a, 201b would be (A + B)/2.
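A minimal sketch of this proportional weighting, assuming the motion-compensated blocks from the bracketing frames are available as NumPy arrays (the names and toy values are illustrative, not from the patent):

```python
import numpy as np

def proportional_b_frame(block_a, block_b, b_position, m):
    """Weight motion-compensated block A (previous P/I frame) and block B
    (next P/I frame) by the B frame's relative temporal position."""
    w_b = b_position / m           # weight of the next frame B
    w_a = 1.0 - w_b                # weight of the previous frame A
    return w_a * block_a + w_b * block_b

A = np.full((16, 16), 100.0)       # toy block from the previous frame
B = np.full((16, 16), 160.0)       # toy block from the next frame
first_b  = proportional_b_frame(A, B, 1, 3)   # 2/3A + 1/3B -> 120.0
second_b = proportional_b_frame(A, B, 2, 3)   # 1/3A + 2/3B -> 140.0
```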
Applications of extended dynamic range and contrast range
If M is greater than 2, proportional weighting of pixel values in intermediate B frames will, in many cases, improve the effectiveness of bi-directional (mode 3) and direct (MPEG-4 mode 4) coding. Example cases include common movie and video editing effects such as fade-outs and cross-dissolves. These types of video effects are problematic coding cases for both MPEG-2 and MPEG-4, due to the use of a simple DC match algorithm and the use of equal proportions for the B frames at the commonly used value M = 3 (i.e., two intermediate B frames). Coding of such cases is improved by using proportional B frame interpolation in accordance with the invention.
Proportional B frame interpolation also has direct application to improving coding efficiency for extended dynamic and contrast ranges. A common occurrence in image coding is a change in illumination, as happens when an object gradually moves into (or out of) shadow (a soft shadow boundary). If a logarithmic coding representation is used for brightness (e.g., luminance represented as logarithmic luminance Y), an illumination brightness change will be a DC offset change. If the illumination is halved in brightness, the pixel values will all be decreased by an equal amount. Thus, to code this change, an AC match should be found, and a coded DC difference applied to the region. Such a DC difference, coded into a P frame, should also be applied proportionally in each intervening B frame (see co-pending U.S. patent application No. 09/905,039, entitled "Method and System for Improving Compressed Image Chroma Information", assigned to the assignee of the present invention and hereby incorporated by reference, for additional information on logarithmic coding representations).
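A small numerical check of this point (my own example, not from the patent text): in a log2 luminance representation, halving the light level subtracts the same constant from every pixel, i.e., a pure DC shift that leaves the AC detail untouched:

```python
import numpy as np

lum = np.array([0.8, 0.4, 0.2])            # linear luminances of a region
print(np.log2(lum / 2) - np.log2(lum))     # [-1. -1. -1.]: a pure DC offset
```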
In addition to changes in brightness, proportional B frame interpolation also benefits changes in contrast. For example, as an airplane moves toward the viewer, coming out of a cloud or haze, its contrast gradually increases. This increasing contrast will be expressed as increased amplitude in the AC coefficients of the DCT in the corresponding P frame coded macroblocks. Again, contrast changes in intervening B frames will be most closely approximated by proportional interpolation, thus improving coding efficiency.
As frame rates become higher and as the value of M increases, the dynamic range and contrast coding efficiency improvements from proportional B frame interpolation become increasingly important.
Applying high M values to temporal layering
Utilizing embodiments of the present invention, it is possible to increase the value of M, and thereby the number of B frames between bracketing P and/or I frames, while maintaining or improving coding efficiency. This use is beneficial for a number of applications, including temporal layering. For example, in U.S. patent No. 5,988,863, entitled "Temporal and Resolution Layering in Advanced Television" (assigned to the assignee of the present invention and incorporated herein by reference), it is noted that B frames are a suitable mechanism for layered temporal (frame) rates. The flexibility of such rates is related to the number of consecutive B frames available. For example, single B frames (M = 2) can support a 36 fps decoded temporal layer within a 72 fps stream, or a 30 fps decoded temporal layer within a 60 fps stream. Triple B frames (M = 4) can support both 36 fps and 18 fps decoded temporal layers within a 72 fps stream, and 30 fps and 15 fps decoded temporal layers within a 60 fps stream. Using M = 10 within a 120 fps stream can support 12 fps, 24 fps, and 60 fps decoded temporal layers. M = 4 can also be used with a 144 fps stream to provide decoded temporal layers at 72 fps and 36 fps.
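As an arithmetic cross-check of these layer rates (my own reasoning, not taken from the patent): a temporal layer at rate frame_rate/d is decodable when d divides M, since every retained frame then lines up with a P (or I) frame:

```python
def temporal_layers(frame_rate, m):
    """Decodable temporal layer rates for a stream at frame_rate with
    parameter M: frame_rate / d for every divisor d of M."""
    return sorted(frame_rate // d for d in range(1, m + 1) if m % d == 0)

print(temporal_layers(72, 4))     # [18, 36, 72]
print(temporal_layers(120, 10))   # [12, 24, 60, 120]
```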
As an improvement over simply decoding every Nth frame, multiple frames at 120 fps or 72 fps can be decoded and proportionally blended, as described in co-pending U.S. patent application No. 09/545,233, entitled "Enhanced Temporal and Resolution Layering in Advanced Television" (assigned to the assignee of the present invention and incorporated herein by reference), to improve the motion blur characteristics of the 24 fps results.
Even higher frame rates can be synthesized using the methods described in co-pending U.S. patent application No. 09/435,277, entitled "System and Method for Motion Compensation and Frame Rate Conversion" (assigned to the assignee of the present invention and incorporated herein by reference). For example, a 72 fps camera original can be used with motion-compensated frame rate conversion to create an effective frame rate of 288 frames per second. Using M = 12, frame rates of 48 fps and 24 fps can be derived, as well as other useful rates such as 144 fps, 96 fps, and 32 fps (and, of course, the original 72 fps). The frame rate conversions using this method need not be integral multiples. For example, an effective rate of 120 fps can be created from a 72 fps source, and then used as a source for both 60 fps and 24 fps rates (using M = 10).
Temporal layering thus benefits from optimized B frame interpolation performance. The proportional B frame interpolation described above makes larger numbers of consecutive B frames function more efficiently, thereby enabling these benefits.
Blended B-frame interpolation ratios
One reason that equal average weighting is used as the motion-compensated mode predictor of B frame pixel values in conventional systems is that the P (or I) frames preceding and following a particular B frame may be noisy, and therefore may represent imperfect matches. Equal blending optimizes the reduction of such noise in the interpolated motion-compensated blocks. Any remaining difference is coded as a residual using the quantized DCT function. Of course, the better the match from the motion-compensated proportion, the fewer residual bits are needed, and the higher the resulting image quality.
For M > 2, true frame-distance proportions provide a better prediction in the case of objects moving in and out of shadow or haze. When lighting and contrast do not change, however, equal weighting may prove to be a better predictor, since the errors of moving a macroblock forward along its motion vector will be averaged with the errors from the backward-displaced block, thus halving each error. Even so, B frame macroblocks are more likely to correlate with the nearer P (or I) frame than with the farther one.
Thus, in some cases, such as a change in contrast or brightness of a region, it is desirable to utilize the true proportion (for brightness and color) of the B frame macroblock pixel weights, as described above. In other cases, it may be more advantageous to utilize equal proportions, as in MPEG-2 and MPEG-4.
Another aspect of the invention uses a blend of these two proportion techniques (equal average and frame-distance proportion) for B frame pixel interpolation. For example, in the case of M = 3, the 1/3 and 2/3 proportions can be blended at 3/4 weight with the equal average at 1/4 weight, resulting in the two proportions 3/8 and 5/8. This technique can be generalized by using a "blend factor" F:
Weight = F × (frame distance proportional weight) + (1 - F) × (equal average weight)
The useful range of the blend factor F is from 1 to 0, with 1 representing pure proportional interpolation and 0 representing pure equal average (the opposite assignment of values could also be used).
Fig. 3 is a timeline of frames and blended proportional and equal pixel weighting values in accordance with this aspect of the invention. The pixel values of each macroblock of each intermediate B frame 301a, 301b are weighted as a function of the "time distance" between the previous P or I frame A and the next P or I frame B, and as a function of the equal average of A and B. In this example, with M = 3 and a blend factor F = 3/4, the blended weighting for the first B frame 301a is 5/8A + 3/8B (i.e., 3/4 of the proportional weighting 2/3A + 1/3B, plus 1/4 of the equal average weighting (A + B)/2). Similarly, the second B frame 301b is weighted as 3/8A + 5/8B.
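A small sketch of the blend factor formula above (illustrative Python; the function name and argument conventions are assumptions). The second and third calls reproduce the static F = 2/3 proportions worked out later in this section:

```python
def blended_weights(b_position, m, f):
    """Blend of frame-distance proportion and equal average:
    weight = F * proportional + (1 - F) * 1/2."""
    w_next = f * (b_position / m) + (1.0 - f) * 0.5
    return 1.0 - w_next, w_next          # (weight of A, weight of B)

print(blended_weights(1, 3, 3/4))   # (0.625, 0.375): 5/8A + 3/8B, as in Fig. 3
print(blended_weights(1, 3, 2/3))   # weight of B = 7/18 (static F = 2/3)
print(blended_weights(2, 3, 2/3))   # weight of B = 11/18
```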
The value of the blend factor F can be set overall for a complete encoding, or for each group of pictures (GOP), a range of B frames, each B frame, or each region within a B frame (including, for example, as finely as each macroblock or, for MPEG-4 direct mode using a P vector in 8x8 mode, even each individual 8x8 motion block).
To save bits, and reflecting the fact that the blend proportion is usually not important enough to convey with each macroblock, the optimal use of the blend factor should be related to the type of images being compressed. For example, for images that are fading, dissolving, or in which overall lighting or contrast is gradually changing, a blend factor F near or at 1 (i.e., selecting proportional interpolation) is generally optimal. For running images without such lighting or contrast changes, lower blend factor values, such as 2/3, 1/2, or 1/3, may form a better choice, thereby preserving some of the benefits of proportional interpolation as well as some of the benefits of equal average interpolation. All blend factor values within the range of 0 to 1 are generally useful, with a particular value within this range proving optimal for any given B frame.
For wide dynamic range and wide contrast range images, the blend factor can be determined regionally, depending on local region characteristics. In general, however, a wide range of lighting and contrast favors blend factor values toward full proportional interpolation rather than equal average interpolation.
The optimal blend factor is generally determined empirically, although experience with particular types of scenes can be used to create a table of blend factors by scene type. For example, a determination of image change characteristics can be used to select the blend proportion for a frame or region. Alternatively, B frames can be coded using a number of candidate blend factors (for the whole frame, or regionally), with each being evaluated to optimize image quality (determined, for example, by the highest signal-to-noise ratio, or SNR) and the lowest number of bits used. These candidate evaluations can then be used to select the best value for the blend proportion. A combination of image change characteristics and coding quality/efficiency can also be used.
B frames near the middle of a sequence of B frames, and all B frames at low values of M, are not much affected by proportional interpolation, since the computed proportions are already close to the equal average. However, for higher values of M, the extreme B frame positions can be significantly affected by the choice of blend factor. Note that the blend factor can differ for these extreme positions, using more of the equal average than the more central positions, which gain little or no benefit from deviating from the equal average; the extreme positions otherwise take a high proportion of the nearest adjacent P (or I) frame. For example, if M = 5, the first and fourth B frames might use a blend factor F that mixes in more of the equal average, but the second and third B frames might use the strict proportions of 2/5 and 3/5. If the proportion-to-average blend factor varies as a function of a B frame's position in the sequence, the varying blend factor values can be conveyed in the compressed bitstream or as side information to the decoder.
If a static general blend factor is required (due to the lack of a way to convey the value), the value 2/3 is usually near optimal, and can be selected as the static value for B frame interpolation in both the encoder and decoder. For example, using a blend factor F = 2/3 with M = 3, the successive proportions would be 7/18 (7/18 = 2/3 × 1/3 + 1/3 × 1/2) and 11/18 (11/18 = 2/3 × 2/3 + 1/3 × 1/2).
Linear interpolation
Video frame pixel values are generally stored in a particular representation that maps the original image information to numeric values. Such a mapping may result in either a linear or a non-linear representation. For example, the luminance values used in compression are non-linear. Various forms of non-linear representation are in use, including logarithmic, exponential (to various powers), and exponential with a black correction (commonly used for video signals).
Over a narrow dynamic range, or for interpolation of nearby regions, the non-linear representation is acceptable, since such neighboring interpolations represent piecewise-linear interpolations. Thus, small variations in brightness are reasonably approximated by linear interpolation. For wide variations in brightness, however, such as occur in wide dynamic range and wide contrast range images, treating the non-linear signal as though it were linear will be inaccurate. Even for normal contrast range images, linear fades and cross-dissolves can be degraded when interpolated in a non-linear representation. Some fades and cross-dissolves utilize non-linear fade and dissolve rates, adding further complexity.
An additional improvement to the use of blended proportions, and even to simple proportional or equal average interpolation, is to perform the interpolation on pixel values represented in a linear space, or in another optimized non-linear space differing from the original non-linear luminance representation.
This can be accomplished by, for example, first converting the two non-linear luminance signals (from the previous and subsequent P (or I) frames) into a linear representation, or into a different non-linear representation. The proportional blend is then applied, after which the inverse conversion is applied, yielding the blended result in the image's original non-linear luminance representation. The proportion function, however, will have been performed on a more optimal representation of the luminance signals.
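The following sketch illustrates the idea, assuming a pure 2.2 power-law gamma as a stand-in for the actual coding transfer function (a real codec would use the exact transfer function, typically via look-up tables, as noted later in this description):

```python
import numpy as np

GAMMA = 2.2                                  # assumed video gamma exponent

def to_linear(v):                            # nonlinear code (0..1) -> light
    return np.power(v, GAMMA)

def from_linear(v):                          # light -> nonlinear code (0..1)
    return np.power(v, 1.0 / GAMMA)

def blend_in_linear(a, b, w_a, w_b):
    """Proportionally blend two gamma-coded blocks in linear-light space,
    then convert back to the original nonlinear representation."""
    return from_linear(w_a * to_linear(a) + w_b * to_linear(b))

a = np.array([0.9, 0.5, 0.1])                # block from previous P/I frame
b = np.array([0.3, 0.5, 0.7])                # block from subsequent P frame
print(blend_in_linear(a, b, 2/3, 1/3))       # differs from 2/3*a + 1/3*b
```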
In addition to luminance, it is also useful to apply such linear or non-linear conversions advantageously to color (chroma) values when colors are fading or becoming more saturated, as occurs in contrast changes associated with haze and overcast conditions.
Exemplary embodiments
Fig. 4 is a flowchart showing an illustrative embodiment of the invention as a method that may be computer implemented:
Step 400: In a video image compression system, for the direct and interpolative modes of computing B frames, determine an interpolation value to apply to each pixel of an input sequence of two or more bi-directionally predicted intermediate frames, using either (1) frame-distance proportion or (2) a blend of equal weighting and frame-distance proportion, derived from at least two non-bidirectionally predicted frames bracketing such a sequence, input from a source (e.g., a video image stream).
Step 401: Optimize the interpolation value with respect to an image unit, such as a group of pictures (GOP), a sequence of frames, a scene, a frame, a region within a frame, a macroblock, a DCT block, or a similar useful grouping or selection of pixels. The interpolation value may be set statically for an entire encoding session, or dynamically for each image unit.
Step 402: Further optimize the interpolation value with respect to scene type or coding simplicity. For example, the interpolation value may be set: statically (such as 2/3 proportional and 1/3 equal average); proportionally, but blended with more of the equal average for B frames adjacent to the bracketing P (or I) frames (frames near the middle of the sequence being already close to the equal average); dynamically, based on overall scene characteristics, such as fades and cross-dissolves; dynamically (and locally), based on local image region characteristics, such as local contrast and local dynamic range; or dynamically (and locally), based on coding performance (such as highest coded SNR) and the minimum number of coded bits generated.
Step 403: Convey the appropriate proportion amounts to the decoder, if not statically determined.
Step 404: Optionally, convert the luminance (and, optionally, chroma) information for each frame to an alternate (linear or non-linear) representation, and convey this alternate representation to the decoder, if not statically determined.
Step 405: Determine the proportional pixel values using the determined interpolation value.
Step 406: If necessary (because of step 404), convert back to the original representation.
Extended P frame reference
As described above, in the prior art MPEG-1, 2, and 4 compression methods, P frames reference the previous P or I frame, and B frames reference the nearest previous and subsequent P and/or I frames. The same technique is used in the H.261 and H.263 motion-compensated DCT compression standards, which embody low bit rate compression techniques.
In the developing H.263++ and H.26L standards, B frame referencing is extended to point to P or I frames that do not directly bracket the current frame. That is, a macroblock within a B frame can point to a P or I frame before the previous P frame, or to a P or I frame after the subsequent P frame. Skipping the previous or subsequent P frame can be signaled simply with one or more bits per macroblock. Conceptually, the use of previous P frames for reference in B frames requires only storage. For the low bit rate coding uses of H.263++ or H.26L, this is a small amount of additional memory. For subsequent P frame references, the P frame coding order must be modified with respect to the B frame coding, so that the future P frame (or possibly an I frame) is decoded before the intervening B frames. Thus, coding order is also an issue for subsequent P frame references.
The main differences between the P frame and B frame types are: (1) B frames may be bi-directionally referenced (by up to two motion vectors per macroblock); (2) B frames are discarded after use (which also means that they can be skipped during decoding to provide temporal layering); and (3) P frames are used as "springboards", one to the next, since each P frame must be decoded for use as a reference for each subsequent P frame.
As another aspect of the invention, P frames (in contrast to B frames) are decoded with reference to one or more previous P or I frames (excluding the case of each P frame referencing only the single nearest previous P or I frame). Thus, for example, two or more motion vectors per macroblock may be used for a current P frame, all pointing backward in time (i.e., to one or more previously decoded frames). Such P frames still maintain the "springboard" character. Fig. 5 shows an example of multiple previous references by a current P frame 500 to two previous P frames 502, 504 and one previous I frame 506.
Furthermore, the macroblock interpolation concept described above can be applied to such P frame references. Thus, instead of each macroblock referencing a single previous P or I frame, proportions of two or more previous P or I frames can be blended, with one motion vector for each such frame reference. For example, the B frame interpolation mode described above with two frame references can be applied to allow any macroblock in a P frame to reference two previous P frames, or one previous P frame and one previous I frame, using two motion vectors. This technique interpolates between two motion vectors, but it is not bi-directional in time (as in B frame interpolation), since both motion vectors point backward in time. Memory costs have decreased to the point where keeping additional previous P or I frames in memory for such concurrent references is quite practical.
In applying such P frame interpolation, it is necessary to select, and convey to the decoder, the various useful proportions of the two or more previous P frames (and, optionally, one previous I frame). In particular, equal blends of frames are among the useful blend proportions. For example, with the two previous P frames as references, an equal blend of 1/2 of each P frame can be used. With the three previous P frames, an equal blend of 1/3 each can be used.
Another useful blend of two previous P frames is 2/3 of the nearest previous frame and 1/3 of the more distant previous frame. With three previous P frames, another useful blend is 1/2 of the nearest previous frame, 1/3 of the middle previous frame, and 1/6 of the most distant previous frame.
In any case, a simple set of useful blends of multiple previous P frames (and, optionally, one I frame) can be utilized and signaled simply from the encoder to the decoder. A small number of bits can select among a sizeable number of blend proportions, conveyed to the decoder as often as is suitable for the desired image unit. The particular blend proportions used can be selected whenever useful for optimizing the coding efficiency of an image unit, as in the sketch below.
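A minimal sketch of such a multi-reference blend (illustrative Python; the helper name is an assumption, while the weight sets are the examples just given):

```python
import numpy as np

def blend_previous_references(blocks, weights):
    """Blend motion-compensated blocks taken from several previous P (or I)
    frames, listed nearest frame first; weights must sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * blk for w, blk in zip(weights, blocks))

near, middle, far = (np.full((16, 16), v) for v in (100.0, 110.0, 130.0))
two_ref   = blend_previous_references([near, middle], [2/3, 1/3])
three_ref = blend_previous_references([near, middle, far], [1/2, 1/3, 1/6])
```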
As another aspect of the invention, it is also useful to switch a single P frame reference from the nearest previous P (or I) frame to a more "distant" previous P (or I) frame. In this way, a P frame still uses a single motion vector per macroblock (or, alternatively, per 8x8 block in MPEG-4 style coding), but one or more bits indicate that the reference is to a single specific previous frame. P frame macroblocks in this mode are not interpolated, but instead reference a selected one of possibly two, three, or more previous P (or I) frames. For example, a 2-bit code could designate one of up to four previous frames as the single selected reference frame. This 2-bit code may be changed at any convenient image unit.
Adaptive number of B frames
Fixed patterns of I, P, and B frame types are commonly used in MPEG coding. The number of B frames between P frames is typically constant; for example, it is typical in MPEG coding to use two B frames between P (or I) frames. Fig. 6A is a diagram of a typical prior art MPEG-2 coding pattern, showing a constant number of B frames (here, two) between bracketing I frames 600 and/or P frames 602.
The MPEG-4 video coding standard conceptually allows for a variable number of B frames between bracketing I frames and/or P frames, as well as varying distances between I frames. Fig. 6B is a diagram of a theoretically possible prior art MPEG-4 video coding pattern, showing a variable number of B frames between bracketing I frames 600 and/or P frames 602, and a variable distance between I frames 600.
This flexible coding structure can theoretically be utilized to improve coding efficiency by matching the most effective B and P frame coding types to the moving image frames. Although this flexibility has been specifically allowed, it has been explored very little, and the mechanisms for actually determining the placement of B and P frames in such a flexible structure are unknown.
Another aspect of the present invention applies the concepts described herein to this flexible coding structure, as well as to the simple fixed coding patterns in common use. B frames may thus be interpolated using the methods described above, while P frames may reference more than one previous P or I frame and be interpolated as described herein.
In particular, macroblocks within B frames can be blended using proportions appropriate to a flexible coding structure just as effectively as with a fixed structure. Proportional blends can also be utilized when B frames reference P or I frames that are more distant than the nearest bracketing P or I frames.
Similarly, in this flexible coding structure, P frames can reference more than one previous P or I frame just as effectively as in a fixed pattern structure. Further, blend proportions can be applied to macroblocks in such P frames when they reference more than one previous P frame (optionally plus one I frame).
(A) Determining placement in flexible coding patterns
The following method allows an encoder to optimize the efficiency of both the frame coding pattern and the use of blend proportions. For a selected range of frames, a number of candidate coding patterns can be tried to determine an optimal or near-optimal (relative to specified criteria) pattern. Fig. 7 is a diagram of coding patterns that can be examined. An initial sequence 700, ending in a P or I frame, is arbitrarily selected and used as a base for adding additional P and/or B frames, which are then evaluated (as described below). In one embodiment, a P frame is added to the initial sequence 700 to create a first test sequence 702 for evaluation. If the evaluation is satisfactory, an intervening B frame is inserted to create a second test sequence 704. For each satisfactory evaluation, additional B frames are inserted to create successively longer test sequences 706-712, until the evaluation criteria become unsatisfactory. At that point, the previous coding sequence is accepted. The process is then repeated, using the ending P frame of the accepted coding sequence as the starting point for adding a new P frame and then inserting new B frames.
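A minimal sketch of this search (hypothetical Python, not the patent's own procedure; the evaluate() predicate stands in for whichever criterion the encoder applies, such as those discussed next):

```python
def find_coding_pattern(initial, evaluate, max_b_frames=8):
    """Grow 'initial' (ending in a P or I frame) by one P frame, then keep
    inserting B frames before that P frame while evaluation succeeds."""
    best = initial + ['P']                  # like test sequence 702
    if not evaluate(best):
        return initial                      # even the added P frame fails
    for _ in range(max_b_frames):
        trial = best[:-1] + ['B', 'P']      # insert a B before the final P
        if not evaluate(trial):
            break                           # keep the last accepted pattern
        best = trial
    return best

# Toy criterion: allow at most two consecutive B frames.
ok = lambda seq: max(len(r) for r in ''.join(seq).replace('I', 'P').split('P')) <= 2
print(find_coding_pattern(['I'], ok))       # ['I', 'B', 'B', 'P']
```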
An optimal or near-optimal coding pattern can be selected based on various evaluation criteria, typically involving tradeoffs among coding characteristics such as the quality of the coded image and the number of coding bits required. Common evaluation criteria include the least number of bits used (in fixed quantization parameter tests), or the best signal-to-noise ratio (in fixed bit rate tests), or a combination of both.
It is also common to utilize a minimum Sum of Absolute Differences (SAD), which forms a measure of DC match. As described in co-pending U.S. patent application No. 09/904,192, entitled "Motion Estimation for Video Compression Systems" (assigned to the assignee of the present invention and hereby incorporated by reference, and which also describes other useful optimizations), an AC match criterion is also a useful measure of the quality of a particular candidate match. The AC and DC match criteria, accumulated over the best matches of all macroblocks, can be examined to determine the overall match quality of each candidate coding pattern. Used together with an estimate of the number of coded bits for each frame pattern type, this AC/DC match technique can augment or replace the signal-to-noise ratio (SNR) and least-bits-used tests. It is also typical to use a higher quantization parameter value (QP) for B frame macroblocks than for P frames, which affects both the number of bits used and the quality (usually measured as SNR) within the various candidate coding patterns.
(B) Blend proportion optimization in flexible coding patterns
Optionally, for each candidate pattern determined according to the above method, blend proportions can be tested for suitability (e.g., an optimal or near-optimal blend proportion) with respect to one or more criteria. This can be done, for example, by testing for best quality (highest SNR) and/or best efficiency (fewest bits used). The use of one or more previous references for each macroblock in P frames can also be determined in the same way, testing each candidate reference pattern and blend proportion to determine one or more suitable sets of references.
Once a coding pattern has been selected for one step (sequence 700 in Fig. 7), candidate coding patterns can be tested at each subsequent step (sequences 702-712). In this way, a more efficient coding of a moving image sequence can be determined. Efficiency can thus be optimized/improved as described in subsection (A) above, with blend optimization applied at each tested coding step.
DC versus AC interpolation
In many cases of image coding, such as when using a logarithmic representation of image frames, the interpolation of frame pixel values described above will optimally code changes in illumination. However, in alternative video "gamma curve", linear, and other representations, it often proves useful to apply different interpolation blend factors to the DC values of the pixels than to the AC values. Fig. 8 is a flowchart showing one embodiment of an interpolation method with DC interpolation differing from AC interpolation. For selected image regions (usually DCT blocks or macroblocks) 802, 802' from a first and a second input frame, the average pixel value 804, 804' of each such region is subtracted, thereby separating the DC value (i.e., the average of the entire selected region) 806, 806' from the AC values (i.e., the remaining signed pixel values) 808, 808' of the selected regions. The respective DC values 806, 806' can then be multiplied by interpolation weights 810, 810' that differ from the interpolation weights 814, 814' used to multiply the AC (signed) pixel values 808, 808'. The newly interpolated DC value 812 and the newly interpolated AC values 816 can then be combined 818, resulting in a new prediction 820 for the selected region.
As with the other interpolation values of the present invention, the appropriate weights can be signaled to the decoder per image unit. A small number of bits can select among a number of weighting values, as well as select the independent interpolation of the AC versus the DC aspects of the pixel values.
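A minimal sketch of the Fig. 8 procedure for two reference regions (illustrative Python; the function name and the example weights are assumptions):

```python
import numpy as np

def interpolate_dc_ac(region_a, region_b, dc_weights, ac_weights):
    """Split each region into its DC value (the region mean) and AC values
    (the signed remainder), weight the two parts independently, recombine."""
    dc_a, dc_b = region_a.mean(), region_b.mean()
    ac_a, ac_b = region_a - dc_a, region_b - dc_b
    new_dc = dc_weights[0] * dc_a + dc_weights[1] * dc_b    # item 812
    new_ac = ac_weights[0] * ac_a + ac_weights[1] * ac_b    # item 816
    return new_dc + new_ac                                  # prediction 820

a = np.arange(64, dtype=float).reshape(8, 8)    # toy 8x8 regions
b = a[::-1] + 40.0
pred = interpolate_dc_ac(a, b, dc_weights=(2/3, 1/3), ac_weights=(1/2, 1/2))
```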
Linear & nonlinear interpolation
Interpolation is a linear weighted average. Because the interpolation operation is linear, and because the pixel values in each image frame are typically represented in a non-linear form (such as a video gamma or logarithmic representation), further optimization of the interpolation process is possible. For example, interpolation of the pixels of a particular frame sequence, as well as interpolation of DC values separated from AC values, will sometimes be optimal or near optimal with a linear pixel representation. For other frame sequences, however, such interpolation will be optimal or near optimal if the pixels are represented logarithmically or in some other pixel representation. Furthermore, the optimal or near-optimal representation for interpolating the U and V (chrominance) signal components may differ from that for the Y (luminance) signal component. It is therefore a useful aspect of the present invention to convert the pixel representation to an alternate representation as part of the interpolation procedure.
FIG. 9 is a flowchart illustrating one embodiment of a method for interpolating luminance pixels using an alternate representation. Beginning with a region or block of luminance (Y) pixels in an initial representation (e.g., video gamma or logarithmic) (step 900), the pixel data is transformed into an alternate representation (e.g., linear, logarithmic, video gamma) different from the initial representation (step 902). The transformed pixel region or block is then interpolated as described above (step 904) and transformed back to the initial representation (step 906). The result is interpolated pixel luminance values (step 908).
FIG. 10 is a flowchart illustrating one embodiment of a method for interpolating chrominance pixels using an alternate representation. Beginning with a region or block of chrominance (U, V) pixels in an initial representation (e.g., video gamma or logarithmic) (step 1000), the pixel data is transformed into an alternate representation (e.g., linear, logarithmic, video gamma) different from the initial representation (step 1002). The transformed pixel region or block is then interpolated as described above (step 1004) and transformed back to the initial representation (step 1006). The result is interpolated pixel chrominance values (step 1008).
The transformations between the various representations can be performed in accordance with the teachings of U.S. patent application No. 09/905039 entitled "Method and System for Improving Compressed Image Chroma Information", which is assigned to the assignee of the present invention and is hereby incorporated by reference. Note that the alternate-representation transform and its inverse are typically performed using simple lookup tables.
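As a minimal sketch of such a lookup-table round trip, the following assumes 8-bit pixels and a pure power-law gamma of 2.2 — an illustrative assumption, since real video transfer functions differ — transforming to linear light, blending, and transforming back.

```python
import numpy as np

GAMMA = 2.2  # assumed pure power law, for illustration only
TO_LINEAR = (np.arange(256) / 255.0) ** GAMMA          # code value -> linear
FROM_LINEAR = np.clip(
    np.round(np.linspace(0.0, 1.0, 4096) ** (1.0 / GAMMA) * 255.0),
    0, 255).astype(np.uint8)                           # linear -> code value

def interpolate_linear_light(frame_a, frame_b, w):
    """Blend two gamma-coded uint8 frames with weight w in linear light."""
    lin = w * TO_LINEAR[frame_a] + (1.0 - w) * TO_LINEAR[frame_b]
    idx = np.clip(np.round(lin * 4095.0), 0, 4095).astype(np.int32)
    return FROM_LINEAR[idx]

a = np.full((4, 4), 100, dtype=np.uint8)
b = np.full((4, 4), 200, dtype=np.uint8)
mid = interpolate_linear_light(a, b, 0.5)  # brighter than the gamma-domain mean (150)
```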
As a variation of this aspect of the invention, the alternate (linear or non-linear) representation space for AC interpolation may be different from the alternate representation space for DC interpolation.
As with the interpolation weights, the selection of which alternate interpolation representation to use for each luma (Y) and chroma (U and V) pixel representation may be signaled to the decoder using a small number of bits per selected image unit.
Number of motion vectors per macroblock
In MPEG-2, each 16x16 macroblock in a P frame is allowed one motion vector. In B frames, MPEG-2 allows up to two motion vectors per 16x16 macroblock, corresponding to the bi-directional interpolation mode. In MPEG-4 video coding, each 16x16 macroblock in a P frame is allowed up to 4 motion vectors, one for each 8x8 DCT block. In MPEG-4 B frames, a maximum of two motion vectors is allowed per 16x16 macroblock when the interpolation mode is used. A single motion vector delta in MPEG-4 direct mode can yield 4 independent "implicit" motion vectors if the corresponding subsequent P frame macroblock was set to 8x8 mode with 4 motion vectors. This is accomplished by scaling each of the 4 independent motion vectors from the corresponding subsequent P frame macroblock by the temporal distance (the B frame being temporally nearer than the subsequent P frame to that P frame's previous P or I frame reference), and adding to each the one motion vector delta carried in the 16x16 B frame macroblock.
One aspect of the present invention includes the option of increasing the number of motion vectors per image region (e.g., per macroblock). For example, it has sometimes proven beneficial to have more than two motion vectors per B frame macroblock. These can be applied by referencing additional P or I frames and having 3 or more interpolation terms in the weighted sum. Additional motion vectors can also be applied to allow independent vectors for the 8x8 DCT blocks of a B frame macroblock. Also, the direct mode concept can be extended by using 4 independent deltas, applying one independent delta to each of the four 8x8-region motion vectors of the subsequent P frame.
Further, P frames can be adapted, using B-frame interpolation techniques, to reference more than one previous frame in an interpolation mode, using the two-interpolation-term technique of B frames described above. This technique can readily be extended to more than two previous P or I frames, the resulting interpolation having 3 or more terms in the weighted sum.
As with other aspects of the invention (e.g., pixel representation and DC versus AC interpolation methods), a particular weighted sum may be passed to the decoder using a small number of bits per image unit.
In applying this aspect of the invention, the correspondence between 8x8 pixel DCT blocks and the motion vector field need not be as strict as in MPEG-2 and MPEG-4. For example, it may be useful to use alternative motion vector region sizes other than 16x16, 16x8 (used only with interlace in MPEG-4), and 8x8. Such alternatives may include any number of useful region sizes, such as 4x8, 8x12, 8x16, 6x12, 2x8, 24x8, 32x32, 24x24, 24x16, 8x24, 32x8, 32x4, and so forth. With a small number of such useful sizes, a few bits can signal to the decoder the correspondence between motion vector region size and DCT block size. In systems using conventional 8x8 DCT blocks, a simple set of correspondences to the motion vector field is useful to simplify processing during motion compensation. In systems where the DCT block size is not 8x8, greater flexibility can be achieved in specifying the motion vector field, as described in co-pending U.S. patent application No. 09/545233 entitled "Enhanced Temporal and Resolution Layering in Advanced Television", which is assigned to the assignee of the present invention and is hereby incorporated by reference. Note that motion vector region boundaries need not correspond to DCT region boundaries. Indeed, it is often useful to define motion vector regions in such a way that a motion vector region boundary falls within (and not on the boundary of) a DCT block.
The concept of increased flexibility in the motion vector field is also applicable to the interpolation aspects of the present invention. As long as the correspondence between each pixel and one or more motion vectors pointing to one or more reference frames is specified, the interpolation methods described above can be applied with full flexibility and generality. Even the size of the region corresponding to each motion vector can differ for each previous frame reference when P frames are used, and for each previous and subsequent frame reference when B frames are used. If the regions corresponding to the motion vectors differ in size when applying the improved interpolation method of this invention, the interpolation reflects the overlapping common regions. The overlapping common regions of the motion vector references can also be used as the regions over which the DC term is determined when interpolating DC and AC pixel values separately.
Fig. 11 shows the unique motion vector region sizes 1100 and 1102 for each of two P-frames 1104, 1106. In computing the interpolation according to this invention, a union 1108 of the motion vector region sizes is determined. Union 1108 defines all regions that are considered to have motion vectors assigned.
Thus, for example, when interpolating a 4x4 DCT region of the B frame 1112 backward to the previous P frame 1104, the 4x4 region 1110 of the union 1108 would use the motion vector corresponding to the 8x16 region 1114 in the previous P frame. If predicted forward, the same region 1110 of the union 1108 would use the motion vector corresponding to the 4x16 region 1115 in the subsequent P frame. Similarly, backward interpolation of region 1116 of the union 1108 would use the motion vector corresponding to the 8x16 region 1114, while forward prediction of the same region would use the motion vector corresponding to the 12x16 region 1117.
In one embodiment of the present invention, interpolation with general (i.e., non-uniformly sized) motion vector regions is achieved in two steps. The first step is to determine the motion vector common regions, as described with respect to FIG. 11; this establishes the correspondence between pixels and motion vectors (i.e., the number of motion vectors per specified pixel region size) for each previous or subsequent frame reference. The second step is to apply, for each region of pixels, the appropriate interpolation method and the interpolation factors in effect for that region. It is the task of the encoder to ensure that optimal or near-optimal motion vector regions and interpolation methods are specified, and that all pixels have their vectors and interpolation methods fully specified. This is very simple in the case of a fixed pattern of motion vectors (e.g., one motion vector per 32x8 block for the entire frame) with a single specified interpolation method (e.g., a fixed proportional blend, for the entire frame, of the distances to each referenced frame). The approach can become quite complex if the motion vector region size varies regionally, and if the region size differs depending on which previous or subsequent frame is referenced (e.g., 8x8 blocks for the nearest previous frame, but 32x8 blocks for the next nearest previous frame). Further, the interpolation method can also be regionally specified within the frame.
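As a minimal sketch of the first step, the following assumes two references whose motion vector regions tile the frame on regular grids (the frame and region sizes are illustrative). The union partition falls out of pairing each pixel's per-reference region indices.

```python
import numpy as np

def mv_cell_indices(frame_h, frame_w, region_h, region_w):
    """Map every pixel to the index of the motion vector region covering it."""
    cells_per_row = (frame_w + region_w - 1) // region_w
    rows = np.arange(frame_h) // region_h
    cols = np.arange(frame_w) // region_w
    return rows[:, None] * cells_per_row + cols[None, :]

# Step 1: build the "union" partition. Each pixel gets one region index per
# reference; pixels sharing the same index *pair* form an overlapping common
# region, over which a single (motion vector, interpolation factor) applies.
prev_idx = mv_cell_indices(64, 64, 16, 8)   # e.g., 16-high x 8-wide regions
next_idx = mv_cell_indices(64, 64, 16, 4)   # e.g., 16-high x 4-wide regions
union_id = prev_idx * (next_idx.max() + 1) + next_idx

# Step 2 (not shown): for each union region, fetch the motion vectors chosen
# by prev_idx/next_idx and apply the interpolation method in effect there.
print(len(np.unique(union_id)))             # number of common regions
```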
In coding, the encoder's job is to determine the optimal or near-optimal use of bits when selecting among motion vector region shapes and sizes and when selecting the optimal or near-optimal interpolation method; the number of frames referenced and their distances must also be specified. These specifications can be determined by exhaustively testing a number of candidate motion vector region sizes, candidate frames to reference, and interpolation methods for each such motion vector region, until an optimal or near-optimal coding is found. Optimality (with respect to a selected criterion) may be determined by finding the lowest coding error (e.g., the highest SNR) after coding a block, or the fewest bits for a fixed quantization parameter (QP) after coding the block, or by applying another suitable measure.
Direct mode extension
The conventional direct mode used for B frame macroblocks in MPEG-4 is efficient in motion vector coding, providing the benefits of the 8x8 block mode with a simple common delta. Direct mode weights each corresponding motion vector of the subsequent P frame, which references the previous P frame at the corresponding macroblock location, based on temporal distance. For example, if M = 3 (i.e., two intervening B frames), with simple linear interpolation the first B frame uses -2/3 times the subsequent P frame motion vector to determine the pixel offset relative to that P frame, and 1/3 times that same motion vector to determine the pixel offset relative to the previous P frame. Similarly, the second B frame uses -1/3 times the same motion vector for the pixel offset relative to the subsequent P frame, and 2/3 times it for the pixel offset relative to the previous P frame. In direct mode, a small delta is added to each corresponding motion vector. As another aspect of this invention, the concept can be extended to B frame references pointing to one or more n-away P frames, which in turn reference one or more previous or subsequent P frames or I frames, with frame scale fractions determined from the frame distances.
FIG. 12 is a diagram showing a sequence of P and B frames in which the interpolation weights for the B frames are determined as a function of distance from a 2-away subsequent P frame that references a 1-away subsequent P frame. In this illustrative example, M=3, indicating two consecutive B frames 1200, 1202 between the bracketing P frames 1204, 1206. Each co-located macroblock in the next nearest subsequent P frame 1208 may point to the intervening (i.e., nearest) P frame 1204, and the first two B frames 1200, 1202 may reference the next nearest subsequent P frame 1208, rather than the nearest subsequent P frame 1204 as in conventional MPEG. Thus, for the first B frame 1200, the frame scale fraction 5/3 times the motion vector mv from the next nearest subsequent P frame 1208 would be used as the pixel offset relative to P frame 1208, and the second B frame 1202 would use 4/3 times that same motion vector as its offset.
If the nearest subsequent P frame referenced by a B frame in turn points to the next nearest previous P frame, simple frame distances are again used to obtain the appropriate frame scale fraction applied to the motion vector. FIG. 13 is a diagram showing a sequence of P and B frames in which the interpolation weights for the B frames are determined as a function of distance from a 1-away subsequent P frame that references a 2-away previous P frame. In this illustrative example, where M=3, the B frames 1300, 1302 reference the nearest subsequent P frame 1304, which in turn references the 2-away previous P frame 1306. Thus, for the first B frame 1300, the pixel offset is the frame scale fraction 2/6 times the motion vector mv from the nearest subsequent P frame 1304, and the second B frame 1302 would have a pixel offset of 1/6 times that same motion vector, since the motion vector of the nearest subsequent P frame 1304 points to the 2-away previous P frame 1306, which is 6 frames distant.
In general, in the case of a B frame referencing a single P frame in direct mode, the frame distance method sets the numerator of the frame scale fraction equal to the frame distance of that B frame to its referenced or "target" P frame, and sets the denominator equal to the distance from the target P frame to another P frame referenced by the target P frame. The sign of the frame scale fraction is negative for measurements made from a B frame to a subsequent P frame, and positive for measurements made from a B frame to a previous P frame. This simple method of applying frame distance or frame scale fraction to P-frame motion vectors enables efficient direct mode coding.
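A sketch of this rule follows; frame positions are display-order indices, the sign convention is exactly the one stated above, and the names and example geometry are illustrative.

```python
def frame_scale_fraction(b_time, target_p_time, ref_p_time):
    """Direct-mode frame scale fraction per the rule above: numerator is the
    B frame's distance to its target P frame; denominator is the target P
    frame's distance to the P frame its motion vector references. The sign
    is negative when the target P frame is subsequent to the B frame."""
    numerator = abs(target_p_time - b_time)
    denominator = abs(target_p_time - ref_p_time)
    sign = -1.0 if target_p_time > b_time else 1.0
    return sign * numerator / denominator

# FIG. 15 geometry (display-order times assumed: P1=0, B=1, P2=3, P3=6):
assert frame_scale_fraction(1, 6, 3) == -5/3   # B's term toward P3 via mv1
assert frame_scale_fraction(1, 3, 0) == -2/3   # B's term toward P2 via mv3
```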
Furthermore, another aspect of this invention allows direct mode to be applied to the interpolated motion vector references of a P frame. For example, if a P frame was interpolated from the nearest and next nearest previous P frames, direct mode referencing according to this aspect of the invention allows an interpolated blend for each multiple-reference direct-mode B frame macroblock. In general, suitable frame scale fractions can be applied to each of the two or more motion vectors of such a P frame. The two or more frame-distance-modified motion vectors can then be used, with corresponding interpolation weights for each B frame that references that P frame (as described below), to generate interpolated B frame macroblock motion compensation.
FIG. 14 is a diagram showing a sequence of P and B frames in which a subsequent P frame has multiple motion vectors referencing prior P frames. In this example, the B frame 1400 references a subsequent P frame P3. This P3 frame, in turn, has two motion vectors mv1 and mv2, which reference corresponding previous P frames P2 and P1. In this example, each macroblock of the B frame 1400 may be interpolated in direct mode using either of the two weighting terms or a combination of such weighting terms.
Each macroblock of the B frame 1400 may be constructed as a blend of:
● the corresponding pixels of frame P2, displaced by frame scale fraction 1/3 of mv1 (and then optionally multiplied by some scale weight i), plus the corresponding pixels of frame P3, displaced by frame scale fraction -2/3 of mv1 (and then optionally multiplied by some scale weight j); and
● the corresponding pixels of frame P1, displaced by frame scale fraction 2/3 (i.e., 4/6) of mv2 (and then optionally multiplied by some scale weight k), plus the corresponding pixels of frame P3, displaced by frame scale fraction -1/3 (i.e., -2/6) of mv2 (and then optionally multiplied by some scale weight l).
As with all direct modes, motion vector deltas may be used with each of mv1 and mv2.
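The following minimal sketch renders the FIG. 14 blend in code, assuming whole-pixel displacements and illustrative frames, motion vectors, deltas, and weights (a real implementation would use sub-pel displacement filters and signaled parameters).

```python
import numpy as np

def fetch(frame, y, x, size, dy, dx):
    """Hypothetical motion-compensated fetch: read a size x size block at
    (y, x) displaced by (dy, dx), rounded to whole pixels for simplicity."""
    yy, xx = y + int(round(dy)), x + int(round(dx))
    return frame[yy:yy + size, xx:xx + size].astype(float)

def direct_term(frame, y, x, size, mv, frac, delta, weight):
    """One weighted direct-mode term: displacement = frac * mv + delta."""
    return weight * fetch(frame, y, x, size,
                          frac * mv[0] + delta[0], frac * mv[1] + delta[1])

# FIG. 14 blend for one 16x16 macroblock at (y, x); frames, vectors, deltas,
# and weights i, j, k, l are all illustrative placeholders.
P1, P2, P3 = (np.zeros((64, 64)) for _ in range(3))
y = x = 16
mv1, mv2 = (3.0, 6.0), (6.0, 12.0)          # mv1: P3->P2, mv2: P3->P1
d1 = d2 = (0.0, 0.0)                        # per-macroblock deltas
i = j = k = l = 0.25
pred = (direct_term(P2, y, x, 16, mv1,  1/3, d1, i) +
        direct_term(P3, y, x, 16, mv1, -2/3, d1, j) +
        direct_term(P1, y, x, 16, mv2,  2/3, d2, k) +
        direct_term(P3, y, x, 16, mv2, -1/3, d2, l))
```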
According to this aspect of the invention, direct-mode-predicted macroblocks in B frames can also reference multiple subsequent P frames, using the same interpolation methods and motion vector frame scale fraction applications as in the case of multiple previous P frames. FIG. 15 is a diagram showing a sequence of P and B frames in which the nearest subsequent P frame has a motion vector referencing a previous P frame, and the next nearest subsequent P frame has multiple motion vectors referencing previous P frames. In this example, the B frame 1500 references two subsequent P frames, P2 and P3. Frame P3 has two motion vectors, mv1 and mv2, referencing the corresponding previous P frames P2 and P1. Frame P2 has one motion vector, mv3, referencing the previous P frame P1. In this example, each macroblock of the B frame 1500 is interpolated in direct mode using three weighting terms. Note that in this configuration the motion vector frame scale fractions may be greater than 1 or less than -1.
This form of weighting for direct-mode B frame macroblock interpolation can take advantage of the full generality of the interpolation described herein. In particular, each weight or combination of weights can be tested for best performance (e.g., quality versus number of bits) over various image units. The interpolation fraction set selected for this improved direct mode can be signaled to the decoder with a small number of bits per image unit.
Each macroblock of the B frame 1500 may be constructed as a blend of:
● the corresponding pixels of frame P3, displaced by frame scale fraction -5/3 of mv1 (and then optionally multiplied by some scale weight i), plus the corresponding pixels of frame P2, displaced by frame scale fraction -2/3 of mv1 (and then optionally multiplied by some scale weight j);
● the corresponding pixels of frame P3, displaced by frame scale fraction -5/6 of mv2 (and then optionally multiplied by some scale weight k), plus the corresponding pixels of frame P1, displaced by frame scale fraction 1/6 of mv2 (and then optionally multiplied by some scale weight l); and
● the corresponding pixels of frame P2, displaced by frame scale fraction -2/3 of mv3 (and then optionally multiplied by some scale weight m), plus the corresponding pixels of frame P1, displaced by frame scale fraction 1/3 of mv3 (and then optionally multiplied by some scale weight n).
As with all direct modes, motion vector deltas may be used with each of mv1, mv2, and mv3.
Note that a particularly advantageous direct coding mode often occurs when the next nearest subsequent P frame references the nearest P frame that surrounds a candidate B frame.
Direct mode coding of B frames in MPEG-4 always uses the motion vectors of the subsequent P frame as its reference. According to another aspect of the invention, a B frame may instead reference the motion vectors of the co-located macroblock of a previous P frame, which sometimes proves the beneficial choice of direct mode coding reference. In that case, the motion vector frame scale fraction will be greater than 1 when the motion vector of the nearest previous P frame references the next nearest previous P frame. FIG. 16 shows a sequence of P and B frames in which the nearest previous P frame has a motion vector referencing a prior P frame. In this example, the B frame 1600 references the nearest (1-away) previous P frame P2, whose motion vector mv references the next nearest (2-away, relative to B frame 1600) previous P frame P1. The applicable frame scale fractions are shown.
If the nearest previous P frame is itself interpolated from multiple motion vectors and frames, methods similar to those described in connection with FIG. 14 are applied to obtain the motion vector frame scale fractions and interpolation weights. FIG. 17 is a diagram showing a sequence of P and B frames in which the nearest previous P frame has two motion vectors referencing prior P frames. In this example, the B frame 1700 references the previous P frame P3. One motion vector, mv1, of the previous frame P3 references the next nearest previous P frame P2, and the second motion vector, mv2, references the 2-away previous P frame P1. The applicable frame scale fractions are shown.
Each macroblock of the B frame 1700 may be constructed as a blend of:
● the corresponding pixels of frame P3, displaced by frame scale fraction 1/3 of mv1 (and then optionally multiplied by some scale weight i), plus the corresponding pixels of frame P2, displaced by frame scale fraction 4/3 of mv1 (and then optionally multiplied by some scale weight j); and
● the corresponding pixels of frame P3, displaced by frame scale fraction 1/6 of mv2 (and then optionally multiplied by some scale weight k), plus the corresponding pixels of frame P1, displaced by frame scale fraction 7/6 of mv2 (and then optionally multiplied by some scale weight l).
When the motion vector of a previous P frame (relative to the B frame) points to the next nearest previous P frame, it is not necessary to use only the next nearest previous frame as the interpolation reference, as was done in FIG. 16. The nearest previous P frame may prove the better choice for motion compensation. In that case, the motion vector of the nearest previous P frame is shortened by the frame distance fraction from the B frame to that P frame. FIG. 18 is a diagram showing a sequence of P and B frames in which the nearest previous P frame has a motion vector referencing a prior P frame. In this example, with M=3, the first B frame 1800 would use the frame distance fractions 1/3 and -2/3 times the motion vector mv of the nearest previous P frame P2. The second B frame 1802 would use the frame distance fractions 2/3 and -1/3 (not shown). This choice is signaled to the decoder, distinguishing this case from the one shown in FIG. 16.
As with all other encoding modes, the use of direct mode preferably includes testing candidate modes based on other available interpolation and one-vector encoding modes and reference frames. For direct mode testing, the nearest subsequent P frame (and optionally the next nearest subsequent P frame or even more distant subsequent P frames, and/or one or more preceding P frames) may be tested as a candidate frame, and a small number of bits (typically one or two) may be used to specify the direct mode P reference frame distance used by the decoder.
Extended interpolation values
The MPEG-1, 2, and 4 and H.261 and H.263 standards specify that B frames use an equal weighting of the pixel values of the forward and backward referenced frames, as displaced by the motion vectors. Another aspect of the present invention includes the application of various useful unequal weightings that can significantly improve B frame coding efficiency, and the extension of such unequal weightings to more than two references, including two or more references in the forward or backward temporal direction. This aspect of the invention also includes methods whereby P frames reference, and interpolate from, more than one previous frame. Furthermore, when two or more references point forward in time, or when two or more references point backward in time, it sometimes proves useful to use negative weights as well as weights exceeding 1.0.
For example, FIG. 19 shows a frame sequence of three P frames P1, P2, and P3, in which P3 uses an interpolated reference with two motion vectors, one for each of P1 and P2. If, for instance, a continuous change is occurring over the frame span between P1 and P3, then P2-P1 (i.e., the pixel values of frame P2, as displaced by P2's motion vector, minus the pixel values of frame P1, as displaced by P1's motion vector) will equal P3-P2. Similarly, P3-P1 will be twice the magnitude of P2-P1 and of P3-P2. In such a case, the pixel values of frame P3 can be predicted differentially from P2 and P1 through the formula:
P3=P1+2x(P2-P1)=(2xP2)-P1
In this case, the interpolation weights for P3 are 2.0 for P2 and -1.0 for P1.
As another example, FIG. 20 shows a frame sequence of four P frames P1, P2, P3, and P4, in which P4 uses an interpolated reference with three motion vectors, one for each of P1, P2, and P3. Thus three motion vectors and three interpolation weights are applied in predicting P4 from P3, P2, and P1. In this case, if a continuous change is occurring over this frame span, P2-P1 will equal both P3-P2 and P4-P3, while P4-P1 will equal both 3x(P2-P1) and 3x(P3-P2).
Thus, in this example, P4 based on P2 and P1 is predicted as:
P4=P1+3x(P2-P1)=(3xP2)-(2xP1) (weights 3.0 and -2.0)
P4 based on P3 and P1 is predicted as:
P4=P1+3/2x(P3-P1)=(3/2xP3)-(1/2xP1) (weights 1.5 and -0.5)
P4 based on P3 and P2 is predicted as:
P4=P2+2x(P3-P2)=(2xP3)-P2 (weights 2.0 and -1.0)
However, it may be that the changes nearest P4 (i.e., those involving P3 and P2) are more reliable predictors of P4 than changes involving P1. Thus, weighting each of the two terms above that involve P1 by 1/4, and weighting the term involving only P3 and P2 by 1/2, yields:
1/2(2P3-P2)+1/4(3/2P3-1/2P1)+1/4(3P2-2P1)=
11/8P3+1/4P2-5/8P1 (weights 1.375, 0.25, and -0.625; note that the weights sum to 1.0)
Accordingly, it is sometimes useful to use weights greater than 1.0 as well as weights less than zero. At other times, when there is noise-like variation from one frame to the next, a positive weighted average with moderate coefficients between 0.0 and 1.0 may yield the best prediction of P4's macroblocks (or other pixel regions). For example, an equal weighting of 1/3 each for P1, P2, and P3 in FIG. 20 may in some circumstances form the best prediction of P4.
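A minimal sketch of such weighted multi-reference prediction follows, reproducing the FIG. 19 extrapolation and the blended weighting above under an assumed continuous change of +10 code values per frame (the flat-frame values are illustrative only).

```python
import numpy as np

def weighted_prediction(refs, weights):
    """Weighted sum of motion-compensated reference regions; weights may be
    negative or exceed 1.0 when extrapolating a continuous change."""
    return sum(w * r for w, r in zip(weights, refs))

# FIG. 19 extrapolation: P3 predicted as 2.0*P2 - 1.0*P1
p1 = np.full((8, 8), 100.0)
p2 = np.full((8, 8), 110.0)
assert weighted_prediction([p2, p1], [2.0, -1.0]).mean() == 120.0

# The blended FIG. 20 example: weights 1.375, 0.25, -0.625 (summing to 1.0)
p3 = np.full((8, 8), 120.0)
p4 = weighted_prediction([p3, p2, p1], [1.375, 0.25, -0.625])
assert p4.mean() == 130.0   # continues the +10-per-frame change
```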
Note that the regions of P1, P2, and P3 used in these computations are the regions determined by the best-match motion vectors. In some cases the best match will be an AC match, which allows a changing DC term to be predicted through the AC coefficients. Alternatively, if DC matching (e.g., Sum of Absolute Differences) is used, changes in the AC coefficients can often be predicted. In other cases, various forms of motion vector matching will form the best prediction with various weight blends. In general, the best prediction for a particular case is determined empirically using the methods described herein.
These techniques also apply to B frames having two or more motion vectors pointing either backward or forward in time. When pointing forward in time, the pattern of coefficients described above for P frames is reversed in order to accurately predict the current B frame. Using this aspect of the invention, it is possible to have two or more motion vectors in both the forward and backward directions, predicting in both directions at once. A suitable weighted blend of these various predictions can be optimized by selecting the blend weights that best predict the macroblocks (or other pixel regions) of the current B frame.
FIG. 21 is a diagram showing a sequence of P and B frames in which different P frames have one or more motion vectors referencing different preceding P frames, and also showing different weights a-e assigned to respective forward and backward references referenced by a particular B frame. In this example, the B frame 2100 references three previous P frames and two subsequent P frames.
In the example shown in FIG. 21, frame P5 must be decoded before the B frame 2100 in order for this example to work. Thus it is sometimes useful to order frames in the bitstream in the order needed for decoding ("transmission order"), which is not necessarily the order of display ("display order"). For example, in a frame sequence exhibiting cyclic motion (e.g., rotation of an object), a particular P frame may be more similar to a distant P frame than to the nearest subsequent P frame. FIG. 22 is a diagram showing a sequence of P and B frames in which the P frames are transmitted in a bitstream order that differs from the display order. In this example, frame P3 is more similar to frame P5 than to frame P4. It is therefore useful to transmit and decode P5 before P4, but to display P4 before P5. Preferably, each P frame should signal to the decoder when it can be dropped (e.g., after n frames in bitstream order, or after frame X in display order).
If weights are selected from a small set of choices, a small number of bits can inform the decoder which weight to use. This can be signaled to the decoder once per image unit, as with all other weights described herein, or at any other point in the decoding process where weight changes are useful.
It is also possible to download new sets of weights. In this manner, a small number of weight sets can be active at any given time, allowing a small number of bits to signal to the decoder which of the active weight sets is to be used at any given point in the decoding process. To determine appropriate sets of weights, a large number of weights can be tested during encoding. If a small subset is found to provide high efficiency, the decoder can be informed to use that subset; a particular element of the subset can then be signaled with only a few bits. For example, 10 bits can select one of 1024 subset elements. Whenever the subset is changed in order to maintain efficiency, the new subset can be conveyed to the decoder. Thus, the encoder can dynamically trade off the number of bits required to select among the elements of a weight-set subset against the number of bits required to update that subset. In addition, a small number of short codes can represent commonly useful weights, such as 1/2, 1/3, 1/4, and so forth. In this manner, a small number of bits can select a weight set for K-forward-vector prediction of P frames (where K = 1, 2, 3, ...), or for K-forward-vector and L-backward-vector prediction of B frames (where K and L are selected from 0, 1, 2, 3, ...), as a function of the current M value (i.e., the relative position of the B frame with respect to the neighboring P (or I) frames).
Fig. 23 is a diagram showing a sequence of P and B frames with assigned weights. The B frame 2300 has weights a-e whose values are assigned from a table of B frame weight sets 2302. The P frame 2304 has weights m and n, the values of which are assigned from a table of P frame weight sets 2306. Some weights may be static (i.e., downloaded to the decoder unchanged) and signaled by the encoder. Other weights may be dynamically downloaded and then notified.
This same technique can be used to dynamically update weight sets that select DC interpolation versus AC interpolation. Coded values can also select between normal (linear) interpolation of pixel values (which are typically in a non-linear representation) and linear interpolation of transformed values (in an alternate linear or non-linear representation). Similarly, such coded values can indicate which such interpolation applies to the AC values, which to the DC values, and whether the prediction is split into separate AC and DC portions.
Active subsets can also be used to minimize the number of bits needed to select among the weight sets currently in use. For example, if 1024 downloaded weight sets are stored in the decoder, perhaps 16 need to be active during a particular portion of a frame. By selecting which subset of 16 weight sets (out of the 1024) is active, only 4 bits are needed to choose among the 16 active weight sets. The most common subsets can also be represented by short codes, allowing a small number of bits to select among them.
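The following sketch illustrates this stored-table / active-subset scheme; the class layout, names, and sizes are assumptions for illustration, not a decoder specification.

```python
import math

class WeightSetTable:
    """Sketch of the downloaded-weight-set scheme: a large stored table, a
    small 'active subset' selected occasionally, and per-image-unit
    selection within that subset using only log2(subset size) bits."""
    def __init__(self, stored_sets):
        self.stored = stored_sets                  # e.g., up to 1024 sets
        self.active = list(range(min(16, len(stored_sets))))

    def select_active_subset(self, indices):       # signaled rarely
        self.active = list(indices)

    def bits_per_selection(self):                  # cost per image unit
        return math.ceil(math.log2(len(self.active)))

    def lookup(self, code):
        return self.stored[self.active[code]]

table = WeightSetTable([[0.5, 0.5], [2.0, -1.0], [1/3, 1/3, 1/3], [0.75, 0.25]])
assert table.bits_per_selection() == 2             # 4 active sets -> 2 bits
assert table.lookup(1) == [2.0, -1.0]
```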
Softening and sharpening
Besides simply separating the DC component from the AC signal by subtracting the average, other filtering operations are possible during motion-vector-compensated prediction. For example, various high-pass, band-pass, and low-pass filters can be applied to a pixel region (such as a macroblock) to extract various frequency bands, and those frequency bands can then be modified as motion compensation is performed. For example, it may often be useful to filter away the highest frequencies of a noisy moving image in order to soften it (i.e., make it less sharp or slightly blurred). A softer image, combined with a steeper tilt matrix for quantization (a steeper tilt matrix discounts the higher frequencies, and thus the higher-frequency noise, in the current block), generally yields a more efficient coding method. It is already possible to signal a change of quantization tilt matrix for each image unit, and it is also possible to download custom tilt matrices for luminance and chrominance. Note that the efficiency of motion compensation can be improved whether or not the tilt matrix is changed; however, it is often most effective to change both the tilt matrix and the filter parameters applied during motion compensation.
It is common practice to use reduced chroma coding resolution together with chroma-specific tilt matrices. In current practice, however, the chroma coding resolution is static (e.g., 4:2:0, coding at half resolution both horizontally and vertically, or 4:2:2, coding at half resolution horizontally only). According to this aspect of the invention, coding efficiency can be improved by applying dynamic filter processing to chrominance and luminance (independently or together) during motion compensation, selected once per image unit.
U.S. patent application No. 09/545233 (cited above) describes the use of improved displacement filters having negative lobes (truncated sinc functions). These filters have the advantage of preserving sharpness when performing the fractional-pixel portion of motion vector displacements. However, at both integer and fractional pixel displacement points, some macroblocks (or other useful image regions) may be more optimally displaced using filters that decrease or increase their sharpness. For example, in a "zoom" in which some objects in a frame move out of focus while other parts of the frame come into focus over time, the transition is one of changing sharpness and softness. A motion compensation filter that can increase sharpness in some regions of an image while decreasing it in others can therefore improve coding efficiency. In particular, if a region of an image is going out of focus, it may be beneficial to decrease sharpness, which softens the image (thereby potentially producing a better match) and reduces grain and/or noise (thereby potentially improving coding efficiency). If a region of the image is coming into focus, it may be beneficial to preserve best sharpness, or even to increase sharpness using filters with larger negative lobes.
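As an illustration, the following hypothetical separable displacement filter uses a truncated sinc whose negative lobes can be scaled down to trade sharpness for softness; the tap count and the single "sharpness" control are assumptions for this sketch, not the filters of the cited application.

```python
import numpy as np

def displacement_filter(frac, taps=6, sharpness=1.0):
    """Hypothetical sub-pel displacement kernel: a truncated sinc whose
    negative lobes are scaled by 'sharpness' (1.0 keeps the full lobes for
    a sharp result; 0.0 suppresses them, softening the prediction)."""
    n = np.arange(-(taps // 2) + 1, taps // 2 + 1) - frac
    kernel = np.sinc(n)
    kernel[kernel < 0] *= sharpness   # shrink negative lobes to soften
    return kernel / kernel.sum()      # normalize to unity DC gain

sharp = displacement_filter(0.5, sharpness=1.0)   # retains negative lobes
soft = displacement_filter(0.5, sharpness=0.0)    # all-positive, blurrier
```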
Chroma filtering can also benefit from both increases and decreases in sharpness during coding. For example, many of the coding efficiency benefits of 4:2:0 coding (half-resolution chroma horizontally and vertically) can be achieved while retaining full resolution in the U and/or V channels by using softer motion compensation filters for chroma. The sharpest displacement filters need be selected only where the color detail in the U and V channels is high; where color noise or grain is high, softer filters will be more beneficial.
In addition to focus changes, changes in direction and in motion blur from one frame to the next are also common. At the 24 fps rate of motion picture film, even a simple dialog scene can change significantly in motion blur from frame to frame. For example, an upper lip may smear in one frame yet be sharp in the next, purely as a function of lip motion during the camera's open-shutter time. For such motion blur, it is beneficial to have not only sharpening and softening (blurring) filters during motion compensation, but also control of the direction in which they sharpen or soften. For example, if the direction of motion can be determined, softening or sharpening along that direction can be applied to correspond to image features beginning or ceasing to move. The motion vectors used for motion compensation can themselves provide useful information about the amount of motion, and about the change in the amount of motion (i.e., motion blur) of a particular frame region relative to the corresponding regions of surrounding frames. Note, however, that a motion vector is the best motion match between P frames, whereas motion blur results from motion within a frame during the open-shutter time.
FIG. 24 is a graph of the position of an object within a frame versus time. The camera shutter is open during only part of each frame time; any motion of the object while the shutter is open causes blur. The amount of motion blur is given by the amount of position change during the shutter-open time; thus, the slope of the position curve 2400 while the shutter is open is a measure of motion blur.
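A worked version of this relationship, under the assumption that the motion rate is roughly constant over the frame time:

```python
def motion_blur_extent(motion_px_per_frame, shutter_open_fraction):
    """Blur extent in pixels: the slope of the position curve (motion per
    frame time) times the fraction of the frame time the shutter is open
    (e.g., a 180-degree film shutter gives 0.5)."""
    return abs(motion_px_per_frame) * shutter_open_fraction

# e.g., 12 pixels/frame of motion with a 180-degree shutter smears ~6 pixels
assert motion_blur_extent(12.0, 0.5) == 6.0
```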
The amount and direction of motion blur can also be determined from some combination of sharpness metrics, the motion vectors of surrounding image regions (where those regions match), detection of feature smear, and human-assisted designation of frame regions. A filter can then be selected based on the determined amount and direction of motion blur; for example, the mapping from determined blur amounts and directions to particular filters can be established empirically.
When combined with the other aspects of the invention, such intelligently applied filters can significantly improve compression coding efficiency. A small set of such filters can be defined and selected among using a small number of bits conveyed to the decoder, again once per image unit or at other useful points in the decoding process. As with the weight sets, a dynamically loaded set of filters can be used, together with the subset mechanism described above, to minimize the number of bits needed to select among the most beneficial sets of filter parameters.
Implementation
The invention may be implemented in hardware or software, or a combination of both (e.g., a programmable logic array). Unless otherwise specified, the algorithms included as part of the invention are not inherently related to any particular computer or other apparatus. In particular, various general-purpose devices may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus (e.g., an integrated circuit) to perform a particular function. Thus, the invention may be implemented in one or more computer programs executing on one or more programmable computer systems each comprising at least one processor, at least one data storage system (including persistent and volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. Program code is applied to input data to perform the functions described herein and generate output information. The output information is applied to one or more output devices in a well-known manner.
Each such program may be implemented in any desired computer language (including machine, assembly, or high level procedural, logical, or object oriented programming languages) to communicate with a computer system. In any case, the language may be a compiled or interpreted language.
Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, the storage medium so configured causing a computer system to operate in a specific and predefined manner to perform the functions described herein.
Various embodiments of the present invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. For example, some of the steps described above may be order independent, and thus may be performed in a different order than described. Accordingly, other embodiments are within the scope of the following claims.
Claims (2)
1. A video image compression system comprising a decoder configured to process a sequence of predicted frames and bi-predicted frames, each frame comprising pixel values arranged in macroblocks, wherein at least one macroblock within a bi-predicted frame is determined using direct mode prediction based on motion vectors from two or more predicted frames following the bi-predicted frame in display order, and wherein the motion vector of the at least one macroblock is determined by applying a motion vector frame scale fraction less than-1.
2. A video image compression system comprising a decoder configured to process a sequence of predicted frames and bi-predicted frames, each frame comprising pixel values arranged in macroblocks, wherein at least one macroblock within a bi-predicted frame is determined using direct mode prediction based on motion vectors from two or more predicted frames that precede the bi-predicted frame in display order, and wherein the motion vector of the at least one macroblock is determined by applying a motion vector frame scale fraction greater than 1.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10/187,395 (US7266150B2) | 2001-07-11 | 2002-06-28 | Interpolation of video compression frames |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1163989A1 (en) | 2012-09-14 |
| HK1163989B (en) | 2016-05-27 |