
GB2547947A - Video encoding - Google Patents

Video encoding

Info

Publication number
GB2547947A
GB2547947A GB1603782.2A GB201603782A GB2547947A GB 2547947 A GB2547947 A GB 2547947A GB 201603782 A GB201603782 A GB 201603782A GB 2547947 A GB2547947 A GB 2547947A
Authority
GB
United Kingdom
Prior art keywords
values
quantisation
quantisation step
data points
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1603782.2A
Other versions
GB201603782D0 (en)
Inventor
Mrak Marta
Naccari Matteo
Zupancic Ivan
Izquierdo Ebroul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Broadcasting Corp
Original Assignee
British Broadcasting Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by British Broadcasting Corp filed Critical British Broadcasting Corp
Priority to GB1603782.2A priority Critical patent/GB2547947A/en
Publication of GB201603782D0 publication Critical patent/GB201603782D0/en
Publication of GB2547947A publication Critical patent/GB2547947A/en
Withdrawn legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/109Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/11Selection of coding mode or of prediction mode among a plurality of spatial predictive coding modes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/147Data rate or code amount at the encoder output according to rate distortion criteria
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/149Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/189Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
    • H04N19/192Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding the adaptation method, adaptation tool or adaptation type being iterative or recursive
    • H04N19/194Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding the adaptation method, adaptation tool or adaptation type being iterative or recursive involving only two passes

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method of video encoding where, in a single pre-encoding pass: at least some blocks of a video region are encoded with either a quantisation step Q or a predictor varying between the blocks. A coding rate R is measured for each block to provide a set of data points and the set of data points is used to derive a parametrised function which is used to select the predictor and/or the quantisation step. The method may also comprise determining a distortion for each block and using the rate and the distortion to derive the parameterised curve. The parametrized function may be a quadratic or logarithmic function. The method may be used to assess spatial or temporal predictors.

Description

VIDEO ENCODING FIELD OF THE INVENTION
This invention is related to video compression and decompression systems, notably to a method and apparatus to reduce the complexity associated with the search for the best encoding parameters to maximise the compression efficiency for a given video content.
BACKGROUND TO THE INVENTION
This invention is directed to the video compression area which aims to reduce the bitrate required to transmit and store video content while at the same time maintaining an acceptable visual quality. Data compression is achieved by exploiting three types of redundancies present in video signals.
The first type is spatial redundancy and relates to pixels having similar intensity values in a given image region. Spatial redundancy can be exploited with two coding tools which can be combined together to improve data reduction. The first tool is spatial prediction where a pixel intensity value is predicted by the intensity value of a set of pixels located in its spatial neighbourhood. The other tool is frequency transformation which represents a group of image pixels with some coefficients. The selected transform should have energy compaction properties so that a limited number of significant coefficients can be used to represent image pixels.
The second type of redundancy is temporal and relates to pixels in a given image area not changing their values in temporally adjacent video frames. This redundancy can be exploited by predicting the value of each pixel in a given region by using the pixels in the same region but in a different frame. Moreover, rather than limiting predictors to the same region of temporally adjacent frames, some spatial displacement can be used so that moving objects can be tracked and ultimately better predicted. Also, non-temporally adjacent frames can be considered to predict periodic motion.
Once a predictor (spatial or temporal) is available, it is subtracted from the pixel values and a frequency transform is applied over the residuals. Data reduction takes place by scaling or discarding some transform coefficients. This latter process is called quantisation and may consist of dividing each coefficient by a quantity called the quantisation step and then rounding the result to the nearest integer value. The quantised values are called reproduction levels.
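Purely as an illustrative sketch (not taken from the patent itself), the uniform quantisation just described, division by the quantisation step followed by rounding to the nearest integer, can be written as:

```python
# Illustrative sketch of uniform quantisation as described above:
# divide each transform coefficient by the quantisation step and
# round to the nearest integer to obtain reproduction levels.

def quantise(coefficients, q_step):
    """Map transform coefficients to reproduction levels."""
    return [round(c / q_step) for c in coefficients]

def dequantise(levels, q_step):
    """Approximate reconstruction used by the decoder."""
    return [level * q_step for level in levels]

coeffs = [103.0, -47.5, 12.0, 3.2, -0.8]
levels = quantise(coeffs, q_step=10.0)
recon = dequantise(levels, 10.0)
```

Note how small coefficients collapse to a reproduction level of zero, which is where the data reduction comes from.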
The third type of redundancy is called inter symbol redundancy and is related to the frequency of a symbol appearing in a given string of data. The rationale is that more frequent symbols will be represented with shorter binary coding words. The mapping between symbols and binary code words is applied over the aforementioned reproduction levels obtained after quantisation.
Usually, compression exploiting temporal redundancy is called inter coding, compression exploiting spatial redundancy is called intra coding, and compression exploiting inter symbol redundancy is called entropy coding.
Video compression standards such as those belonging to the ISO/IEC MPEG or ITU-T VCEG families specify inter, intra and entropy coding tools. For each tool different coding modes can be specified. In some frames of the video sequence these tools are all available and therefore a video encoder should select the best tool to maximise the coding efficiency. Moreover, a different quantisation step can be specified for each frame or even image area. Therefore the encoder also has to take this additional degree of freedom into account when maximising the coding efficiency.
The selection of the best coding configuration for a given image area, i.e. coding tool, coding mode and quantisation step, can be performed by testing all possibilities and for each one of them computing the coding rate and/or associated distortion. These two quantities can be combined in a cost function representing the coding efficiency which is then optimised by the encoder.
While this brute force approach can provide the optimal coding performance, it involves a significant amount of complexity which may not be affordable in some video coding applications. This complexity grows even higher when the dependencies between choices made on different frames and/or image areas are also considered. The overall search space associated with all possible coding modes and their possible interactions can grow significantly, so techniques to reduce the number of possibilities tested and limit the computational complexity of the encoder would be useful.
SUMMARY OF THE INVENTION
This invention aims to reduce the encoder complexity associated with the selection of the best coding configuration. One example of coding configuration may consist of the selection of a coding mode and quantisation step for a given image area. Accordingly, practical encoders select these two parameters by minimising a cost C given as:
C = D + λ*R, where D denotes the distortion measure used to quantify video quality, R is the number of bits spent to encode the image area and λ is a constant which controls the weight given to the number of coding bits. For each coding mode and quantisation step the cost C is evaluated, and the combination of coding mode and quantisation step which minimises the cost C is selected as the best. Video coding standards usually define several coding modes for intra and inter coding plus a range of values for the quantisation step. Therefore considering all the possible pairs [coding mode, quantisation step] may significantly increase the complexity associated with the selection of the best coding configuration. This complexity can increase further when dependencies among different image regions are also considered. More precisely, the minimisation of cost C may usefully also take into account how the choice of the coding configuration for one region may influence the choice for subsequently coded regions and the overall coding performance of the video codec. As an example, it is well known by those skilled in the art that for inter coding, quantising one image region with a lower quantisation step would lead to a better quality of the reconstructed pixels which can then be used for prediction, leading to lower energy residuals. In this case the additional bits spent when coding that image region can be saved later when compressing subsequent regions. Considering all possible coding configurations, including the dependencies among different image regions, leads to an N-dimensional search space which can be practically impossible to explore. A subsampling of this search space may reduce the number of possibilities to test. Moreover, if the subsampling is performed in such a way that the most relevant coding configurations are retained, a good approximation of the best coding configuration can still be found.
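As a hedged illustration of the exhaustive selection described above (the function names and the toy distortion/rate model below are invented for illustration, not part of the invention), evaluating C = D + λ*R over every (coding mode, quantisation step) pair looks like this:

```python
# Sketch of the brute-force selection described above: evaluate
# C = D + lam * R for every (coding mode, quantisation step) pair
# and keep the cheapest. The measure() callable standing in for a
# real encode-and-measure step is purely hypothetical.

def select_configuration(modes, q_steps, measure, lam):
    """Return the (mode, q) pair minimising C = D + lam * R."""
    best, best_cost = None, float("inf")
    for mode in modes:
        for q in q_steps:
            distortion, rate = measure(mode, q)
            cost = distortion + lam * rate
            if cost < best_cost:
                best, best_cost = (mode, q), cost
    return best, best_cost

# Toy stand-in: distortion grows with q, rate shrinks with q.
toy = lambda mode, q: (q * q, 1000.0 / q + (5 if mode == "intra" else 0))
config, cost = select_configuration(["intra", "inter"], [4, 8, 16], toy, lam=1.0)
```

The nested loops make the complexity problem visible: the work grows with the product of the number of modes and quantisation steps, before any inter-region dependencies are even considered.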
In another example, the quantisation step can be optimised so that the coding efficiency is maximised under a given constraint on the coding rate. More precisely, different quantisation step values can be used in different image regions and/or frames. The selection of the best quantisation step can be done either by minimising the cost C above or according to some relationship between the rate allocated for a given frame or image region and the quantisation step. Usually these relationships are parametrised with respect to some features associated with the video content being compressed. One example of a rate - quantisation step relationship is a quadratic model as follows: R = k1/Q + k2/Q^2, where constants k1 and k2 depend on the current video sequence. To derive these parameters a given image region should be coded with different quantisation steps, say Q1, Q2 and Q3, so that the rate - quantisation step model can be fitted and the quantisation step value can be computed even for rate values different from the ones associated with Q1, Q2 and Q3. It will be understood that the more quantisation step values are tested, the more accurate the curve associated with the rate - quantisation step model will be.
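Fitting the quadratic model R = k1/Q + k2/Q^2 from measured (R, Q) pairs is a standard least squares problem. The following sketch (an assumption of one possible implementation, solved in closed form via the 2x2 normal equations to stay dependency-free) shows how k1 and k2 could be recovered:

```python
# Sketch of fitting the quadratic rate model R = k1/Q + k2/Q**2
# from measured (R, Q) pairs via ordinary least squares, solved in
# closed form (2x2 normal equations) without external libraries.

def fit_rq_model(pairs):
    """pairs: iterable of (rate, q_step). Returns (k1, k2)."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for r, q in pairs:
        x1, x2 = 1.0 / q, 1.0 / (q * q)   # the two basis functions
        a11 += x1 * x1; a12 += x1 * x2; a22 += x2 * x2
        b1 += r * x1;   b2 += r * x2
    det = a11 * a22 - a12 * a12
    k1 = (b1 * a22 - b2 * a12) / det
    k2 = (a11 * b2 - a12 * b1) / det
    return k1, k2

# Synthetic data generated from k1=300, k2=1200 (exactly recoverable).
data = [(300.0 / q + 1200.0 / q**2, q) for q in (4.0, 8.0, 16.0)]
k1, k2 = fit_rq_model(data)
```

With three (R, Q) points and two unknowns the fit is overdetermined; with data lying exactly on the model, the least squares solution recovers the constants exactly.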
The same procedure of fitting the model curve from different quantisation step data can be repeated to find the best quantisation step for a set of video frames. Generally in video codecs frames are combined in a so-called Group of Pictures (GOP) whereby the selection of the best quantisation step can be optimised by considering all possible combinations of quantisation step values over all frames belonging to the GOP. This kind of optimisation is usually performed for rate control purposes where a given coding rate is assigned to the current GOP and the encoder has to maximise the video quality under this rate constraint. As described above, testing all possible combinations may involve a high amount of complexity which may not be affordable in practical implementations. The same rate - quantisation step relationship can be used to determine the amount of quantisation to be applied to each frame. However, the aforementioned problem of fitting this relationship over some real data computed from the actual GOP still exists. Therefore also in this case, low complexity techniques should be devised to limit the encoder complexity and meet application requirements such as low delay or low power consumption. Besides maximising the video quality for a given target rate, the same considerations can be applied when a given target video quality is set and the encoder has to minimise the coding rate. In this case, the encoder can use a first relationship between quality and rate and then the aforementioned quantisation step - rate relationship. The first relationship is used to determine the rate needed to achieve a given target quality while the second is employed to compute the quantisation step to be applied to achieve that coding rate. Both relationships are usually parametrised with respect to the sequence characteristics and, as explained above, some encodings should be run in order to collect the data to fit the model associated with each relationship.
Running multiple encodings to derive the actual points in the relationship curves usually involves a high amount of complexity.
To summarise, when the video quality should be maximised for a given target rate or the coding rate should be minimised for a given target video quality, the encoder can either test all possible coding modes and quantisation parameters or use relationships between coding parameters and rate to derive the value for these parameters to be used. The first case involves the full computational complexity since all possible values for coding parameters are tested. The second approach can limit the computational complexity but the used relationships rely on models which need to be initialised according to the characteristics associated with the content. Therefore, multiple encodings with a subset of coding parameter values should be performed as well in order to fit these models. This invention describes a method which requires only one encoding with a subset of coding parameters to initialise the models used in the aforementioned relationships.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 shows an example of Group Of Pictures (GOP) where the selection of the best coding configuration is performed.
Figure 2 shows an example of rate - quantisation step relationship.
Figure 3 shows one example of pattern used to initialise the model associated with the rate - quantisation step.
Figure 4 shows a generalisation of the pattern used to initialise the model associated with the rate - quantisation step.
Figure 5 shows an example of video quality - rate relationship.
Figure 6 shows one example of pattern used to initialise the model associated with the video quality - rate relationship.
Figure 7 shows one example of a pattern used to initialise the model associated with the coding mode and quantisation step - rate relationship.
Figure 8 shows an example of 3D surface when the coding mode and quantisation step determine a relationship with the coding rate.
Figure 9 shows one example of propagation of pixels associated with one coding block through subsequent frames.
DETAILED DESCRIPTION OF THE PRESENT INVENTION
The present invention is now described in detail by way of examples. Figure 1 shows a possible arrangement for the pictures to be encoded; such an arrangement is referred to as a Group Of Pictures (GOP). There is also shown the coding order of the frames, which is of course different from the display order. The arrows in the figure show the direction of prediction, the tip of each arrow identifying the frame used as a reference for prediction. It is worth noting the temporal distance of the reference frames for each of the frames in a GOP; it is appropriate to group frames based on the temporal distance of the reference frames since the effectiveness of the prediction can vary greatly based on that distance. In our example we can identify 4 groups: POC8, POC4, {POC2+POC6}, {POC1+POC3+POC5+POC7}. A GOP defines a hierarchy with temporal levels among different frame groups. A temporal level in a hierarchy is defined by the number of reference frames in that GOP needed to decode a frame in that level. Given the example in Figure 1, frames belonging to group POC8 are in the highest temporal level of the hierarchy since only POC0 frames are needed for decoding. Frames belonging to POC1+POC3+POC5+POC7 are in the lowest temporal level of the hierarchy since their decoding depends on the availability of POC0, POC8, POC4 and POC2+POC6 frames. These different levels of hierarchy in the GOP also define a so-called bitrate profile which identifies how many bits are spent for each picture. Generally, pictures in the low levels of the hierarchy will consume fewer bits since their associated reference frames are temporally closer than those of frames in the high levels of the hierarchy, where the reference frames are more distant in time. However, the actual distribution and relative amount of coding bits spent in each frame is content dependent.
The information associated with the bitrate profile can be efficiently exploited by a rate control algorithm which aims to maximise the video quality given a target rate. By knowing how many bits are required by each frame, the available bit budget can be distributed accordingly and the quantisation step can be obtained by using a rate - quantisation step relationship.
This relationship is usually quantified by a model whose parameters are content dependent and need to be frequently initialised, especially if a scene change happens in the video content being encoded. As an example, the quadratic model depicted in Figure 2 can be used as follows:
R = k1/Q + k2/Q^2, where R and Q denote the coding rate and quantisation step while constants k1 and k2 depend on the video content. As stated above, to compute the values for k1 and k2 some pairs (R, Q) should be available so that the constants can be estimated, for example by using least squares methods. To obtain each pair (R, Q) a GOP should be compressed with a given quantisation step Q and the coding rate measured afterwards. The accuracy of the parameters k1 and k2 depends on the number of pairs (R, Q) available, where usually the more pairs, the more accurate the values for k1 and k2. However, performing multiple encodings can be problematic because of the amount of complexity and associated delay.
The method described in this invention allows a given number of (R, Q) pairs to be derived using a single encoding pass. It will be understood from the embodiments of this invention that different relationships between quantisation step and rate can be considered and the aforementioned quadratic one is only an example.
Thus, instead of multiple encoding passes each using a different quantisation step, a single encoding pass is used with multiple quantisation step values arranged appropriately. Figure 3 shows one pattern that can be used to arrange multiple quantisation step values, in this example the quantisation step values Q1, Q2, Q3 and Q4. Each square in the figure denotes an image area which can be equivalent to a macroblock in the AVC standard or to a Coding Tree Unit (CTU) for HEVC. The pattern of four quantisation step values in Figure 3 is repeated over a grid of nonoverlapping blocks (i.e. macroblocks or CTUs) which extends over the whole video frame. The encoding is therefore performed with the quantisation step values Q1, ..., Q4 and the rate for each block is measured. When the encoding for the current GOP is finished, all the measured coding rates are grouped with respect to the quantisation step value of the associated block. Overall, there will be four (R, Q) pairs which can then be used to apply least squares and compute the estimates for constants k1 and k2.
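The single-pass idea above can be sketched as follows. This is an illustrative assumption of one possible arrangement, not the patent's own code: a 2x2 tile of steps (Q1..Q4) is repeated over the block grid, and after the pass the per-block rates are grouped by the step that produced them; block_rate() is a hypothetical stand-in for encoding a block and measuring its bits.

```python
# Sketch of the single-pass pattern of Figure 3: a 2x2 tile of
# quantisation steps repeated over the block grid. After the pass,
# per-block rates are grouped by quantisation step, yielding one
# averaged (R, Q) data point per step value.

def assign_steps(rows, cols, steps):
    """steps: (Q1, Q2, Q3, Q4) tiled as a repeating 2x2 pattern."""
    q1, q2, q3, q4 = steps
    tile = ((q1, q2), (q3, q4))
    return [[tile[r % 2][c % 2] for c in range(cols)] for r in range(rows)]

def collect_points(grid, block_rate):
    """Average measured rate per quantisation step -> [(R, Q), ...]."""
    sums, counts = {}, {}
    for r, row in enumerate(grid):
        for c, q in enumerate(row):
            sums[q] = sums.get(q, 0.0) + block_rate(r, c, q)
            counts[q] = counts.get(q, 0) + 1
    return [(sums[q] / counts[q], q) for q in sorted(sums)]

grid = assign_steps(4, 4, (10, 12, 14, 16))
points = collect_points(grid, lambda r, c, q: 5000.0 / q)  # toy rate model
```

The resulting (R, Q) points can then be fed to a least squares fit of the rate - quantisation step model, all from one encoding pass.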
The pattern in Figure 3 can be generalised as depicted in Figure 4. In this case more (R, Q) pairs can be used for the least squares fitting. Whilst the examples in Figure 3 and Figure 4 consider only four different values for Q, it will be understood from the embodiments of this invention that a lower or higher number of different quantisation steps can be used. Preferably the number of different quantisation step values should lie in the range 3 to 10 inclusive.
The values for Q1, Q2, Q3 and Q4 should be selected to maximise the trade-off between covering the spectrum of coding rate values and the efficiency of the coding process. To clarify this statement, consider the following example where (Q1, Q2, Q3, Q4) are equal to (Δ, 2Δ, 3Δ, 4Δ), with Δ denoting the minimum quantisation step value. These values cover a wide range of quantisation step values and the constants computed would permit a rate - quantisation step relation which spans a wide range of coding rates. However, given that the quantisation step of neighbouring blocks varies significantly, the coding efficiency can be compromised. In fact, the video quality associated with the GOP compressed with these highly varying quantisation steps can vary significantly across the frame, which translates into a lower quality of the predictors that will be used for intra- and inter-coding. Collecting (R, Q) pairs associated with a limited coding efficiency may lead to erroneous estimates of constants k1 and k2 and therefore of the whole model for the relationship. As a general rule, highly varying quantisation steps can be used when a rough, initial estimate of the constants is needed: in the case of rate control one example can be the initial estimate of the quantisation step to be used on the first frame of the sequence. Slightly varying quantisation step values can be used when an accurate estimate of the model parameters is needed, bounded to a limited range of coding rates.
For example, if a quantisation step Qinitial is selected as an initial estimate, the values Q1, Q2, Q3, Q4 may be selected to be relatively closely spaced around Qinitial.
It should be noted that the pattern in Figure 3 used to derive the model parameters effectively operates a subsampling of the search space associated with all combinations of coding rates and quantisation step values. In order to collect meaningful data, the subsampling should not prefer any particular image area. With a regular grid such as the one depicted in Figure 3, it is guaranteed that the statistics are collected without preferring any particular image area. Alternatively, a spatially random distribution of the values for the quantisation step can be used to collect the data associated with rate and quantisation step. That the data do not come from particular image areas can effectively be guaranteed by the random or pseudo-random distribution of the sampling points in the spatial dimension.
It will be understood that if the video sequence has different moving objects or is composited from different sources, then different sets of parameters may be needed to initialise the model for a given object. One example of multiple objects in the scene may be conversational video sequences where one object is the background while the other is the person speaking in the foreground. In this case, the quantisation step - rate relationship can be different for each object and the sampling can consider the presence of these two objects and be applied differently for background and foreground. The separation between foreground and background can be done with any segmentation algorithm known by those skilled in the art. Other separations, such as picture-in-picture, logos, or separations based on the rate-distortion performance when the region is compressed, may also be used.
The same pattern in Figure 3 or Figure 4 can be used to minimise the coding rate for a given target quality. Accordingly, Figure 5 depicts a relationship between video quality and rate. Assume that the video quality (VQ) is measured as the inverse of the distortion D between the original and the compressed video, i.e.:
VQ = 1 / D
If the distortion is the mean square error, then the video quality - rate relationship can be modelled as follows: R = k*ln(v^2 * VQ), where v^2 denotes the variance of the original video sequence and k is a constant which depends on the actual video content. The selection of the best coding mode which minimises the coding rate for a given value of VQ is given by minimising the cost C defined as follows:
C = D + λ*R, where λ is called the Lagrangian multiplier. The minimisation of C is obtained by computing its derivative and setting it to zero. From the formulation of R given above, it should be noted that this model is again parametrised.
To estimate k, some encodings should be run with different coding modes specified by the compression technology under consideration. As for the case of rate control, here too multiple encodings will introduce delay in the coding process and increase complexity. The pattern in Figure 3 can be used where this time what varies across different blocks is the coding mode. Accordingly, using coding modes M1, M2, M3 and M4 as indicated in Figure 6, the encoding can be run and the coding rate associated with the blocks related to each of the coding modes considered can be measured along with the video quality VQ. In this case there will be four pairs (R, VQ) which can then be used to compute the constant k via least squares techniques. A plot such as that shown in Figure 5 can be used to select an appropriate rate for a given acceptable level of video quality. In accordance with the invention, that rate can be used with reference to Figure 2 and the parameters k1 and k2 to select a quantisation step.
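Estimating the single constant k of the logarithmic model R = k*ln(v^2 * VQ) from the (R, VQ) pairs gathered with the mode pattern of Figure 6 is again a least squares problem; with one unknown it reduces to a simple ratio. The following is a hedged sketch of one way this could be done (the synthetic data and variance value are invented for illustration):

```python
import math

# Sketch of estimating the constant k of the logarithmic model
# R = k * ln(v**2 * VQ) by least squares from (R, VQ) pairs. With
# a single unknown, minimising sum (R - k*x)^2 over x = ln(v**2*VQ)
# gives k = sum(R*x) / sum(x*x).

def fit_k(pairs, variance):
    """pairs: iterable of (rate, vq). Returns the least squares k."""
    num = den = 0.0
    for r, vq in pairs:
        x = math.log(variance * vq)
        num += r * x
        den += x * x
    return num / den

# Synthetic pairs generated from k=50, variance=400 (recoverable).
pairs = [(50.0 * math.log(400.0 * vq), vq) for vq in (0.5, 1.0, 2.0, 4.0)]
k = fit_k(pairs, variance=400.0)
```

Once k is known, the model can be inverted to find the rate needed for a target VQ, and the rate - quantisation step model then yields the quantisation step.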
The value of λ calculated in accordance with the invention can be used in the conventional manner to select a coding mode for each block after testing each mode for each block.
It will be understood that the sampling patterns of Figures 3 and 6 can be combined to find the best coding mode M and quantisation step Q. Accordingly, the pattern in Figure 7 can be used, where this time the variables are pairs of quantisation step and coding mode (Q, M). For each pair there is a corresponding measured coding rate. All possible tuples (Q, M, R) define a 3D surface in which the pairs (Q, M) are the independent variables and R is the dependent one, as depicted in Figure 8. As explained above, to initialise the parameters associated with this 3D surface, a few pairs (Q, M) can be considered and the model fitted accordingly, as illustrated in Figure 8.
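One straightforward way to fit such a surface is to treat the mode as a categorical variable and fit one rate-vs-Q curve per mode. The sketch below assumes, purely for illustration, the quadratic form R = k1/Q + k2/Q^2 per mode; the sample tuples are invented.

```python
import numpy as np

# (Q, mode, measured R) tuples from the single pre-encoding pass, where
# each sampled block was coded with a different (Q, M) pair.
samples = [
    (10, "M1", 0.60), (20, "M1", 0.30), (40, "M1", 0.15),
    (10, "M2", 0.40), (20, "M2", 0.20), (40, "M2", 0.10),
]

def fit_per_mode(samples):
    """Group samples by coding mode and least-squares fit a hypothetical
    R = k1/Q + k2/Q**2 curve for each mode."""
    grouped = {}
    for q, m, r in samples:
        grouped.setdefault(m, ([], []))
        grouped[m][0].append(q)
        grouped[m][1].append(r)
    fitted = {}
    for m, (qs, rs) in grouped.items():
        qs = np.asarray(qs, dtype=float)
        A = np.column_stack([1.0 / qs, 1.0 / qs ** 2])
        fitted[m], *_ = np.linalg.lstsq(A, np.asarray(rs), rcond=None)
    return fitted

def predict_rate(fitted, q, m):
    """Interpolate the (Q, M, R) surface at an unsampled (Q, M) pair."""
    k1, k2 = fitted[m]
    return k1 / q + k2 / q ** 2

fitted = fit_per_mode(samples)
```

`predict_rate` can then be evaluated at any candidate (Q, M) pair, so the best combination can be chosen without encoding every block with every pair.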
In another example, the method described in this invention can be used to reduce the number of possibilities which need to be evaluated when spatial or temporal dependencies are considered. Figure 9 shows an example of how the pixels associated with a block compressed with quantisation step Q1 can propagate into subsequent frames when inter coding is used. The figure depicts forward propagation of pixel values, but given the bi-directional prediction operated by modern video codecs, areas in previous frames can also be affected by the decisions taken on the block being compressed. As stated in the summary of the invention, the additional bits spent on this block can be compensated for by the fewer bits spent on subsequent frames. Therefore, in order to decide the best quantisation step to be used for this coding block, a rate - video quality relationship can be used where the video quality is that associated with the pixels in the block and the pixels in subsequent frames which are predicted from it.
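The combined quality term described above can be sketched as a weighted average. The linear weighting, the function name and the numeric values are all assumptions made for illustration; the description does not prescribe a particular combination rule.

```python
def effective_quality(block_vq, propagated_vqs, weights):
    """Combine a block's own video quality with the quality of the
    pixels in subsequent frames predicted from it. Each propagated
    region contributes in proportion to its (hypothetical) weight,
    e.g. the fraction of its pixels traced back to this block."""
    total = block_vq + sum(w * vq for w, vq in zip(weights, propagated_vqs))
    return total / (1.0 + sum(weights))

# Block quality 40.0; two dependent regions in later frames with
# qualities 36.0 and 32.0 and propagation weights 0.5 and 0.25.
vq = effective_quality(40.0, [36.0, 32.0], [0.5, 0.25])
```

The quantisation step for the block can then be chosen against this combined quality rather than the block's quality alone, crediting bits spent now against savings in the frames that inherit the prediction.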

Claims (18)

1. A method of video encoding comprising the steps of: applying a selected one of a set of predictors to provide prediction values; generating residual values by subtracting prediction values from the pixel values; optionally applying a frequency transformation to the residual or pixel values to form transform coefficient values; and applying a quantisation operation to the pixel, residual or transform coefficient values, the quantisation operation comprising division by a selected quantisation step and reduction to integer values; where, in a single pre-encoding pass: at least some blocks of a video region are encoded with either the quantisation step or the predictor varying between the blocks; a coding rate is measured for each block to provide a set of data points; and the set of data points is used to derive a parametrised function which is used to select the predictor and/or the quantisation step.
2. The method of claim 1 wherein said video region comprises an image; a content-based segmentation of an image, a compression performance-based segmentation of an image; or a motion linked group of regions from respective temporally spaced regions.
3. The method of claim 1 or claim 2 wherein the set of predictors includes at least one spatial predictor using pixels located in the spatial neighbourhood and/or at least one temporal predictor using pixels belonging to frames in the temporal neighbourhood.
4. The method of any one of claim 1 to claim 3, wherein the number of data points is substantially less than the number of blocks, preferably by at least a factor of 10.
5. A method of video encoding comprising the steps of: applying a quantisation operation to the pixel, residual or transform coefficient values, the quantisation operation comprising division by a selected quantisation step Qs and reduction to integer values; where, in a single pre-encoding pass, at least some blocks of a video region are encoded with the quantisation step varying between the blocks; a coding rate R is measured for each block to provide a set of data points (Ri, Qi); the set of data points is used to derive a parametrised function; and the function is used to select the quantisation step Qs given a selected rate Rs.
6. The method of claim 5, wherein the number of data points is substantially less than the number of blocks, preferably by at least a factor of 10.
7. The method of claim 5, wherein the number of data points is substantially less than the number of possible values of the quantisation step.
8. The method of any one of claim 5 to claim 7, wherein the parametrised function takes the form: R = f(Q) and is polynomial in Q.
9. The method of claim 8, wherein the parametrised function takes the form:
10. The method of any one of claim 5 to claim 9, wherein said video region comprises an image; a content-based segmentation of an image, a compression performance-based segmentation of an image; or a motion linked group of regions from respective temporally spaced regions.
11. A method of video encoding comprising the steps of: applying a selected one of a set of predictors to provide prediction values; generating residual values by subtracting prediction values from the pixel values; optionally applying a frequency transformation to the residual or pixel values to form transform coefficient values; and applying a quantisation operation to the pixel, residual or transform coefficient values, the quantisation operation comprising division by a selected quantisation step and reduction to integer values; where, in a single pre-encoding pass: at least some blocks of a video region are encoded with the predictor varying between the blocks; a coding rate R and a distortion D are measured for each block to provide a set of data points (Rj, Dj); and the set of data points is used to derive a parametrised function which is used to select the predictor and/or the quantisation step.
12. The method of claim 11, wherein the set of predictors includes at least one spatial predictor using pixels located in the spatial neighbourhood and/or at least one temporal predictor using pixels belonging to frames in the temporal neighbourhood.
13. The method of claim 11 or claim 12, wherein the number of data points is substantially less than the number of blocks, preferably by at least a factor of 10.
14. The method of any one of claim 11 to claim 13, wherein the parametrised function takes the form: R = f(1/D) and is preferably logarithmic in 1/D.
15. The method of claim 14, wherein the parametrised function takes the form: R = k*ln(v*v * 1/D) where v denotes the variance of the original video region.
16. The method of any one of claim 11 to claim 15, wherein said video region comprises an image; a content-based segmentation of an image; a compression performance-based segmentation of an image; or a motion linked group of regions from respective temporally spaced regions.
17. A video encoder configured to operate in accordance with any one of the preceding claims.
18. A computer program product containing instructions for programmable apparatus to implement a method in accordance with any one of claims 1 to 17.

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1603782.2A GB2547947A (en) 2016-03-04 2016-03-04 Video encoding


Publications (2)

Publication Number Publication Date
GB201603782D0 GB201603782D0 (en) 2016-04-20
GB2547947A true GB2547947A (en) 2017-09-06

Family

ID=55859021


Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0742674A2 (en) * 1995-05-08 1996-11-13 Kabushiki Kaisha Toshiba Video encoding method and system using a rate-quantizer model




Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)