WO2009130540A1 - Method for high definition video encoding/decoding suitable for real-time video streaming - Google Patents
Method for high definition video encoding/decoding suitable for real-time video streaming
- Publication number
- WO2009130540A1 (PCT/IB2008/051565)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- pixel
- pixels
- value
- compressed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/59—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/14—Coding unit complexity, e.g. amount of activity or edge presence estimation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/189—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding
- H04N19/196—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the adaptation method, adaptation tool or adaptation type used for the adaptive coding being specially adapted for the computation of encoding parameters, e.g. by averaging previously computed encoding parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
A method is described for the encoding and decoding of digital video files in order to facilitate the storage or transmission of said files in bandwidth-sensitive applications. The method is based on the detection of regions of contrast within the video data and involves the generation of two compressed data sets, which when recombined yield a high quality representation of the original data. One of the compressed data sets is produced from the original data which has been reduced in resolution, while the remaining compressed data set is produced from a set of data representing areas of contrast in the original data.
Description
METHOD FOR HIGH DEFINITION VIDEO ENCODING/DECODING SUITABLE FOR REAL-TIME VIDEO STREAMING
INTRODUCTION
The present invention relates to the domain of digital audio/video transmission or storage and in particular to the compression and decompression of an audio/video signal in order to minimise the amount of data to be transmitted or stored while allowing for the accurate reconstruction of the audio/video information contained within the transmitted or stored signal.
BACKGROUND OF THE INVENTION
The processing of digital video files has long been a subject of interest to those wishing to transmit, manipulate or otherwise process such files. For example, just one second of raw footage from a camcorder can represent between 32 MB (PAL) and 155 MB (HDTV) of digital video data. Clearly then, the amount of data required to represent a film lasting just several minutes easily reaches several gigabytes. The communication systems which are generally in use today, and most storage media, do not have the bandwidth required to cope with the constraints posed by today's high definition digital video. For this reason digital video data is generally compressed, thus allowing for easier transmission or more efficient storage of same. Video compression is achieved by exploiting redundancies in the original video data. The redundancies which are sought can be of two different forms: spatial and temporal redundancy or psycho-visual redundancy.
Spatial and temporal redundancy results from the fact that pixel values in an image are not independent: they are correlated with their neighbours - both in the spatial domain, i.e. from pixel to pixel within the same frame, and in the temporal domain, i.e. for the same equivalent pixel from one frame to the next. This means that to some extent, the value of a pixel may be predictable given the values of its neighbouring pixels. The analysis and subsequent treatment of video signals in this way can result in a type of encoding known as variable length encoding. This occurs when the quantization of the signal is such that regularly occurring events are mapped to short codes while rarer events are mapped to longer codes.
From the point of view of psycho-visual redundancy, video compression techniques involve the analysis of an input video stream with a view to discarding information which is deemed indiscernible to the human brain. This usually results in some form of filtering whereby small details, which are considered indiscernible to the human eye, are discarded or whereby the available colour palette is reduced where small changes in tone are not easily perceived.
There are four generally accepted methods for compression: discrete cosine transform (DCT), vector quantization (VQ), fractal compression (FC) and discrete wavelet transform (DWT). DCT and DWT encoding both involve the mathematical transformation of images into their frequency components. In DWT this process is performed on the entire image, resulting in a hierarchical representation of the image with each layer in the hierarchy representing a particular frequency band. In DCT on the other hand, the frequency analysis is carried out on samples of the image taken at regular intervals. Following the frequency analysis, the components which do not affect the image as perceived by the human eye are simply discarded. The well-known standards such as JPEG and MPEG (from the International Organization for Standardization and the International Electrotechnical Commission's Joint Photographic Experts Group and Moving Picture Experts Group, respectively) and H.261 and H.263 (both from the International Telecommunication Union) are based on this type of discrete cosine transform compression.
Vector quantization techniques look at an array of data rather than individual values, thus forming a general view of the subject. Redundant data is then rejected while at the same time retaining the desired object or the data stream's original intent. Fractal compression is a form of vector quantization: compression is achieved by locating repeating sections of an image and then using a fractal algorithm to generate those sections from the original.
The aim of the present invention is to overcome currently perceived limitations in transmission bandwidth as well as in the speed of current encoding algorithms by optimising a technique for encoding and decoding a video signal in order to allow for the delivery of high definition digital video of a very high quality and at a rate suitable for real-time streaming applications.
SUMMARY OF THE INVENTION
The present invention relates to a method for encoding a digital video data file comprising a plurality of frames, said method yielding a plurality of compressed data files and comprising the following steps (a schematic code sketch of the complete pipeline follows the list): - downsampling a first frame to obtain a first downsized frame having a lower resolution than the first frame;
- compressing the first downsized frame with a first lossy compression scheme to obtain a first compressed frame;
- uncompressing the first compressed frame to obtain a second downsized frame;
- upconverting the second downsized frame to obtain a second frame of the same size as the first frame;
- generating a smoothed frame from the second frame by dividing said second frame into groups of a predefined number of pixels, inspecting the values of said pixels, calculating an average per group of said pixel values and replacing the pixel values in each group by the average per group thus calculated;
- generating an outline frame by comparing the smoothed frame with the first frame and recording differences which are greater than a predetermined threshold;
- generating a second compressed frame from the outline frame using a second compression scheme;
- outputting the first compressed frame and the second compressed frame.
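By way of illustration, the following is a minimal sketch of this pipeline for a single 8-bit grayscale frame. It is a sketch under stated assumptions, not a reference implementation: even frame dimensions are assumed, JPEG (via the Pillow library) stands in for the unspecified first compression scheme, and the quality and threshold values are illustrative only.

```python
# Minimal sketch of the claimed encoding pipeline for one grayscale frame.
# JPEG stands in for the unspecified first compression scheme (CMP1);
# quality=50 and threshold=16 are illustrative values, not from the patent.
import io

import numpy as np
from PIL import Image

def box_average(img: np.ndarray, n: int) -> np.ndarray:
    """n x n moving average (n odd), used for the smoothing step (AVE)."""
    pad = n // 2
    padded = np.pad(img.astype(np.float32), pad, mode="edge")
    acc = sum(padded[i:i + img.shape[0], j:j + img.shape[1]]
              for i in range(n) for j in range(n))
    return (acc / (n * n)).astype(np.uint8)

def encode_frame(first_frame: np.ndarray, threshold: int = 16):
    # Step 1 (DN): downsample by 2 in each direction.
    downsized = first_frame[::2, ::2]
    # Step 2 (CMP1): lossy compression -> first compressed frame CF1.
    buf = io.BytesIO()
    Image.fromarray(np.ascontiguousarray(downsized)).save(
        buf, format="JPEG", quality=50)
    cf1 = buf.getvalue()
    # Step 3 (DECMP1): decompress CF1 -> second downsized frame.
    second_downsized = np.asarray(Image.open(io.BytesIO(cf1)))
    # Step 4 (UP): upconvert by the same factor -> second frame.
    second_frame = second_downsized.repeat(2, axis=0).repeat(2, axis=1)
    # Step 5 (AVE): smooth block boundaries with a 3x3 average.
    smoothed = box_average(second_frame, 3)
    # Step 6 (DIFF): keep only differences above the threshold -> outline.
    diff = first_frame.astype(np.int16) - smoothed.astype(np.int16)
    outline = np.where(np.abs(diff) > threshold, diff, 0)
    # Step 7 (CMP2) would compress `outline` to give CF2.
    return cf1, outline
```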
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will best be understood by reference to the following detailed description when read in conjunction with the accompanying drawings, wherein:
- FIG.1 shows a block diagram representing the encoding procedure used in one embodiment of the present invention;
- FIG.2 shows a block diagram of the decoding procedure used in one embodiment of the present invention;
- FIG.3 shows a test frame and an example of what is described in the present invention as a contour frame or outline frame extracted from said test frame;
- FIG.4 shows a chart which summarises the comparison of colour intensities (in this case grayscale) in corresponding pixels between two frames;
- FIG.5 shows an alternative representation of the comparison of colour intensities in corresponding pixels between two frames;
- FIG.6 illustrates an image processing technique which may be used in an embodiment of the present invention as part of the compression process;
- FIG.7 shows a block diagram of the encoding and decoding processes in an alternative embodiment of the current invention;
- FIG.8 shows a block diagram of the encoding and decoding processes in a further embodiment of the current invention.
DETAILED DESCRIPTION
In order to achieve the desired goals of speed of encoding, speed of transmission and quality of recovered image, the present invention makes use of a technique referred to as contour compression, which involves the accentuation of contrast regions in an image. This technique allows for the extraction of the most important details of an image, while neglecting the parts of the image which do not contribute additional information affecting the quality of the image with respect to the requirements of the human eye in assessing said image.
The encoding procedure is carried out on a frame-by-frame basis on the input video file, which is generally in some uncompressed format such as RAW or BMP. See FIG.1. The encoding technique involves the generation of a first compressed frame (CF1) and a second compressed frame (CF2), both of which are either stored or transmitted to a receiver (RX) for subsequent decoding.
First the resolution of the input video file (HDVID) is reduced. This reduction in resolution is done on a frame-by-frame basis wherein each original video frame or first frame is downsampled (DN) to give a downsized frame. By way of example, a picture whose x and y dimensions are reduced by a factor of two is referred to either as a 50% downsize or a 4x downsize, since there are four times fewer pixels in the resulting picture. The method used to achieve a 50% downsize, for example, could be simply the suppression of every second pixel in the horizontal direction and in the vertical direction. An alternative method could involve the grouping of pixels into groups of n pixels, calculating an average of the pixel intensities in each group and then replacing the n pixels by n/4 pixels, whose intensity is equal to the calculated average. The purpose of this downsizing is two-fold: to reduce the amount of data to be transmitted and to accentuate the regions of contrast in the picture during the generation of a so-called outline frame or contour frame. The downsized frame is compressed (CMP1) using a first compression scheme, which may be a standard scheme such as JPEG, MPEG, H.261 or H.263 or any other such lossy compression scheme or any lossless compression scheme, thus giving a first compressed frame (CF1), which is either stored or transmitted to the receiver (RX) for subsequent decoding.
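Both downsizing options just described can be sketched as follows; this assumes a grayscale frame with even dimensions, and the function names are illustrative rather than taken from the patent:

```python
import numpy as np

def downsample_decimate(frame: np.ndarray) -> np.ndarray:
    # Suppress every second pixel horizontally and vertically.
    return frame[::2, ::2]

def downsample_average(frame: np.ndarray) -> np.ndarray:
    # Replace each 2x2 group by one pixel holding the group's average.
    h, w = frame.shape
    blocks = frame.astype(np.float32).reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3)).astype(np.uint8)
```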
As well as being stored or transmitted to the receiver (RX), the first compressed frame (CF1) must, in the second part of the encoding procedure, be decompressed (DECMP1) using the same scheme that was used in the compression. The resulting decompressed frame, i.e. the second downsized frame, is then resized or upconverted (UP) by the same factor that was previously used in the downsizing process. This upconvert step produces the second frame. Upconverting can be achieved by interpolation or resampling. Any of the standard techniques can be used, such as nearest-neighbour or piecewise-constant interpolation or any of linear, bilinear, bicubic, polynomial, spline or fractal interpolation.
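The encoder sketch given earlier used nearest-neighbour upconversion (`repeat`); as one of the alternatives listed, a hand-rolled bilinear version might look like the following. This is illustrative only; the patent does not mandate a particular interpolation:

```python
import numpy as np

def upconvert_bilinear(small: np.ndarray, factor: int = 2) -> np.ndarray:
    # Bilinear interpolation onto a grid `factor` times larger.
    h, w = small.shape
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    img = small.astype(np.float32)
    top = img[np.ix_(y0, x0)] * (1 - fx) + img[np.ix_(y0, x1)] * fx
    bot = img[np.ix_(y1, x0)] * (1 - fx) + img[np.ix_(y1, x1)] * fx
    return ((1 - fy) * top + fy * bot).astype(np.uint8)
```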
A process of smoothing or averaging (AVE) is carried out to smooth out boundaries between blocks present in the second frame. The averaging is done over 3x3 pixels or 5x5 pixels in order to overlap block boundaries present in the upsized version, since said block boundaries, which are created as a result of the interpolation procedure, normally occur at even numbers of pixels. By choosing to do the averaging over groups of nxn pixels where n is an odd number, the smoothing of the abrupt changes at block boundaries, which are typically present in images that have been resized, is guaranteed.
The smoothed frame resulting from the smoothing process is compared (DIFF) to the first frame. The result of the comparison is an outline or contour frame (CONT). The outline frame is then compressed using a second compression scheme (CMP2), which may or may not be different from the first compression scheme, to give a second compressed frame (CF2). The second compressed frame is either stored or transmitted to the receiver to be used with the first compressed frame in the decoding procedure.
The object of the comparison process mentioned above is to produce a frame (CONT) containing outlines indicating areas of high contrast. FIG.3 illustrates a sample frame (ORIG) and one possible outline frame or contour frame (OUTLINE) associated with the sample frame. By way of example, consider a digitized image comprising an array of pixels. Each pixel has a colour attribute associated with it, said colour attribute being defined by one or a set of digital values. To simplify the example we limit the range of available colours to be those of the grayscale, in which case each pixel's "colour" or tone can therefore be described by a single digital value.
In the case of colour images, the pixel attributes comprise, for each colour contributing to the overall tone of the pixel, a digital value representing the intensity of that colour. In the comparison process, the value of each pixel's tone in the original frame (first frame) is compared to the value of the corresponding pixel's tone in the smoothed frame. If the difference is less than a given threshold value, then a default value is retained as the result of the comparison. In practice the default value is usually 0. Conversely, if the difference is greater than the given threshold, then the difference value is retained as the result of the comparison. In another embodiment of the present invention, instead of doing the comparison on a pixel-to-pixel basis, the comparison could be made between blocks of pixels, as sketched below.
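The block-of-pixels variant of the comparison might be sketched as follows. The 4x4 block size and the use of the block mean are assumptions, since the patent fixes neither:

```python
import numpy as np

def outline_blockwise(first: np.ndarray, smoothed: np.ndarray,
                      threshold: int = 16, block: int = 4) -> np.ndarray:
    # Compare tones block by block instead of pixel by pixel; blocks whose
    # mean difference stays under the threshold collapse to the default
    # value (0), which compresses extremely well.
    diff = first.astype(np.int16) - smoothed.astype(np.int16)
    h, w = diff.shape
    out = np.zeros_like(diff)
    # Only full blocks are visited; any edge remainder keeps the default 0.
    for y in range(0, h - h % block, block):
        for x in range(0, w - w % block, block):
            tile = diff[y:y + block, x:x + block]
            if abs(tile.mean()) > threshold:
                out[y:y + block, x:x + block] = tile
    return out
```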
The encoding technique used in the present invention ensures that the largest difference between the two frames being compared will occur at regions where the contrast is the highest, i.e. at regions of abrupt changes in tone. This is due to the effects resulting from the resizing, compression, decompression and smoothing processes. Therefore the resulting outline frame will only contain information relative to areas of the original frame which have high contrast or significant changes in tone. The level of sensitivity to tone changes is selected by choosing an appropriate threshold: the higher the threshold, the less detail remains in the outline frame; the lower the threshold, the more detail appears in the outline frame. FIG.4 illustrates all the possible differences between two frames whose grayscale tone is encoded onto eight bits. A diagonal line from the bottom left to the top right of the chart would represent a threshold of zero and indicates the zone where both frames are exactly equal. As the line is broadened, so the threshold is increased to encompass ever-increasing differences between the two frames. FIG.5 shows another representation of the threshold concept. If the threshold were c, then the range of differences subtended by the area c would all be ignored. Areas 1a/1b, 2a/2b, 3a/3b and 4a/4b represent different compression ratios. For example, if the chosen threshold covers areas 2, 3, 4 and c, then only areas of very high contrast would be stored in the outline frame, whereas if the chosen threshold covers areas c and 4, then areas of lesser contrast would also be stored in the outline frame.
The threshold can be dynamically modified from frame to frame. Pre-processing of the image is carried out in order to determine the distribution of contrast present in the original image. In this way a tradeoff can be reached on a frame-to-frame basis whereby the threshold is chosen such that only the necessary minimum amount of data is kept. Another type of pre-processing can also be carried out which takes the rate of change of the position of objects from one frame to the next into consideration. In this manner the threshold is modified depending on the speed of a moving object in the video. For a portion of video with fast-moving objects, the threshold could be set high, since the contrast from frame to frame at the regions of interest will be large. Conversely, for video with slow-moving subjects, the threshold could be lowered. This dynamic modification of the threshold allows for the optimization of the efficiency of the encoding.
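One plausible realization of this contrast-distribution pre-processing is to pick, per frame, the threshold that lets a target fraction of pixels through into the outline frame. The `keep_fraction` parameter is an assumed tuning knob, not something the patent specifies:

```python
import numpy as np

def pick_threshold(first: np.ndarray, smoothed: np.ndarray,
                   keep_fraction: float = 0.05) -> int:
    # Choose the threshold so that only roughly `keep_fraction` of the
    # pixels (those with the strongest contrast) survive into the outline.
    diff = np.abs(first.astype(np.int16) - smoothed.astype(np.int16))
    return int(np.percentile(diff, 100 * (1 - keep_fraction)))
```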
The resulting outline frame from this procedure allows for good dynamic image definition and represents a high compression rate due to the presence of many zeros in the resulting file.
The above discussion covers frame-by-frame compression. For more efficient compression, inter-frame compression techniques are used. Inter-frame compression can be readily realized based on the pre-processing of a selection of stored outline frames. In such a scheme, every fourth frame, for example, would be saved (or every fifth or sixth, etc.) and the outline frame extracted. The saved outline frames are used as a basis for calculating the missing frames.
The compression processes mentioned above can be of type JPEG, MPEG, H.261 or H.263 or any other such standard compression scheme or some proprietary scheme. The present invention also makes use of another scheme, which will be described using the following example. For simplicity, the example applies to a frame comprising a grayscale image, but it can be extrapolated to apply to full colour images. Again, consider an image comprising an array of pixels. Each pixel has a tone attribute associated with it, said tone attribute being defined by a digital value of n bits. See FIG.6. A copy (A1) of an original image (O) is made, wherein only the most significant bit of each pixel is retained while all remaining n-1 bits are set to zero. The image A1 then has a maximum of only two tones. The regions where changes in tone occur are detected and stored in an outline frame (C1). C1 is compressed using a standard lossless compression scheme (e.g. Huffman coding) and stored. The resulting compression ratio is very high due to the presence of many consecutive zeros. A further copy (A2) of the original image is made wherein the two most significant bits are retained while all remaining bits are set to zero. A2 therefore has a maximum of four tones. An outline frame (C2) is generated. An exclusive-OR between C1 and C2 is done, thus keeping only the data which is different between the two frames. The resulting key frame (K2) is compressed (e.g. Huffman) and stored. The process continues, using the three most significant bits to generate an outline frame C3, which is XOR-ed with C2 to give key frame K3, which is compressed (e.g. Huffman) and stored, and so on until all n bits have been taken into account. The final compression file to be transmitted comprises all of the compressed key frames, K2 to Kn, and the first compressed outline frame C1. This effectively gives a lossless compression. In practice it has been observed that, for the sake of more efficient compression, a reasonable quality of image can be achieved when the final compression file comprises a subset of those mentioned above. For example, the final compression file can contain just C1, K2 and K4 for a good quality image to be recovered.
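This bit-plane scheme can be sketched as below. The patent specifies neither the tone-change detector nor the entropy coder, so a simple neighbour-difference detector and `zlib` (whose DEFLATE format uses Huffman coding internally) are used here purely as stand-ins:

```python
# Sketch of the bit-plane outline/XOR scheme; detector and coder are
# assumed stand-ins, not the patent's mandated choices.
import zlib

import numpy as np

def tone_change_outline(img: np.ndarray) -> np.ndarray:
    # Mark pixels whose tone differs from the pixel above or to the left.
    edges = np.zeros(img.shape, dtype=np.uint8)
    edges[:, 1:] |= img[:, 1:] != img[:, :-1]
    edges[1:, :] |= img[1:, :] != img[:-1, :]
    return edges

def encode_bitplanes(original: np.ndarray, n_bits: int = 8) -> list[bytes]:
    outlines, files = [], []
    for i in range(1, n_bits + 1):
        # A_i: keep the i most significant bits, zero the rest.
        mask = (0xFF << (n_bits - i)) & 0xFF
        outlines.append(tone_change_outline(original & mask))
    files.append(zlib.compress(np.packbits(outlines[0]).tobytes()))  # C1
    for prev, cur in zip(outlines, outlines[1:]):                    # K2..Kn
        key = prev ^ cur  # exclusive-OR keeps only what changed
        files.append(zlib.compress(np.packbits(key).tobytes()))
    return files
```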
A practical method for achieving the result described above consists in treating the original image bit by bit, rather than pixel by pixel. Key files for each bit of an n-bit pixel can easily be extracted by copying each bit of the first n-bit pixel in the original frame to n corresponding key files, K1 to Kn, then inspecting each bit of the next pixel in the original frame and appending a 0 or a 1 to the bit's corresponding key file depending on the following condition: append a 0 whenever the next bit is the same as the previous bit; append a 1 whenever the next bit is different from the previous bit. The n key files are then compressed using a standard lossless compression scheme such as Huffman coding, thus creating a set of very compact compressed files due to the presence of many repeated zeros in each of the key files. By transmitting or storing all of the n compressed (e.g. Huffman) key files, lossless compression can be realized. For image compression, lossy compression is usually sufficient, wherein only a subset of the totality of the compressed (e.g. Huffman) key files is used. After decompressing (e.g. Huffman) the compressed key files, rebuilding the image is easily realized by generating a first pixel using the first bit of each of the key files (LSB from the first key file, MSB from the nth key file), thereby defining the colour or tone of the first pixel. The rest of the pixels in the frame are generated by analyzing each of the key files. For example, the MSB of the next pixel is determined by inspecting the next value in the Kn key file in order to determine whether the MSB of the next pixel should have the same value as the MSB in the preceding pixel or the opposite value. Similarly the K1 key file is used to determine the LSB values of the consecutive pixels, and so on.
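This bit-by-bit procedure amounts to transition coding along each bit plane: the key file holds a 1 wherever the bit flips between consecutive pixels, and a cumulative XOR undoes it on the decoding side. A sketch, again with `zlib` standing in for Huffman coding:

```python
import zlib

import numpy as np

def make_key_files(frame: np.ndarray, n_bits: int = 8) -> list[bytes]:
    # One key file per bit position: the first pixel's bit verbatim, then
    # a 1 wherever the bit flips from one pixel to the next, 0 otherwise.
    pixels = frame.ravel()
    keys = []
    for b in range(n_bits):  # b = 0 is the LSB, i.e. key file K1
        bits = (pixels >> b) & 1
        key = np.concatenate(([bits[0]], bits[1:] ^ bits[:-1]))
        keys.append(zlib.compress(np.packbits(key).tobytes()))
    return keys

def rebuild_frame(keys: list[bytes], shape: tuple[int, int]) -> np.ndarray:
    # Undo the transition coding: a cumulative XOR along each key file
    # recovers every pixel's bit from the first bit and the flip flags.
    n_pixels = shape[0] * shape[1]
    out = np.zeros(n_pixels, dtype=np.uint8)
    for b, blob in enumerate(keys):
        raw = np.frombuffer(zlib.decompress(blob), np.uint8)
        key = np.unpackbits(raw)[:n_pixels]
        bits = np.bitwise_xor.accumulate(key)
        out |= bits << b
    return out.reshape(shape)
```

Using only the key files for the most significant bit positions reproduces the lossy variant described above, while keeping all n files makes the round trip lossless.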
FIG.2 illustrates the decoding procedure using the first and second compressed frames (CF1, CF2) which were generated during the encoding procedure. The first compressed frame (CF1) is decompressed (DECMP1) using the same compression scheme as was used to create said frame and then resized or upconverted (UP) by the same factor that was used in the downsizing step of the encoding procedure, resulting in a resized frame. The resized frame is averaged (AVE), thus smoothing out any blocks which may be present. As in the averaging process carried out during the encoding procedure, the averaging is done over 3x3 pixels or 5x5 pixels. The second compressed frame (CF2) is also decompressed to give a second outline frame (CONT). The second outline frame and the averaged frame are combined to give a decompressed final frame (FIN) representative of the original frame.
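Mirroring the encoder sketch given earlier (and reusing its `box_average` helper), the decoding procedure might be sketched as follows; JPEG again stands in for the unspecified first compression scheme, and `outline` is assumed to be the already-decompressed CONT frame of signed differences:

```python
import io

import numpy as np
from PIL import Image

def decode_frame(cf1: bytes, outline: np.ndarray) -> np.ndarray:
    small = np.asarray(Image.open(io.BytesIO(cf1)))      # DECMP1
    resized = small.repeat(2, axis=0).repeat(2, axis=1)  # UP, same factor
    averaged = box_average(resized, 3)                   # AVE, 3x3 pixels
    # Combine: add the outline differences back onto the averaged frame.
    final = averaged.astype(np.int16) + outline.astype(np.int16)
    return np.clip(final, 0, 255).astype(np.uint8)       # FIN
```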
In another embodiment of the present invention the outline frame in the encoding part of the process is generated by comparing the downsized frame with the second downsized frame, as shown in FIG.7. One advantage of this embodiment is that there are fewer steps in the encoding, resulting in a faster encoding process. A further advantage is that the second compressed frame is smaller, resulting in the transmission or storage of a smaller file. In this embodiment, in order to be able to combine files of the same resolution on the decoder side, it is necessary to upsize the decompressed second compressed frame before combining the resulting second outline frame with the smoothed frame.
In yet another embodiment of the current invention the outline frame in the encoding part of the process is generated by comparing the first frame and the second frame. See FIG.8. This allows for a slightly shorter encoding process and it does not necessitate any upsizing of the decompressed second compressed frame on the decoding side.
In both of the above embodiments, in order to obtain satisfactory results, it is necessary to sharpen the resulting final frame by adding contrast to said frame.
Claims
1. A method for encoding digital video data describing pixel values for at least one colour, organized in a plurality of frames, said method yielding a plurality of compressed data files and comprising the following steps:
- downsampling a first frame to obtain a first downsized frame having a lower resolution than the first frame;
- compressing the first downsized frame with a first lossy compression scheme to obtain a first compressed frame;
- uncompressing the first compressed frame to obtain a second downsized frame,
- upconverting the second downsized frame to obtain a second frame of the same size as the first frame;
- generating a smoothed frame from the second frame by dividing said second frame into groups of a predefined number of pixels and replacing the pixel values in each group by the average pixel value per group,
- generating an outline frame by comparing each pixel value within the first frame with each corresponding pixel value within the smoothed frame, recording a default value wherever the difference between said pixel values is less than a predetermined threshold and recording the difference value wherever the difference between said pixel values is greater than the predetermined threshold;
- generating a second compressed frame from the outline frame using a second compression scheme;
- outputting the first compressed frame and the second compressed frame.
2. The method of claim 1 wherein the downsampling of a frame is achieved by dividing the totality of the pixels in said frame into smaller groups of pixels and extracting one pixel per group to constitute the downsampled frame.
3. The method of claim 1 wherein the downsampling of a frame is achieved by dividing the totality of the pixels in said frame into smaller groups of pixels, calculating the average pixel value for each group and replacing all of the pixels in said group by one pixel whose value is equal to said average pixel value.
4. The method of any of claims 1 through 3 wherein the upconverting is achieved by interpolation.
5. The method of any of claims 1 through 4 wherein the predetermined threshold is calculated as a function of the contents of a frame of the digital video data and is recalculated at least once throughout the duration of the video.
6. The method of any of claims 1 through 5 wherein the predefined number of pixels in the groups considered during the smoothing operation is defined such that it is not equal to the downsampling factor.
7. The method of any of claims 1 through 6 wherein the first compression scheme is a lossless compression scheme.
8. The method of any of claims 1 through 7 wherein any of the first or the second compression schemes used comprises the following steps:
- designating one of the pixels in a frame to be a first pixel;
- organizing the totality of the pixels in said frame in a series such that all pixels are accounted for;
- creating n files, each one of which corresponds to one bit position in a pixel;
- recording the value of each bit of the first pixel in its corresponding file;
- appending each file with a value of 0 or 1 depending on whether the value of the corresponding bit of the next pixel in the series is equal to the value of the corresponding bit of the previous pixel or the value of the corresponding bit of the next pixel in the series is not equal to the value of the corresponding bit of the previous pixel;
- performing a compression on the n files according to a lossless compression scheme;
- outputting a plurality of said compressed files.
9. A method for decoding digital video data from a plurality of compressed data files, said method yielding video data comprising values describing pixel values organized in a plurality of frames, said method comprising the following steps:
- receiving a first compressed data file;
- uncompressing the first compressed data file to give a downsized frame;
- upconverting the downsized frame to give an upsized frame;
- generating a smoothed frame from the upsized frame by dividing said upsized frame into blocks of a predefined number of pixels, inspecting the individual values of said pixels, calculating an average per block of said pixel values and replacing the pixel values in each block by the average per block thus calculated;
- receiving a second compressed data file;
- uncompressing the second compressed data file to give a second contour frame;
- combining the smoothed frame and the second contour frame to give the decoded digital video frame.
10. The method of claim 9 wherein the upconverting is achieved by interpolation.
11. The method of either of claims 9 or 10 wherein the predefined number of pixels in the groups considered during the smoothing operation is defined such that it is not equal to the downsampling factor.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2008/051565 WO2009130540A1 (en) | 2008-04-23 | 2008-04-23 | Method for high definition video encoding/decoding suitable for real-time video streaming |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2008/051565 WO2009130540A1 (en) | 2008-04-23 | 2008-04-23 | Method for high definition video encoding/decoding suitable for real-time video streaming |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2009130540A1 (en) | 2009-10-29 |
Family
ID=40342573
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2008/051565 (Ceased) | WO2009130540A1 (en) | 2008-04-23 | 2008-04-23 |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2009130540A1 (en) |
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2005057933A1 (en) * | 2003-12-08 | 2005-06-23 | Koninklijke Philips Electronics N.V. | Spatial scalable compression scheme with a dead zone |
Non-Patent Citations (2)
| Title |
|---|
| ELMER P M ED - CHIARIGLIONE L: "THE DESIGN OF A HIGH BIT RATE HDTV CODEC", SIGNAL PROCESSING OF HDTV, 2. TURIN, AUG. 30 - SEPT. 1, 1989; [PROCEEDINGS OF THE INTERNATIONAL WORKSHOP ON HDTV], AMSTERDAM, ELSEVIER, NL, vol. WORKSHOP 3, 30 August 1989 (1989-08-30), pages 619 - 631, XP000215280 * |
| RABBANI M ET AL: "An overview of the JPEG 2000 still image compression standard", SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 17, no. 1, 1 January 2002 (2002-01-01), pages 3 - 48, XP004326797, ISSN: 0923-5965 * |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP3510778A1 (en) * | 2016-09-08 | 2019-07-17 | V-Nova International Ltd | Video compression using differences between a higher and a lower layer |
| US10992943B2 (en) | 2016-09-08 | 2021-04-27 | V-Nova International Limited | Data processing apparatuses, methods, computer programs and computer-readable media |
| US12034943B2 (en) | 2016-09-08 | 2024-07-09 | V-Nova International Limited | Data processing apparatuses, methods, computer programs and computer-readable media |
| EP3510778B1 (en) * | 2016-09-08 | 2025-07-09 | V-Nova International Ltd | Video compression using differences between a higher and a lower layer |
| EP4607925A3 (en) * | 2016-09-08 | 2025-10-29 | V-Nova International Limited | Video compression using differences between a higher and a lower layer |
| CN112714323A (en) * | 2020-12-25 | 2021-04-27 | 人和未来生物科技(长沙)有限公司 | Medical image compression method and decoding method |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US7656561B2 (en) | Image compression for rapid high-quality imaging | |
| US8416847B2 (en) | Separate plane compression using plurality of compression methods including ZLN and ZLD methods | |
| US7991052B2 (en) | Variable general purpose compression for video images (ZLN) | |
| US8537898B2 (en) | Compression with doppler enhancement | |
| US6912318B2 (en) | Method and system for compressing motion image information | |
| US20030103680A1 (en) | Block boundary artifact reduction for block-based image compression | |
| JPH11513205A (en) | Video coding device | |
| EP0482180A1 (en) | Block adaptive linear predictive coding with adaptive gain and bias. | |
| CN116762338A (en) | Use preprocessed video encoding | |
| US5831677A (en) | Comparison of binary coded representations of images for compression | |
| CN110896483A (en) | Method for compressing and decompressing image data | |
| WO2007040765A1 (en) | Content adaptive noise reduction filtering for image signals | |
| JP2003531553A (en) | Efficient video data access using fixed compression ratio | |
| US6631161B1 (en) | Method and system for compressing motion image information | |
| EP1769459B1 (en) | Image compression for rapid high-quality imaging | |
| AU2002230101A2 (en) | Moving picture information compressing method and its system | |
| WO2009130540A1 (en) | Method for high definition video encoding/decoding suitable for real-time video streaming | |
| JP3627291B2 (en) | Block distortion removing apparatus and method | |
| JP2901656B2 (en) | Image coding device | |
| EP1170956A2 (en) | Method and system for compressing motion image information | |
| JP2000165873A (en) | Moving image information compression method and system | |
| JP3958033B2 (en) | Method and system for compressing moving picture information | |
| Kaur et al. | A robust video compression algorithm for efficient data transfer of surveillance data | |
| JP2891251B2 (en) | Image encoding device and image decoding device | |
| JPH06165111A (en) | Image compressing/extending device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08737969; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 08737969; Country of ref document: EP; Kind code of ref document: A1 |