
WO2019077197A1 - A method and an apparatus and a computer program product for video encoding and decoding - Google Patents

A method and an apparatus and a computer program product for video encoding and decoding

Info

Publication number
WO2019077197A1
Authority
WO
WIPO (PCT)
Prior art keywords
motion vector
block
model
prediction
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/FI2018/050724
Other languages
French (fr)
Inventor
Alireza Aminlou
Ramin GHAZNAVI YOUVALARI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2019077197A1 publication Critical patent/WO2019077197A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/137 Motion inside a coding unit, e.g. average field, frame or block difference
    • H04N19/139 Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167 Position within a video image, e.g. region of interest [ROI]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/513 Processing of motion vectors
    • H04N19/517 Processing of motion vectors by encoding
    • H04N19/52 Processing of motion vectors by encoding by predictive encoding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/56 Motion estimation with initialisation of the vector search, e.g. estimating a good candidate to initiate a search
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability

Definitions

  • the present solution generally relates to video encoding and decoding. This section is intended to provide a background or context to the invention that is recited in the claims.
  • a video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • the encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
  • a method for video coding comprising obtaining motion information comprising at least one motion vector and a location of at least one neighboring block of a video frame; determining parameters of a model using the obtained motion information; and determining a predicted motion vector using the model and a location of a current block.
  • an apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: obtain motion information comprising a motion vector and locations of neighboring blocks of a video frame; determine parameters of a model using the obtained motion information; and determine a predicted motion vector using the model and a location of a current block.
  • a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to obtain motion information comprising a motion vector and locations of neighboring blocks of a video frame; determine parameters of a model using the obtained motion information; and determine a predicted motion vector using the model and a location of a current block.
  • the model is a function that relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix.
  • the parameters comprise coefficients of the weight and the offset matrix of said model.
  • the predicted motion vector is added to a candidate list. According to an embodiment, the predicted motion vector is used as an initial search point for motion estimation.
  • the candidate list is an advanced motion vector predictor list.
  • the candidate list is a merge list.
  • the neighboring blocks are on the same layer as the current block.
  • at least some of the neighboring blocks are on different layer(s) or view(s) than the current block.
  • the method is executed at an encoder and/or in a decoder.
  • Fig. 1 shows an encoder according to an embodiment
  • Fig. 2 shows a decoder according to an embodiment
  • Fig. 3a,3b show examples of motion vector candidate positions
  • Fig. 4 is a flowchart illustrating a method according to an embodiment
  • Fig. 5 shows an example of a block to be encoded/decoded
  • Fig. 6 shows an example of locally adaptive motion vector prediction for multilayer (scalable) prediction
  • Fig. 7 shows an example of locally adaptive motion vector prediction for multiview prediction
  • Fig. 8 shows an apparatus according to an embodiment in a simplified block chart
  • Fig. 9 shows a layout of an apparatus according to an embodiment.
  • many hybrid video codecs, including H.264/AVC and HEVC, encode video information in two phases.
  • predictive coding is applied for example as so-called sample prediction and/or so-called syntax prediction.
  • the second phase is coding the error between the predicted block of pixels or samples and the original block of pixels or samples.
  • pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:
  • Motion compensation mechanisms (which may also be referred to as temporal prediction or motion compensated temporal prediction or motion-compensated prediction or MCP), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded;
  • syntax prediction relating to the first phase which may also be referred to as parameter prediction
  • syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier.
  • motion vectors are coded e.g. for inter and/or inter-view prediction.
  • the block partitioning e.g. from CTU (Coding Tree Unit) to CUs (Coding Unit) and down to PUs (Prediction Unit), may be predicted.
  • CTU Coding Tree Unit
  • CUs Coding Unit
  • PU Prediction Unit
  • the filtering parameters e.g. for sample adaptive offset may be predicted.
  • a video codec comprises an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form.
  • An image codec or a picture codec is similar to a video codec, but it encodes each input picture independently from other input pictures and decodes each coded picture independently from other coded pictures. It needs to be understood that whenever a video codec, video encoding or encoder, or video decoder or decoding is referred below, the text similarly applies to an image codec, image encoding or encoder, or image decoder or decoding, respectively.
  • Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation.
  • Prediction approaches using image information within the same image can also be called intra prediction methods.
  • the second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples.
  • This may be accomplished by transforming the difference in pixel or sample values using a specified transform.
  • This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof.
  • DCT Discrete Cosine Transform
  • the transformed difference may be quantized and entropy coded.
  • the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
  • the decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
  • after applying pixel or sample prediction and error decoding processes, the decoder combines the prediction and the prediction error signals (the pixel or sample values) to form the output video frame.
  • the decoder may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming pictures in the video sequence.
  • Figure 1 illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
  • An example of a decoding process is illustrated in Figure 2.
  • Figure 2 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
  • a picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture.
  • the source and decoded pictures are each comprised of one or more sample arrays, such as one of the following sets of sample arrays:
  • GBR Green, Blue and Red (also known as RGB)
  • the term pixel may refer to the set of spatially collocated samples of the sample arrays of the color components. Sometimes, depending on the context, the term pixel may refer to a sample of one sample array only.
  • a picture may either be a frame or a field, while in some coding systems a picture may be constrained to be a frame.
  • a frame comprises a matrix of luma samples and possibly the corresponding chroma samples.
  • a field is a set of alternate sample rows of a frame and may be used as encoder input, when the source signal is interlaced.
  • motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder) or decoded (at the decoder) relative to the prediction source block in one of the previously coded or decoded pictures.
  • H.264/AVC and HEVC divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction.
  • the location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded.
  • H.264/AVC and HEVC include a concept of picture order count (POC).
  • a value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures.
  • POC may be used in the decoding process, for example for implicit scaling of motion vectors in the temporal direct mode, for implicitly derived weights in weighted prediction, and for reference picture list initialization. Furthermore, POC may be used in the verification of output order conformance.
  • the inter prediction process may use one or more of the following factors.
  • motion vectors may be of quarter-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter.
  • FIR finite impulse response
  • the accuracy of the motion vector, motion vector prediction and motion vector difference may be different for each block, and may vary between blocks.
  • H.264/AVC and HEVC allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.
  • reference pictures for inter prediction.
  • the sources of inter prediction are previously decoded pictures.
  • Many coding standards, including H.264/AVC and HEVC, enable storage of multiple reference pictures for inter prediction and selection of the used reference picture on a block basis. For example, reference pictures may be selected on macroblock or macroblock partition basis in
  • H.264/AVC and on PU or CU basis in HEVC.
  • Many coding standards, such as H.264/AVC and HEVC, include syntax structures in the bitstream that enable decoders to create one or more reference picture lists.
  • a reference picture index to a reference picture list may be used to indicate which one of the multiple reference pictures is used for inter prediction for a particular block.
  • a reference picture index may be coded by an encoder into the bitstream in some inter coding modes or it may be derived (by an encoder and a decoder) for example using neighboring blocks in some other inter coding modes.
  • motion vectors may be coded differentially with respect to a block- specific predicted motion vector.
  • the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks.
  • Another way to create a motion vector prediction is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures, or co-located blocks in other layers or views, and to signal the chosen candidate as the motion vector predictor.
  • the motion vector candidate positions (MVP) are shown by means of an example in Figures 3a-b.
  • in Figures 3a-b, black dots indicate sample positions directly adjacent to a block X, defining positions of possible MVPs.
  • Figure 3a illustrates spatial MVPs positions
  • Figure 3b illustrates temporal MVP (TMVP) positions, where Y is a collocated block of X in a reference picture. Positions C0 and C1 are candidates for the TMVP.
  • the reference index values can be predicted or obtained from previously coded/decoded blocks and pictures. The reference index may be predicted e.g. from adjacent blocks and/or co-located blocks in temporal reference pictures. Differential coding of motion vectors may be disabled across slice or tile boundaries.
  • all the motion field information, which includes motion vector(s) and corresponding reference picture index(es) for each available reference picture list, is predicted and used without any modification/correction.
  • this is performed by generating a merge list which includes some candidates from the motion information of the neighboring blocks. Then the index to the proper candidate is indicated in the bitstream.
  • Each candidate in the merge list may include the prediction type (uni- prediction or bi-prediction), reference indices and motion vectors.
  • H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices.
  • the Overlapped Block Motion Compensation (OBMC) method may use prediction from more reference frames.
  • Individual blocks in B slices may be bi-predicted, uni-predicted, or intra-predicted, and individual blocks in P slices may be uni-predicted or intra-predicted.
  • the reference pictures for a bi-predictive picture may not be limited to be the subsequent picture and the previous picture in output order, but rather any reference pictures may be used.
  • in many coding standards, such as H.264/AVC and HEVC, one reference picture list, referred to as reference picture list 0, is constructed for P slices, and two reference picture lists, list 0 and list 1, are constructed for B slices.
  • for B slices, prediction in the forward direction may refer to prediction from a reference picture in reference picture list 0, and prediction in the backward direction may refer to prediction from a reference picture in reference picture list 1, even though the reference pictures used for prediction may have any decoding or output order relation to each other or to the current picture.
  • H.264/AVC allows weighted prediction for both P and B slices.
  • in implicit weighted prediction, the weights are proportional to picture order counts (POC), while in explicit weighted prediction, prediction weights are explicitly indicated.
  • the motion vectors may be coded using the spatial/temporal candidates. However, this is not efficient for some types of video content, for the following reasons:
  • the video may include object/scene deformations due to the image format, and different sampling in various regions, when representing 360-degree video in a 2D representation (e.g. equirectangular or cubemap projection).
  • the video content may have zooming or rotation, caused by an object or a camera.
  • the video content may be deformed due to the characteristics of the capturing device (e.g. fisheye lenses).
  • the magnitude and direction of motion vectors change gradually with the location of the block.
  • intra prediction may be chosen by the encoder instead of inter prediction, which results in a higher bitrate compared to inter prediction.
  • the problem has been treated by using motion vectors from neighboring blocks for predicting the motion vector of the current block using a linear function.
  • the parameters of the linear function are fixed, and are calculated using a training process on several training test sequences.
  • such a solution considers only the motion vector values for the modeling process, with some fixed parameters, and hence is not able to achieve a proper model for the motion.
  • the problem is solved by modelling the motion vector locally based on the location (e.g. a center point or top-left corner) of the blocks in regions of a video frame.
  • the parameters of the model are calculated based on the motion vectors and locations of at least one neighboring block comprising a set of motion vectors. Then a predicted motion vector is calculated using the model and the location of the current block.
  • This predicted motion vector is added to a candidate list, which can be an advanced motion vector prediction list or merge list, as one of the candidates which can be selected by the encoder.
  • the process may be executed at an encoder side or a decoder side when generating the candidate list.
  • a method according to an embodiment is shown as a flowchart in Figure 4. The method comprises obtaining motion information comprising at least one motion vector and a location of at least one neighboring block of a video frame (410); determining parameters of a model using the obtained motion information (420); and determining a predicted motion vector using the model and a location of a current block (430).
  • the method illustrated in Figure 4 can be executed in an encoder or a decoder. When the method is executed at the encoder, the method further comprises adding the predicted motion vector to a candidate list (440).
  • the method of Figure 4 may be applied to single-layer coding, where all the neighboring blocks are on the same layer, or to multilayer or multiview coding, where neighboring blocks are on the same layer/view but may also be on other layers/views.
  • the model referred to in the method of the flowchart of Figure 4 may be defined as follows. It is assumed that the motion information of different blocks and/or regions of a video frame is modelled with a specific function. This function relates a motion vector (x and y components) of a block to its location (e.g. the x and y coordinates of the center of the block). Different models can be used for this purpose. In one sense, the model can be cross-component, i.e. the x and y components of the motion vector are related to both the x and y coordinates of the block location. Alternatively, the model can be limited to each component, i.e. the x and y components of motion vectors are only related to the x and y coordinates of the block location, respectively, or vice versa (i.e., x to y, and y to x).
  • the model can be linear, polynomial, or any other general function.
  • the model can be used to model the motion only in a small region of the video, so a linear model should work efficiently for this purpose.
  • MVx and MVy are the x and y components of the motion vector of a block
  • X and Y are the location (e.g. center coordinates) of the block
  • f(.) can be any function, such as a sinusoidal, exponential or logarithmic function
  • X0, Y0, MV0x, MV0y are fixed values that, for example, may be calculated based on the motion vectors and locations of neighboring blocks
  • a, b, c are parameters that are obtained through a training process using neighboring blocks.
  • the function relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix.
  • the parameters used in the model comprise coefficients of the weight and the offset matrix of the model. It is appreciated that the models presented here are examples, and any other model can be used instead; one plausible form is sketched below.
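The patent's own equations did not survive extraction, so the following is a hedged reconstruction of one plausible cross-component instance, built only from the symbol definitions above; the coefficient names a11…a22 and the matrix arrangement are assumptions, not the patent's literal formulas:

```latex
% A plausible cross-component form: the motion vector is an affine
% function of the block location, anchored at (X0, Y0) with offset
% (MV0x, MV0y); f(.) is the identity in the purely linear case.
\begin{aligned}
MV_x &= a_{11}\, f(X - X_0) + a_{12}\, f(Y - Y_0) + MV_{0x} \\
MV_y &= a_{21}\, f(X - X_0) + a_{22}\, f(Y - Y_0) + MV_{0y}
\end{aligned}
```

Under this reading, the component-limited variant described above corresponds to forcing the off-diagonal coefficients a12 and a21 to zero (or, for the "vice versa" case, the diagonal ones).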
  • Each model has several parameters that are calculated for each block using the motion information (i.e. a set of motion vectors from each block and the location of each block) of at least one neighboring block.
  • a neighboring block can be one of the following: a block on top of the current block, a block on the left of the current block, a block on the top-left of the current block, a block on the bottom-left of the current block, or a block on the top-right of the current block.
  • motion vectors and locations (e.g. center points) of said at least one neighboring block are collected, and a training process (e.g. linear regression, polynomial regression, logistic regression, RANSAC (Random Sample Consensus), etc.) may be used to calculate the parameters, as in the sketch below.
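As an illustration of such a training process, here is a minimal least-squares sketch in Python/NumPy. It assumes the cross-component linear model reconstructed above; the function names, the choice of the mean center as anchor point, and the minimum-sample guard are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def fit_local_mv_model(centers, mvs):
    """Fit MV = W @ (p - p0) + mv0 from neighboring blocks.

    centers: (N, 2) array of block-center coordinates (X, Y)
    mvs:     (N, 2) array of motion vectors (MVx, MVy)
    Returns (W, mv0, p0), or None when there are too few samples.
    """
    centers = np.asarray(centers, dtype=float)
    mvs = np.asarray(mvs, dtype=float)
    if len(centers) < 3:           # fewer samples than unknowns per component
        return None
    p0 = centers.mean(axis=0)      # local anchor point (an assumption)
    # Design matrix [X - X0, Y - Y0, 1]; lstsq solves both MV components at once.
    A = np.hstack([centers - p0, np.ones((len(centers), 1))])
    coeff, *_ = np.linalg.lstsq(A, mvs, rcond=None)   # shape (3, 2)
    W = coeff[:2].T                # 2x2 weight matrix
    mv0 = coeff[2]                 # offset (MV0x, MV0y)
    return W, mv0, p0

def predict_mv(model, center):
    """Evaluate the fitted model at the current block's center."""
    W, mv0, p0 = model
    return W @ (np.asarray(center, dtype=float) - p0) + mv0
```

When the number of training values exactly matches the number of parameters, the same call solves the "system of equations" case mentioned later; with more values, it minimizes the mean square error.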
  • A neighboring region refers to blocks located at a certain distance from the current block, where the distance may be defined as two or more blocks in a certain direction, e.g. up or left.
  • the size of the neighboring region may be selected based on the size of the current block. For example, the blocks in the top, left, top-left, bottom-left and top-right regions of the current block may be used to train the model.
  • the size of the neighboring block may be considered in the training process. For example, the information of the larger blocks may have more influence on the model's parameters. In the HEVC standard, for example, motion information may be stored at 4x4 block accuracy.
  • the motion vector of a block that is larger than 4x4 may then be counted several times in the training process.
  • in this way, the influence of the motion vector of each block on the model extraction becomes proportional to the size of that block, as in the sketch below.
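A hedged sketch of this size weighting, reusing fit_local_mv_model from the previous sketch; replicating one sample per covered 4x4 unit is one possible reading of the text, not the patent's stated procedure:

```python
def weight_by_block_size(centers, mvs, sizes, unit=4):
    """Replicate each training sample proportionally to its block area,
    so larger blocks pull the fitted parameters harder."""
    wc, wm = [], []
    for c, mv, (w, h) in zip(centers, mvs, sizes):
        repeats = max(1, (w // unit) * (h // unit))  # number of 4x4 units
        wc.extend([c] * repeats)
        wm.extend([mv] * repeats)
    return wc, wm

# model = fit_local_mv_model(*weight_by_block_size(centers, mvs, sizes))
```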
  • the parameters of the model are calculated based on the motion vectors and locations of at least one neighboring block comprising a set of motion vectors. The more neighboring blocks there are, the more accurate the predicted motion vector will be.
  • Each model that is used in the present solution has several parameters that should be calculated based on the neighboring block(s). If there are enough motion vectors in the neighboring blocks (i.e. more than the number of the model's parameters), a fitting can be performed, e.g. by minimizing the mean square error. On the other hand, if there are exactly as many motion vectors in the neighboring blocks as there are parameters, a system of equations can be built, and the exact values of the parameters can be calculated.
  • each motion vector has two components, for the x and y directions, which count as two values for training. So, for example, if a neighboring block is coded in uni-prediction mode, it has one motion vector and hence two values for the training, and if a neighboring block is coded in bi-prediction mode, it has two motion vectors and hence four values for the training process. Other information of the neighboring block (e.g. whether or not it is coded using adaptive MVP mode, or its affine parameters) may be used in the training process.
  • the calculated motion vector may be quantized to the precision of the motion vector or motion vector difference, as in the sketch below. Alternatively, the predicted motion vector may be kept at a higher precision.
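A minimal sketch of that quantization step; the 1/16-pel storage precision and quarter-pel target below are assumed purely for illustration:

```python
def quantize_mv(mv, storage_bits=4, target_bits=2):
    """Round a motion vector from 1/2**storage_bits-pel precision down to
    1/2**target_bits-pel (e.g. 1/16-pel -> quarter-pel), with rounding."""
    shift = storage_bits - target_bits
    offset = 1 << (shift - 1)
    return [((int(v) + offset) >> shift) << shift for v in mv]

print(quantize_mv([37, -22]))  # -> [36, -20], still in 1/16-pel units
```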
  • some of the neighboring blocks may relate to a different object than the one to which the current block belongs. Therefore, the blocks whose motion information is not in harmony with the current block or the other neighboring blocks can be removed. This outlier removal can be done, for example, by running the training process twice, as in the sketch below. Alternatively, other simple methods can be used to eliminate outliers, for example by classifying the neighboring blocks into, say, two classes and extracting a separate model and motion vector predictor for each class.
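A hedged sketch of the fit-twice variant, reusing fit_local_mv_model and predict_mv from the earlier sketch; the keep ratio and residual ranking are assumptions, since the patent does not specify how disagreeing blocks are detected:

```python
import numpy as np

def fit_without_outliers(centers, mvs, keep_ratio=0.75):
    """First fit: rank neighbors by residual against the model.
    Second fit: refit on the best-matching subset only."""
    model = fit_local_mv_model(centers, mvs)
    if model is None:
        return None
    pred = np.array([predict_mv(model, c) for c in centers])
    resid = np.linalg.norm(pred - np.asarray(mvs, dtype=float), axis=1)
    keep = np.argsort(resid)[: max(3, int(keep_ratio * len(resid)))]
    return fit_local_mv_model(np.asarray(centers)[keep],
                              np.asarray(mvs)[keep])
```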
  • each P-frame may have several reference frames, so each block may be predicted from different reference frames. Because of the motion in the temporal domain, a block may have different motion vectors for different reference frames. This issue therefore needs to be carefully considered in the training phase and in motion vector prediction.
  • in the AMVP list, the motion vector of a block is predicted for a given reference frame (with a specific POC).
  • in the merge list, the motion vector may be calculated for one reference frame (as in uni-prediction) or for two or more reference frames (e.g. bi-prediction or OBMC).
  • if the neighboring blocks are predicted from different reference frame(s) than the current reference frame, their motion information should either be eliminated from the training process or be scaled according to the POC numbers.
  • priority may be given to the reference frame that is used most among the neighboring blocks.
  • the proposed process may be executed twice, once for each reference frame, and in each execution the training process may be run using the related neighboring motion information.
  • the motion vector may be calculated by scaling the motion vectors using the POC values of the current frame and the reference frames, as in the sketch below. This approach can reduce the computational complexity of the method.
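A minimal sketch of POC-based scaling. The patent does not spell out the formula; the temporal-distance ratio below follows common practice in HEVC-style codecs and is therefore an assumption:

```python
def scale_mv_by_poc(mv, poc_cur, poc_ref_neighbor, poc_ref_target):
    """Scale a neighbor's MV so it spans the same temporal distance as
    the current block's target reference frame."""
    d_neighbor = poc_cur - poc_ref_neighbor
    d_target = poc_cur - poc_ref_target
    if d_neighbor == 0:
        return list(mv)             # degenerate case: nothing to scale
    s = d_target / d_neighbor
    return [round(v * s) for v in mv]

print(scale_mv_by_poc([8, -4], poc_cur=16, poc_ref_neighbor=12,
                      poc_ref_target=8))   # -> [16, -8]
```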
  • the neighboring motion vectors above come from the same layer and/or view. It is, however, also possible, particularly in the case of multilayer coding, that neighboring motion vectors come from other layers and views as well. For example, motion information of the blocks on the right and bottom sides of the current block, which are not yet coded, or of the blocks that are coded in intra mode, may come from previous layer(s) (after proper scaling) or from other view(s). In such a case, two candidates may be added to the candidate list. One of the candidates may be calculated based on the neighboring blocks which are predicted from the same reference frame, and the other one may be calculated from the neighboring blocks of the co-located block in the other layer(s) or view(s).
  • Figures 6 and 7 illustrate the collecting of motion vector information in the case of multilayer (scalable) and multiview prediction, respectively.
  • the additional motion vector information collected from other layer(s) or view(s) is not limited to the illustrated cases, but can include any candidates that are not available in the current frame.
  • the reference layer 610 could be at the same or a different resolution than the current layer 600.
  • the quality can also differ from that of the current layer 600 video.
  • two-step scaling could be applied as follows: scaling the motion vectors according to the POC number, and then scaling the motion vectors according to the resolution of the reference layer 610, as in the sketch below.
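A hedged sketch of the two-step scaling, reusing scale_mv_by_poc from the earlier sketch; scaling by the simple width/height ratio is an assumption about how the resolution difference is compensated:

```python
def scale_mv_cross_layer(mv, poc_cur, poc_ref_neighbor, poc_ref_target,
                         ref_layer_size, cur_layer_size):
    """Step 1: temporal scaling by POC distance.
    Step 2: spatial scaling by the layer resolution ratio."""
    mv = scale_mv_by_poc(mv, poc_cur, poc_ref_neighbor, poc_ref_target)
    sx = cur_layer_size[0] / ref_layer_size[0]
    sy = cur_layer_size[1] / ref_layer_size[1]
    return [round(mv[0] * sx), round(mv[1] * sy)]
```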
  • the method, when executed at an encoder, comprises adding the predicted motion information to a candidate list, e.g. to an advanced motion vector prediction (AMVP) list or to a merge list.
  • Adding the motion vector to the list, and its location in the list, can be controlled by, for example, a picture/frame/slice-level flag. This flag can, for example, be disabled for videos with less motion activity, or based on the video type (e.g. non-360-degree videos).
  • the proposed motion vector candidate may be added to the merge or AMVP list at the beginning, middle or end of the list. It may be added to the list by increasing the number of candidates, or it may replace one of the candidates, keeping the length of the list fixed. Under some conditions (e.g. when the neighboring blocks have very diverse motion vectors, or the model parameters cannot be calculated), the proposed candidate may not be added to the list at all; one possible insertion policy is sketched below.
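An illustrative sketch of one such insertion policy (fixed list length, configurable position); the position, the None guard and the list representation are illustrative choices rather than details from the text:

```python
def add_model_candidate(cand_list, candidate, position=0, max_len=5):
    """Insert the model-based MV candidate into an AMVP/merge-style list,
    dropping the last entry to keep the list length fixed."""
    if candidate is None:              # model parameters could not be fit
        return cand_list
    out = list(cand_list)
    out.insert(min(position, len(out)), tuple(candidate))
    return out[:max_len]
```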
  • the solution presented in this description can be executed at an encoder or a decoder.
  • the proposed motion vector may be used as the initial point for motion search.
  • An apparatus comprises means for obtaining motion information comprising a motion vector and locations of neighboring blocks of a video frame; means for determining parameters of a model using the obtained motion information; and means for determining a predicted motion vector using the model and a location of a current block.
  • the apparatus further comprises means for adding the predicted motion vector to a candidate list.
  • These means comprise at least one processor, and a memory including computer program code comprising one or more operational characteristics.
  • FIG. 8 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder.
  • Figure 9 shows a layout of an apparatus according to an embodiment.
  • the electronic device 50 may for example be a mobile terminal or a user equipment of a wireless communication system or a camera device.
  • the electronic device 50 may also be comprised in a local or a remote server or in a graphics processing unit of a computer.
  • the device may also be comprised as part of a head-mounted display device.
  • the apparatus 50 may comprise a housing 30 for incorporating and protecting the device.
  • the apparatus 50 may further comprise a display 32 in the form of a liquid crystal display.
  • the display may be any suitable display technology suitable to display an image or video.
  • the apparatus 50 may further comprise a keypad 34.
  • any suitable data or user interface mechanism may be employed.
  • the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display.
  • the apparatus may comprise a microphone 36 or any suitable audio input which may be a digital or analogue signal input.
  • the apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection.
  • the apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator).
  • the apparatus may further comprise a camera 42 capable of recording or capturing images and/or video.
  • the camera 42 is a multi-lens camera system having at least two camera sensors.
  • the camera is capable of recording or detecting individual frames which are then passed to the codec 54 or the controller for processing.
  • the apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
  • the apparatus 50 may further comprise an infrared port for short range line of sight communication to other devices.
  • the apparatus may further comprise any suitable short-range communication solution, such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/FireWire wired connection.
  • the apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50.
  • the apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58 which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry.
  • the controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or assisting in coding and decoding carried out by the controller.
  • the apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
  • the apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network.
  • the apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
  • the apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection.
  • the wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).
  • HDMI High-Definition Multimedia Interface
  • MHL Mobile High-definition Link
  • DVI Digital Visual Interface
  • the various embodiments may provide advantages. For example, the changes are local, with minimal changes needed to add the new candidate to the AMVP/merge lists, and no change to the bitstream syntax.
  • the model is generic and locally adaptive. It can model different changes in motion, or deformation of objects, in different ways. For example, it can support zooming in and out, rotation, and object deformation, for example in 360-degree video formats.
  • the various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention.
  • a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment.
  • a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a video coding method comprising obtaining motion information comprising at least one motion vector and a location of at least one neighboring block of a video frame (410); determining the parameters of a model using the obtained motion information (420); determining a predicted motion vector using the model and a location of a current block (430); and adding the predicted motion vector to a candidate list (440). The invention also relates to an apparatus and a computer program product.

Description

A METHOD AND AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR VIDEO ENCODING AND DECODING
Technical Field
The present solution generally relates to video encoding and decoding.
Background
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that
could be pursued, but are not necessarily the ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims and is not admitted to be prior art by inclusion in this section.
A video coding system may comprise an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. The encoder may discard some information in the original video sequence in order to represent the video in a more compact form, for example, to enable the storage/transmission of the video information at a lower bitrate than otherwise might be needed.
Summary
Now there has been invented an improved method and technical equipment implementing the method. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method for video coding, comprising obtaining motion information comprising at least one motion vector and a location of at least one neighboring block of a video frame; determining parameters of a model using the obtained motion information; and determining a predicted motion vector using the model and a location of a current block. According to a second aspect, there is provided an apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: obtain motion information comprising a motion vector and locations of neighboring blocks of a video frame; determine parameters of a model using the obtained motion information; and determine a predicted motion vector using the model and a location of a current block.
According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to obtain motion information comprising a motion vector and locations of neighboring blocks of a video frame; determine parameters of a model using the obtained motion information; and determine a predicted motion vector using the model and a location of a current block.
According to an embodiment, the model is a function that relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix. According to an embodiment, the parameters comprise coefficients of the weight and the offset matrix of said model.
According to an embodiment, the predicted motion vector is added to a candidate list. According to an embodiment, the predicted motion vector is used as an initial search point for motion estimation.
According to an embodiment, the candidate list is an advanced motion vector predictor list.
According to an embodiment, the candidate list is a merge list.
According to an embodiment, the neighboring blocks are on the same layer as the current block.
According to an embodiment, at least some of the neighboring blocks are on different layer(s) or view(s) than the current block. According to an embodiment, the method is executed at an encoder and/or in a decoder.
Description of the Drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows an encoder according to an embodiment;
Fig. 2 shows a decoder according to an embodiment;
Fig. 3a,3b show examples of motion vector candidate positions;
Fig. 4 is a flowchart illustrating a method according to an embodiment;
Fig. 5 shows an example of a block to be encoded/decoded;
Fig. 6 shows an example of locally adaptive motion vector prediction for multilayer (scalable) prediction;
Fig. 7 shows an example of locally adaptive motion vector prediction for multiview prediction;
Fig. 8 shows an apparatus according to an embodiment in a simplified block chart, and
Fig. 9 shows a layout of an apparatus according to an embodiment.
Description of Example Embodiments
In the following, several embodiments of the invention will be described in the context of video coding. In particular, the several embodiments enable determining locally adaptive motion vector predictor for video coding.
Many hybrid video codecs, including H.264/AVC and HEVC, encode video information in two phases. In the first phase, predictive coding is applied for example as so-called sample prediction and/or so-called syntax prediction. The second phase is coding the error between the predicted block of pixels or samples and the original block of pixels or samples.
In the sample prediction relating to the first phase, pixel or sample values in a certain picture area or "block" are predicted. These pixel or sample values can be predicted, for example, using one or more of the following ways:
- Motion compensation mechanisms (which may also be referred to as temporal prediction or motion compensated temporal prediction or motion-compensated prediction or MCP), which involve finding and indicating an area in one of the previously encoded video frames that corresponds closely to the block being coded;
- Intra prediction, where pixel or sample values can be predicted by spatial mechanisms which involve finding and indicating a spatial region relationship.
In the syntax prediction relating to the first phase, which may also be referred to as parameter prediction, syntax elements and/or syntax element values and/or variables derived from syntax elements are predicted from syntax elements (de)coded earlier and/or variables derived earlier. Non-limiting examples of syntax prediction are provided below:
- In motion vector prediction, motion vectors are coded e.g. for inter and/or inter-view prediction.
- The block partitioning, e.g. from CTU (Coding Tree Unit) to CUs (Coding Unit) and down to PUs (Prediction Unit), may be predicted.
- In filter parameter prediction, the filtering parameters e.g. for sample adaptive offset may be predicted.
A video codec comprises an encoder that transforms an input video into a compressed representation suited for storage/transmission and a decoder that can uncompress the compressed video representation back into a viewable form. An image codec or a picture codec is similar to a video codec, but it encodes each input picture independently from other input pictures and decodes each coded picture independently from other coded pictures. It needs to be understood that whenever a video codec, video encoding or encoder, or video decoder or decoding is referred below, the text similarly applies to an image codec, image encoding or encoder, or image decoder or decoding, respectively.
Prediction approaches using image information from a previously coded image can also be called inter prediction methods, which may also be referred to as temporal prediction and motion compensation. Prediction approaches using image information within the same image can also be called intra prediction methods.
As already mentioned, the second phase is one of coding the error between the predicted block of pixels or samples and the original block of pixels or samples. This may be accomplished by transforming the difference in pixel or sample values using a specified transform. This transform may be e.g. a Discrete Cosine Transform (DCT) or a variant thereof. After transforming the difference, the transformed difference may be quantized and entropy coded.
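To make the transform-quantize step concrete, here is a minimal Python sketch using SciPy's DCT; the 8x8 block size and the flat quantization step are illustrative assumptions, not codec-specified values:

```python
import numpy as np
from scipy.fft import dctn, idctn

def code_residual(diff_block, q_step=10.0):
    """Transform an 8x8 prediction-error block with a 2-D DCT,
    quantize the coefficients, then reconstruct the decoder-side residual."""
    coeffs = dctn(diff_block, norm="ortho")
    q = np.round(coeffs / q_step)            # this is where information is discarded
    recon = idctn(q * q_step, norm="ortho")  # decoder-side inverse path
    return q, recon

rng = np.random.default_rng(0)
q, recon = code_residual(rng.normal(scale=4.0, size=(8, 8)))
```

Varying q_step is exactly the fidelity knob described in the next paragraph: a larger step shrinks the entropy-coded payload at the cost of reconstruction accuracy.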
By varying the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel or sample representation (i.e. the visual quality of the picture) and the size of the resulting encoded video representation (i.e. the file size or transmission bit rate).
The decoder reconstructs the output video by applying a prediction mechanism similar to that used by the encoder in order to form a predicted representation of the pixel or sample blocks (using the motion or spatial information created by the encoder and included in the compressed representation of the image) and prediction error decoding (the inverse operation of the prediction error coding to recover the quantized prediction error signal in the spatial domain).
After applying pixel or sample prediction and error decoding processes the decoder combines the prediction and the prediction error signals (the pixel or sample values) to form the output video frame.
The decoder (and encoder) may also apply additional filtering processes in order to improve the quality of the output video before passing it for display and/or storing as a prediction reference for the forthcoming pictures in the video sequence.
An example of an encoding process is illustrated in Figure 1. Figure 1 illustrates an image to be encoded (In); a predicted representation of an image block (P'n); a prediction error signal (Dn); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); a transform (T) and inverse transform (T⁻¹); a quantization (Q) and inverse quantization (Q⁻¹); entropy encoding (E); a reference frame memory (RFM); inter prediction (Pinter); intra prediction (Pintra); mode selection (MS) and filtering (F).
An example of a decoding process is illustrated in Figure 2. Figure 2 illustrates a predicted representation of an image block (P'n); a reconstructed prediction error signal (D'n); a preliminary reconstructed image (I'n); a final reconstructed image (R'n); an inverse transform (T⁻¹); an inverse quantization (Q⁻¹); an entropy decoding (E⁻¹); a reference frame memory (RFM); a prediction (either inter or intra) (P); and filtering (F).
A picture given as an input to an encoder may also be referred to as a source picture, and a picture decoded by a decoder may be referred to as a decoded picture. The source and decoded pictures are each comprised of one or more sample arrays, such as one of the following sets of sample arrays:
- Luma (Y) only (monochrome).
- Luma and two chroma (YCbCr or YCgCo).
- Green, Blue and Red (GBR, also known as RGB).
- Arrays representing other unspecified monochrome or tri-stimulus color samplings (for example, YZX, also known as XYZ).
The term pixel may refer to the set of spatially collocated samples of the sample arrays of the color components. Sometimes, depending on the context, the term pixel may refer to a sample of one sample array only.
In some coding systems, a picture may either be a frame or a field, while in some coding systems a picture may be constrained to be a frame. A frame comprises a matrix of luma samples and possibly the corresponding chroma samples. A field is a set of alternate sample rows of a frame and may be used as encoder input when the source signal is interlaced.
In many video codecs, including H.264/AVC and HEVC, motion information is indicated by motion vectors associated with each motion compensated image block. Each of these motion vectors represents the displacement of the image block in the picture to be coded (in the encoder) or decoded (at the decoder) relative to the prediction source block in one of the previously coded or decoded pictures. H.264/AVC and HEVC, as many other video compression standards do, divide a picture into a mesh of rectangles, for each of which a similar block in one of the reference pictures is indicated for inter prediction. The location of the prediction block is coded as a motion vector that indicates the position of the prediction block relative to the block being coded; a minimal sketch of this block-copy operation is given below.
H.264/AVC and HEVC include a concept of picture order count (POC). A value of POC is derived for each picture and is non-decreasing with increasing picture position in output order. POC therefore indicates the output order of pictures. POC may be used in the decoding process, for example for implicit scaling of motion vectors in the temporal direct mode, for implicitly derived weights in weighted prediction, and for reference picture list initialization. Furthermore, POC may be used in the verification of output order conformance.
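The following Python sketch illustrates the motion-compensated fetch just described, for integer-pel motion vectors; the block size, border clamping, and array layout are illustrative assumptions (fractional-pel accuracy, covered in the list below, would add the FIR interpolation step):

```python
import numpy as np

def motion_compensate(ref_frame, x, y, mv, block=(8, 8)):
    """Fetch the prediction block for the block at (x, y) by applying an
    integer-pel motion vector (mvx, mvy) in the reference frame."""
    h, w = ref_frame.shape
    bx, by = block
    sx = int(np.clip(x + mv[0], 0, w - bx))   # clamp to the frame border
    sy = int(np.clip(y + mv[1], 0, h - by))
    return ref_frame[sy:sy + by, sx:sx + bx]
```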
The inter prediction process may use one or more of the following factors.
- The accuracy of motion vector representation. For example, motion vectors may be of quarter-pixel accuracy, and sample values in fractional-pixel positions may be obtained using a finite impulse response (FIR) filter. The accuracy of the motion vector, motion vector prediction and motion vector difference may be different for each block, and may vary between blocks.
- Block partitioning for inter prediction. Many coding standards, including
H.264/AVC and HEVC, allow selection of the size and shape of the block for which a motion vector is applied for motion-compensated prediction in the encoder, and indicating the selected size and shape in the bitstream so that decoders can reproduce the motion-compensated prediction done in the encoder.
- Number of reference pictures for inter prediction. The sources of inter prediction are previously decoded pictures. Many coding standards, including H.264/AVC and HEVC, enable storage of multiple reference pictures for inter prediction and selection of the used reference picture on a block basis. For example, reference pictures may be selected on macroblock or macroblock partition basis in
H.264/AVC and on PU or CU basis in HEVC. Many coding standards, such as H.264/AVC and HEVC, include syntax structures in the bitstream that enable decoders to create one or more reference picture lists. A reference picture index to a reference picture list may be used to indicate which one of the multiple reference pictures is used for inter prediction for a particular block. A reference picture index may be coded by an encoder into the bitstream in some inter coding modes or it may be derived (by an encoder and a decoder) for example using neighboring blocks in some other inter coding modes.
- Motion vector prediction. In order to represent motion vectors efficiently in bitstreams, motion vectors may be coded differentially with respect to a block-specific predicted motion vector. In many video codecs, the predicted motion vectors are created in a predefined way, for example by calculating the median of the encoded or decoded motion vectors of the adjacent blocks (a median-predictor sketch is given further below). Another way to create a motion vector prediction, sometimes referred to as advanced motion vector prediction (AMVP), is to generate a list of candidate predictions from adjacent blocks and/or co-located blocks in temporal reference pictures, or co-located blocks in other layers or views, and to signal the chosen candidate as the motion vector predictor. Example motion vector predictor (MVP) candidate positions are shown in Figures 3a-b. In Figures 3a-b, black dots indicate sample positions directly adjacent to a block X, defining positions of possible MVPs. Figure 3a illustrates spatial MVP positions, and Figure 3b illustrates temporal MVP (TMVP) positions, where Y is the collocated block of X in a reference picture. Positions C0 and C1 are candidates for the TMVP. In addition to predicting the motion vector values, the reference index values can be predicted or obtained from previously coded/decoded blocks and pictures. The reference index may be predicted e.g. from adjacent blocks and/or co-located blocks in a temporal reference picture. Differential coding of motion vectors may be disabled across slice or tile boundaries. In the merge mode, all the motion field information, which includes motion vector(s) and corresponding reference picture index(es) for each available reference picture list, is predicted and used without any modification/correction. In HEVC, for example, this is performed by generating a merge list which includes some candidates from the motion information of the neighboring blocks; the index of the proper candidate is then indicated in the bitstream. Each candidate in the merge list may include the prediction type (uni-prediction or bi-prediction), reference indices and motion vectors.
- Multi-hypothesis motion-compensated prediction. H.264/AVC and HEVC enable the use of a single prediction block in P slices (herein referred to as uni-predictive slices) or a linear combination of two motion-compensated prediction blocks for bi-predictive slices, which are also referred to as B slices. The overlapped block motion compensation (OBMC) method may use prediction from additional reference frames. Individual blocks in B slices may be bi-predicted, uni-predicted, or intra-predicted, and individual blocks in P slices may be uni-predicted or intra-predicted. The reference pictures for a bi-predictive picture need not be limited to the subsequent picture and the previous picture in output order; rather, any reference pictures may be used. In many coding standards, such as H.264/AVC and HEVC, one reference picture list, referred to as reference picture list 0, is constructed for P slices, and two reference picture lists, list 0 and list 1, are constructed for B slices. For B slices, prediction in the forward direction may refer to prediction from a reference picture in reference picture list 0, and prediction in the backward direction to prediction from a reference picture in reference picture list 1, even though the reference pictures for prediction may have any decoding or output order relation to each other or to the current picture.
- Weighted prediction. Many coding standards use a prediction weight of 1 for prediction blocks of inter (P) pictures and 0.5 for each prediction block of a B picture (resulting in averaging). H.264/AVC allows weighted prediction for both P and B slices. In implicit weighted prediction, the weights are proportional to picture order counts (POC), while in explicit weighted prediction, prediction weights are explicitly indicated. Motion vectors may be coded using spatial/temporal candidates. However, this has not been found efficient for some types of video content, for the following reasons:
- The video may include object/scene deformations due to the image format and the different sampling of various regions when a 360-degree video is mapped onto a 2D representation (e.g. equirectangular or cubemap projection).
- The video content may have zooming or rotation, caused by an object or a camera.
- The video content may be deformed due to the characteristics of the capturing device (e.g. fisheye lenses). In the above cases, the magnitude and direction of motion vectors change gradually with the location of the block. Since the known motion vector prediction is based on the assumption that the motions of neighboring blocks are very close or identical, it may be sub-optimal, resulting in very large motion vector differences that consequently require more bits. In cases of very high object motion, the encoder may choose intra prediction instead of inter prediction, which results in a higher bitrate compared to inter prediction.
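The fractional-sample interpolation and the median motion vector prediction mentioned in the list of factors above can be illustrated with short sketches. The first is a minimal half-sample interpolation with an 8-tap FIR filter; the coefficients are of the kind used for HEVC luma interpolation, and the function name and sample values are illustrative.

```python
import numpy as np

# Illustrative 8-tap FIR filter for a half-sample position
# (coefficients of the kind used for HEVC luma interpolation).
HALF_PEL_TAPS = np.array([-1, 4, -11, 40, 40, -11, 4, -1])

def half_pel_sample(row, x):
    """Interpolate the half-sample value between row[x] and row[x + 1]."""
    window = row[x - 3 : x + 5]                        # 8 integer samples around the position
    value = int(np.dot(HALF_PEL_TAPS, window)) >> 6    # filter gain is 64
    return max(0, min(255, value))                     # clip to the 8-bit sample range

row = np.array([100, 102, 104, 110, 120, 130, 134, 136, 138, 140])
print(half_pel_sample(row, 4))                         # value halfway between samples 4 and 5
```

The second sketches the classic median predictor as a component-wise median over the motion vectors of adjacent blocks; the helper and its inputs are hypothetical.

```python
def median_mv_predictor(neighbor_mvs):
    """Component-wise median of the motion vectors of adjacent blocks."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return (xs[mid], ys[mid])

# Hypothetical left, top and top-right neighbor motion vectors.
print(median_mv_predictor([(4, -2), (6, 0), (5, -1)]))   # -> (5, -1)
```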
Previously, the problem has been treated by using motion vectors from neighboring blocks for predicting the motion vector of the current block using a linear function. The parameters of the linear function are fixed and are calculated by a training process on several training test sequences. However, such a solution considers only the motion vector values for the modeling process, with some fixed parameters, and hence is not able to achieve a proper model for the motion. In the present solution, the problem is solved by modelling the motion vector locally based on the location (e.g. a center point or top-left corner) of the blocks in regions of a video frame. The parameters of the model are calculated based on the motion vectors and locations of at least one neighboring block comprising a set of motion vectors. Then a predicted motion vector is calculated using the model and the location of the current block. This predicted motion vector is added to a candidate list, which can be an advanced motion vector prediction list or a merge list, as one of the candidates which can be selected by the encoder. The process may be executed at the encoder side or the decoder side when generating the candidate list. A method according to an embodiment is shown as a flowchart in Figure 4. The method comprises
- obtaining motion information comprising at least one motion vector and a location of at least one neighboring block 410;
- determining parameters of a model using the obtained motion information 420; and
- determining a predicted motion vector using the model and a location of a current block 430. The method illustrated in Figure 4 can be executed in an encoder or a decoder. When the method is executed at the encoder, the method further comprises adding the predicted motion vector to a candidate list 440.
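As a minimal sketch of the steps above, the fragment below fits the cross-component linear model to the motion information of neighboring blocks by least squares and evaluates it at the location of the current block. The function names, the least-squares choice, and the example values are illustrative assumptions; any of the training processes discussed below could be used instead.

```python
import numpy as np

def fit_motion_model(locations, mvs):
    """Fit the cross-component linear model MV = A @ [X, Y] + b by least
    squares over the motion information of the neighboring blocks (420)."""
    X = np.hstack([np.asarray(locations, float), np.ones((len(locations), 1))])
    params, *_ = np.linalg.lstsq(X, np.asarray(mvs, float), rcond=None)
    return params                     # 3x2 matrix: weights for X and Y, plus offsets

def predict_mv(params, location):
    """Evaluate the fitted model at the location of the current block (430)."""
    x, y = location
    return np.array([x, y, 1.0]) @ params

# Motion information of neighboring blocks: (center location, motion vector) (410).
neighbors = [((16, 16), (2, 1)), ((48, 16), (4, 1)), ((16, 48), (2, 3)),
             ((48, 48), (4, 3)), ((80, 16), (6, 1))]
params = fit_motion_model([n[0] for n in neighbors], [n[1] for n in neighbors])
candidate = predict_mv(params, (80, 80))    # -> approximately (6, 5)
# At the encoder, the candidate is then appended to the AMVP or merge list (440).
```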
The method of Figure 4 may be applied to single-layer coding, where all the neighboring blocks are on the same layer, or to multilayer or multiview coding, where neighboring blocks are on the same layer/view but may also be on other layers/views.
The model referred to in the method of the flowchart of Figure 4 may be defined as follows. It is assumed that the motion information of different blocks and/or regions of a video frame is modelled with a specific function. This function relates the motion vector (x and y components) of a block to its location (e.g. the x and y coordinates of the center of the block). Different models can be used for this purpose. In one sense, the model can be cross-component, i.e. the x and y components of the motion vector are related to both the x and y coordinates of the block location. Alternatively, the model can be limited to each component, i.e. the x and y components of the motion vector are related only to the x and y coordinates of the block location, respectively, or vice versa (i.e., x to y, and y to x). In another sense, the model can be linear, polynomial, or any other general function. In the present solution, the model is used to model the motion only in a small region of the video, so a linear model should work efficiently for this purpose.
Some examples of the model are presented below, where MVx and MVy are the x and y components of the motion vector of a block; X and Y are the location (e.g. center coordinates) of the block; f(.) can be any function, such as a sinusoidal, exponential or logarithmic function; X0, Y0, MV0x, MV0y are fixed values that, for example, may be calculated based on the motion vectors and locations of neighboring blocks; and a, b, c are parameters obtained by a training process using neighboring blocks. The function relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix. The parameters used in the model comprise the coefficients of the weight and offset matrices of the model. It is appreciated that the models presented below are examples, and any other model can be used instead.
\[
\begin{bmatrix} MV_x \\ MV_y \end{bmatrix}
=
\begin{bmatrix} a_{xx} & a_{xy} \\ a_{yx} & a_{yy} \end{bmatrix}
\begin{bmatrix} X \\ Y \end{bmatrix}
+
\begin{bmatrix} b_x \\ b_y \end{bmatrix}
\]

\[
MV_x = a_x X + b_x , \qquad MV_y = a_y Y + b_y
\]

A function-based form consistent with the parameters defined above may, for example, be:

\[
MV_x = MV0_x + a_x f(X - X_0) , \qquad MV_y = MV0_y + a_y f(Y - Y_0)
\]
Each model has several parameters that are calculated for each block using the motion information (i.e. a set of motion vectors and the location of each block) of at least one neighboring block. A neighboring block can be one of the following: a block on top of the current block, a block on the left of the current block, a block on the top-left of the current block, a block on the bottom-left of the current block, or a block on the top-right of the current block. For this purpose, the motion vectors and locations (e.g. center points) of said at least one neighboring block are collected, and a training process (e.g. linear regression, polynomial regression, logistic regression, RANSAC (Random Sample Consensus), etc.) may be used to calculate the parameters.
To have a better model and a more accurate motion vector estimation, not only the adjacent neighboring blocks but also, or instead, the blocks in a neighboring region may be used as the training set. A neighboring region comprises blocks located at a certain distance from the current block, where the distance may be defined as two or more blocks in a certain direction, e.g. up or left. The size of the neighboring region may be selected based on the size of the current block. For example, the blocks in the top, left, top-left, bottom-left and top-right regions of the current block may be used to train the model. The size of a neighboring block may be considered in the training process; for example, the information of larger blocks may have more influence on the model's parameters. In the HEVC standard, for example, motion information may be stored at 4x4 block accuracy. One alternative, as shown in Figure 5, is therefore to use the motion information of the 4x4 blocks in the neighboring regions. In this case, the motion vector of a block larger than 4x4 may be considered several times in the training process. In this way, the influence of the motion vector of each block in model extraction becomes proportional to the size of that block.
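A sketch of the 4x4-granularity alternative just described, assuming a simple (x, y, width, height, mv) block layout: each neighboring block contributes its motion vector once per 4x4 unit it covers, so its influence on the later fit is proportional to its size.

```python
def expand_to_4x4_units(blocks):
    """Replicate each neighboring block's motion vector once per 4x4 unit it
    covers. blocks: list of (x, y, width, height, mv) tuples (hypothetical
    layout). Returns 4x4-unit center locations and their motion vectors."""
    locations, mvs = [], []
    for x, y, w, h, mv in blocks:
        for dy in range(0, h, 4):
            for dx in range(0, w, 4):
                locations.append((x + dx + 2, y + dy + 2))  # 4x4 unit center
                mvs.append(mv)
        # an NxM block thus contributes (N/4)*(M/4) training samples
    return locations, mvs

# A 16x8 neighbor contributes 8 samples, a 4x4 neighbor only one.
locs, mvs = expand_to_4x4_units([(0, 0, 16, 8, (2, 1)), (16, 0, 4, 4, (3, 1))])
print(len(locs))   # -> 9
```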
The parameters of the model are calculated based on the motion vectors and locations of at least one neighboring block comprising a set of motion vectors. The more neighboring blocks there are, the more accurate the predicted motion vector will be. Each model that is used in the present solution has several parameters that are calculated based on the neighboring block(s). If there are enough motion vectors in the neighboring blocks (i.e. more than the number of the model's parameters), a fitting can be performed, e.g. by minimizing the mean square error. If there are exactly as many motion vectors in the neighboring blocks as there are parameters, a system of equations can be built and the exact values of the parameters can be calculated. If there are fewer motion vectors than required, it is possible to make an assumption about some of the parameters (for example, assuming that certain parameters are 1.0 or 0, or equal to each other), and the remaining parameters can be calculated using one of the abovementioned methods. It should be understood that each motion vector has two components, in the x and y directions, which count as two values for the training. So, for example, if a neighboring block is coded in uni-prediction mode, it has one motion vector, hence two values for the training, and if a neighboring block is coded in bi-prediction mode, it has two motion vectors, hence four values for the training process. Other information of the neighboring block (e.g. whether it is coded using adaptive MVP mode, or its affine parameters) may be used in the training process. The calculated motion vector may be quantized to the precision of the motion vector or the motion vector difference. Alternatively, the predicted motion vector may be kept at a higher precision.
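The three regimes described above can be sketched for the simple per-component model MVx = a*X + b: a least-squares fit when there are more training values than parameters, an exact solve when the counts match, and an assumption-based fallback when there are fewer. The fallback shown (fixing a to 0) is only one of the possible assumptions.

```python
import numpy as np

def fit_component(xs, values, n_params=2):
    """Estimate the per-component model MV = a * coord + b from neighbor
    samples; all names and the fallback choice are illustrative."""
    A = np.column_stack([xs, np.ones(len(xs))])
    if len(xs) > n_params:
        (a, b), *_ = np.linalg.lstsq(A, values, rcond=None)  # over-determined: fit
    elif len(xs) == n_params:
        a, b = np.linalg.solve(A, values)                    # exact system of equations
    else:
        a, b = 0.0, float(np.mean(values))                   # under-determined fallback
    return a, b

print(fit_component(np.array([16.0, 48.0, 80.0]), np.array([2.0, 4.0, 6.0])))
# -> (0.0625, 1.0), i.e. MVx = X / 16 + 1
```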
During the parameter calculation process, some of the neighboring blocks may relate to a different object than the one to which the current block belongs. Therefore, the blocks whose motion information is not in harmony with the current block or the other neighboring blocks can be removed. This outlier removal can be done, for example, by running the training process twice. Alternatively, other simple methods can be used to eliminate outliers, for example by classifying the neighboring blocks into, say, two classes and extracting a separate model and motion vector predictor for each class.
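A minimal sketch of running the training process twice for outlier removal: fit, drop the samples with the largest residuals, and refit. The kept fraction is an illustrative choice, not specified by the method.

```python
import numpy as np

def fit_with_outlier_removal(locations, mvs, keep_ratio=0.75):
    """Fit the linear model, discard the samples whose motion information
    disagrees most with the first fit, then refit on the remainder."""
    X = np.hstack([np.asarray(locations, float), np.ones((len(locations), 1))])
    Y = np.asarray(mvs, float)

    params, *_ = np.linalg.lstsq(X, Y, rcond=None)            # first training pass
    residuals = np.linalg.norm(X @ params - Y, axis=1)
    keep = residuals.argsort()[: max(3, int(len(Y) * keep_ratio))]

    params, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)  # second pass
    return params

locs = [(16, 16), (48, 16), (16, 48), (48, 48), (80, 16), (80, 80)]
mvs = [(2, 1), (4, 1), (2, 3), (4, 3), (6, 1), (0, 0)]   # last sample is an outlier
print(fit_with_outlier_removal(locs, mvs))
```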
In the HEVC standard, for example, each P-frame may have several reference frames, so each block may be predicted from different reference frames. Because of the motion in the temporal domain, a block may have different motion vectors for different reference frames. This issue therefore needs to be carefully considered in the training phase and in motion vector prediction. In the case of the AMVP list, the motion vector of a block is predicted for a given reference frame (with a specific POC). In the case of the merge list, the motion vector may be calculated for one reference frame (as in uni-prediction) or for two or more reference frames (e.g. bi-prediction or OBMC). In both the AMVP and merge cases, if the neighboring blocks are predicted from different reference frame(s) than the current reference frame, their motion information should either be eliminated from the training process or be scaled according to the POC numbers. In the merge case, as another example, priority may be given to the reference frame that is most used in the neighboring blocks. In the case of bi-prediction, where there are two reference frames, the proposed process may be executed twice, once for each reference frame, and in each execution the training process may use the related neighboring motion information. Alternatively, it is possible to run the training process once for one of the reference frames, extract the model, and estimate the motion vector for that reference frame; for the other reference frame, the motion vector may be calculated by scaling the motion vectors using the POC values of the current frame and the reference frames. This approach can reduce the computational complexity of the method.
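The POC-based scaling of a neighbor's motion vector can be sketched as scaling by the ratio of POC distances; the function and its example values are illustrative.

```python
def scale_mv_by_poc(mv, poc_cur, poc_neighbor_ref, poc_target_ref):
    """Scale a neighbor's motion vector from its own reference picture to the
    reference picture of the current block, proportionally to the POC
    distances (the usual temporal scaling assumption)."""
    scale = (poc_cur - poc_target_ref) / (poc_cur - poc_neighbor_ref)
    return (mv[0] * scale, mv[1] * scale)

# Neighbor points two pictures back; current block references one picture back.
print(scale_mv_by_poc((8, -4), poc_cur=10, poc_neighbor_ref=8, poc_target_ref=9))
# -> (4.0, -2.0)
```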
Previously, it has been discussed that neighboring motion vectors come from the same layer and/or view. It is, however, also possible, particularly in the case of multilayer coding, that neighboring motion vectors come from other layers and views as well. For example, motion information of the blocks on the right and bottom side of the current block, which are not yet coded, or of blocks that are coded in intra mode, may come from previous layer(s) (after proper scaling) or from other view(s). In such a case, two candidates may be added to the candidate list: one may be calculated based on the neighboring blocks that are predicted from the same reference frame, and the other may be calculated from the neighboring blocks of the co-located block in the other layer(s) or view(s). Figures 6 and 7 illustrate the collection of motion vector information in the cases of multilayer (scalable) and multiview prediction, respectively. It must be understood that the additional motion vector information collected from other layer(s) or view(s) is not limited to the illustrated examples, but can include any candidates that are not available in the current frame. In Figure 6, i.e. in scalable prediction, the reference layer 610 may have the same or a different resolution than the current layer 600. Moreover, the quality may differ from that of the current layer 600 video. For the motion vector information collected from the reference layer 610, a two-step scaling may be applied as follows: scaling the motion vectors according to the POC number, and scaling the motion vectors according to the resolution of the reference layer 610.
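A sketch of the two-step scaling applied to motion vectors collected from the reference layer: temporal (POC) scaling followed by spatial scaling by the resolution ratio between the layers. Argument names are assumptions.

```python
def scale_mv_from_reference_layer(mv, poc_scale, ratio_x, ratio_y):
    """Two-step scaling of a motion vector collected from the reference layer:
    first by POC distance, then by the resolution ratio between the current
    layer and the reference layer."""
    mvx, mvy = mv[0] * poc_scale, mv[1] * poc_scale   # step 1: temporal scaling
    return (mvx * ratio_x, mvy * ratio_y)             # step 2: spatial scaling

# Reference layer at half resolution, same POC distance (scale 1.0).
print(scale_mv_from_reference_layer((3, -1), poc_scale=1.0, ratio_x=2.0, ratio_y=2.0))
# -> (6.0, -2.0)
```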
In addition, as discussed (and referred to in Figure 4), the method - when executed at an encoder - comprises adding the predicted motion information to a candidate list, e.g. to an advanced motion vector prediction (AMVP) list or to a merge list. Adding the motion vector to the list, and its position in the list, can be controlled by, for example, a picture/frame/slice level flag. This flag can, for example, be disabled for videos with less motion activity or based on the video type (e.g. non-360-degree videos). The proposed motion vector candidate may be added to the merge or AMVP list at the beginning, middle or end of the list. It may be added by increasing the number of candidates, or it may replace one of the candidates so that the length of the list stays fixed. Under some conditions (e.g. when the neighboring blocks have very diverse motion vectors, or the model parameters cannot be calculated), the proposed candidate may not be added to the list.
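How the model-based candidate could be placed into an AMVP or merge list under the controls just described (enable flag, list position, fixed list length) can be sketched as follows; the exact policy is a design choice, not mandated by the method.

```python
def add_model_candidate(candidate_list, candidate, enabled, max_len, position):
    """Insert the model-based predictor into an AMVP/merge candidate list,
    subject to the slice-level flag, list position and fixed maximum length."""
    if not enabled or candidate is None:   # e.g. parameters could not be fit
        return candidate_list
    out = list(candidate_list)
    out.insert(position, candidate)        # beginning, middle or end of the list
    return out[:max_len]                   # fixed length: last candidate drops out

merge_list = [(0, 0), (2, 1), (4, 1), (2, 3), (1, 1)]
print(add_model_candidate(merge_list, (4, 3), enabled=True, max_len=5, position=0))
# -> [(4, 3), (0, 0), (2, 1), (4, 1), (2, 3)]
```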
As mentioned, the solution presented in this description can be executed at an encoder or a decoder. At the encoder, the proposed motion vector may be used as the initial point for motion search. At the decoder, however, it is not necessary to calculate the proposed motion vector prediction for all blocks; the process may be executed only for those blocks for which this motion vector is used as the predictor. Therefore, the complexity at the decoder side can be reduced.
An apparatus according to an embodiment comprises means for obtaining motion information comprising a motion vector and locations of neighboring blocks of a video frame; means for determining parameters of a model using the obtained motion information; and means for determining a predicted motion vector using the model and a location of a current block. When the apparatus is an encoder, the apparatus further comprises means for adding the predicted motion vector to a candidate list. These means comprise at least one processor and a memory including computer program code comprising one or more operational characteristics.
Said operational characteristics are defined through configuration by said computer program code based on the type of said processor, wherein a system is connectable to said processor by a bus, and wherein a programmable operational characteristic of the system comprises: obtaining motion information comprising a motion vector and locations of neighboring blocks of a video frame; determining parameters of a model using the obtained motion information; and determining a predicted motion vector using the model and a location of a current block. An example of an apparatus is shown in Figure 8. Figure 8 shows a block diagram of a video coding system according to an example embodiment as a schematic block diagram of an electronic device 50, which may incorporate a codec. In some embodiments the electronic device may comprise an encoder or a decoder. Figure 9 shows a layout of an apparatus according to an embodiment. The electronic device 50 may, for example, be a mobile terminal, a user equipment of a wireless communication system, or a camera device. The electronic device 50 may also be comprised in a local or remote server or in a graphics processing unit of a computer. The device may also be comprised as part of a head-mounted display device.
The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 may further comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any display technology suitable for displaying an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or as a data entry system as part of a touch-sensitive display.
The apparatus may comprise a microphone 36 or any suitable audio input, which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device, which in embodiments of the invention may be any one of an earpiece 38, a speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device, such as a solar cell, fuel cell or clockwork generator). The apparatus may further comprise a camera 42 capable of recording or capturing images and/or video. The camera 42 is a multi-lens camera system having at least two camera sensors. The camera is capable of recording or detecting individual frames, which are then passed to the codec 54 or the controller for processing. The apparatus may receive the video and/or image data for processing from another device prior to transmission and/or storage.
The apparatus 50 may further comprise an infrared port for short-range line-of-sight communication to other devices. According to an embodiment, the apparatus may further comprise any suitable short-range communication solution, such as for example a Bluetooth wireless connection or a USB (Universal Serial Bus)/FireWire wired connection. The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The apparatus or the controller 56 may comprise one or more processors or processor circuitry and be connected to memory 58, which may store data in the form of image, video and/or audio data, and/or may also store instructions for implementation on the controller 56 or to be executed by the processors or the processor circuitry. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of image, video and/or audio data or for assisting in coding and decoding carried out by the controller. The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a UICC (Universal Integrated Circuit Card) and UICC reader, for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network. The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals, for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 44 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es). The apparatus may comprise one or more wired interfaces configured to transmit and/or receive data over a wired connection, for example an electrical cable or an optical fiber connection. Such a wired interface may be configured to operate according to one or more digital display interface standards, such as for example High-Definition Multimedia Interface (HDMI), Mobile High-definition Link (MHL), or Digital Visual Interface (DVI).
The various embodiments may provide advantages. For example, the changes are local, with minimal changes for adding a new candidate to the AMVP/merge lists and no change to the bitstream syntax. The model is generic and locally adaptive: it can model different changes in motion or deformations of objects in different ways. For example, it can support zooming in and out, rotation, and object deformation, for example in 360-degree video formats. The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device such as a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims:
1. A method for video coding, comprising:
- obtaining motion information comprising at least one motion vector and a location of at least one neighboring block of a video frame;
- determining parameters of a model using the obtained motion information; and
- determining a predicted motion vector using the model and a location of a current block.
2. The method according to claim 1, wherein the model is a function that relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix.
3. The method according to claim 2, wherein the parameters comprise coefficients of the weight and the offset matrix of said model.
4. The method according to any of the claims 1 to 3, further comprising adding the predicted motion vector to a candidate list.
5. The method according to any of the claims 1 to 4, further comprising using the predicted motion vector as an initial search point for motion estimation.
6. The method according to claim 4, wherein the candidate list is an advanced motion vector predictor list.
7. The method according to claim 4, wherein the candidate list is a merge list.
8. The method according to any of the claims 1 to 7, wherein the neighboring blocks are on a same layer as the current block.
9. The method according to any of the claims 1 to 8, wherein at least some of the neighboring blocks are on a different layer(s) or view(s) than the current block.
10. The method according to any of the claims 1 to 9, executed at an encoder and/or a decoder.
11. An apparatus comprising at least one processor and memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: - obtain motion information comprising a motion vector and locations of neighboring blocks of a video frame;
- determine parameters of a model using the obtained motion information; and
- determine a predicted motion vector using the model and a location of a current block.
12. The apparatus according to claim 11, wherein the model is a function that relates the motion vector of a block to the location of that block in the frame with a weight and an offset matrix.
13. The apparatus according to claim 12, wherein the parameters comprise coefficients of the weight and the offset matrix of said model.
14. The apparatus according to any of the claims 11 to 13, further comprising program code configured to cause the apparatus to add the predicted motion vector to a candidate list.
15. The apparatus according to any of the claims 11 to 14, further comprising program code configured to cause the apparatus to use the predicted motion vector as an initial search point for motion estimation.
16. The apparatus according to claim 14, wherein the candidate list is an advanced motion vector predictor list or a merge list.
17. The apparatus according to any of the claims 11 to 16, wherein the neighboring blocks are on a same layer as the current block.
18. The apparatus according to any of the claims 11 to 17, wherein at least some of the neighboring blocks are on a different layer(s) or view(s) than the current block.
19. The apparatus according to any of the claims 11 to 18, wherein said apparatus is an encoder and/or a decoder.
20. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to implement a method according to any of the claims 1 to 10.