
WO2012050185A1 - Video processing device, video processing method, and video processing program - Google Patents

Video processing device, video processing method, and video processing program

Info

Publication number
WO2012050185A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
foreground
video
image
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2011/073639
Other languages
English (en)
Japanese (ja)
Inventor
健史 筑波
正宏 塩井
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sharp Corp
Original Assignee
Sharp Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sharp Corp filed Critical Sharp Corp
Publication of WO2012050185A1
Anticipated expiration: legal status Critical
Current legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Definitions

  • the present invention relates to a video processing device, a video processing method, and a video processing program.
  • Terminals with an imaging function, such as digital video cameras, digital still cameras, and mobile phones, are spreading rapidly.
  • There are also apparatuses that generate new video by processing and editing video shot by these cameras. For example, a specific image region in a video can be extracted and used as a component for video editing, or a video can be extracted for each individual image region and used for managing and searching the original video. Specifically, the following methods (1) and (2) are known as methods for extracting a desired image region from a video.
  • The chroma key process extracts a desired object (foreground region) by capturing the object against a background of a fixed color (for example, blue) and removing the portions of that fixed color from the captured image. With this process, the video is separated into the video of the foreground image region and the video of the background image region.
  • Non-Patent Document 1 describes a technique in which, for a gray-scale image, the user assigns in advance a marker indicating the foreground to the desired image region (foreground image region) and a marker indicating the background to the other region (background image region), and the foreground image region is then extracted by graph cuts based on the markers assigned to the foreground image region and the background image region.
  • Non-Patent Document 2 describes a technique for extracting a foreground image region by applying graph cuts to a color image.
  • Non-Patent Document 3 describes a technique in which a map (trimap) is created using three markers: a foreground image region, a background image region, and an unknown image region (an image region not yet determined to belong to either the foreground image region or the background image region), and the foreground image region is extracted by estimating the mixture ratio α (matte) of foreground and background pixels in the unknown region.
  • Patent Document 1 describes a foreground image region extraction method based on color information and depth information.
  • Patent Document 2 describes a technique for extracting a foreground image region by first roughly extracting the foreground image region from depth information (the distance from the camera to the subject) and then recursively repeating region division and integration based on color information.
  • However, the chroma key process has the drawback that when the object (foreground region) contains the fixed background color or a similar color, that area is judged to be part of the background image region; in other words, the object cannot be reliably extracted.
  • The chroma key process also has the drawback that when the background is not a uniform color, the non-uniform portions of the background image region are extracted as part of the object; again, the object cannot be reliably extracted.
  • In the techniques of Non-Patent Documents 1 to 3 and Patent Documents 1 and 2, when the color distributions of the object and the background region are similar, or when the object and the background region have similar patterns (textures), the boundary of the region cannot be determined and missing portions appear in the extracted object.
  • Conversely, part of the background region may be erroneously extracted as the object. That is, the conventional techniques have the drawback that the object cannot be reliably extracted.
  • Furthermore, the extracted shape of the target object becomes discontinuous in the time direction, so that flicker (flickering) occurs.
  • the present invention has been made in view of the above points, and provides a video processing apparatus, a video processing method, and a video processing program capable of reliably extracting an image of an object.
  • The present invention has been made to solve the above problems. One aspect of the present invention is a video processing apparatus that extracts foreground region information indicating a foreground image from the video indicated by video information, the apparatus comprising a foreground region correction unit that corrects the foreground image indicated by the foreground region information at a first time using the foreground region information and the video information at a second time.
  • In another aspect of the present invention, the foreground region correction unit corrects the foreground image indicated by the foreground region information at the first time using the foreground region information and the video information at a plurality of second times.
  • In another aspect of the present invention, the video indicated by the video information includes an image of a predetermined target object, and the foreground region correction unit comprises: a movement amount calculation unit that calculates, based on the video information and the target image region information indicating the image of the object at the first time and on the video information, the foreground region information, and the target image region information at the second time, the movement amount by which the foreground image has moved from the second time to the first time; a foreground region probability map generation unit that calculates, based on the movement amount calculated by the movement amount calculation unit and the foreground image, the probability that each portion of the video at the first time is the foreground image; and a correction unit that extracts the foreground region information at the first time based on the probability calculated by the foreground region probability map generation unit and corrects the foreground image indicated by the extracted foreground region information.
  • Another aspect of the present invention is a video processing method in a video processing apparatus that extracts foreground region information indicating a foreground image from the video indicated by video information, the method including a foreground region correction step in which a foreground region correction unit corrects the foreground image indicated by the foreground region information at the first time using the foreground region information and the video information at the second time. In another aspect, the video indicated by the video information includes an image of a predetermined target object, and the foreground region correction step includes: a movement amount calculation step in which a movement amount calculation unit calculates, based on the video information and the target image region information indicating the image of the object at the first time and on the video information, the foreground region information, and the target image region information at the second time, the movement amount by which the foreground image has moved from the second time to the first time; a foreground region probability map generation step in which a foreground region probability map generation unit calculates, based on the calculated movement amount and the foreground image, the probability that each portion of the video at the first time is the foreground image; and a correction step in which a correction unit extracts the foreground region information at the first time based on the calculated probability and corrects the foreground image indicated by the extracted foreground region information.
  • Another aspect of the present invention is a video processing program that causes a computer of a video processing apparatus that extracts foreground region information indicating a foreground image from the video indicated by video information to execute a foreground region correction procedure of correcting the foreground image indicated by the foreground region information at the first time using the foreground region information and the video information at the second time. In another aspect, the video indicated by the video information includes an image of a predetermined target object, and the foreground region correction procedure includes: a movement amount calculation procedure of calculating, based on the video information and the target image region information indicating the image of the object at the first time and on the video information, the foreground region information, and the target image region information at the second time, the movement amount by which the foreground image has moved from the second time to the first time; a foreground region probability map generation procedure of calculating, based on the calculated movement amount and the foreground image, the probability that each portion of the video at the first time is the foreground image; and a correction procedure of extracting the foreground region information at the first time based on the calculated probability and correcting the foreground image indicated by the extracted foreground region information.
  • According to the present invention, the foreground image region or the background image region can be reliably extracted.
  • FIG. 1 is a block diagram showing a configuration of a video processing apparatus 1 according to an embodiment of the present invention.
  • The video processing apparatus 1 includes a video information acquisition unit 10, a depth information acquisition unit 11, a video information reproduction unit 12, an ROI (Region of Interest; target image region) acquisition unit 13, a video display unit 14, an object extraction unit 15, and a mask information recording unit 16.
  • the video information acquisition unit 10 acquires video information (t).
  • the video information (t) is video information of a moving image and is a function of time t (elapsed time from the start time of the moving image).
  • the video information of the present invention is not limited to this, and may be video information of a plurality of still images.
  • the video information (t) may be a moving image or a still image including images that are continuous or adjacent in time with the imaging device fixed, or images captured from consecutive or adjacent positions at the same time. (In the latter case, video information is a function of position).
  • The video information acquisition unit 10 may acquire the video information (t) from the imaging device, or may acquire video information (t) recorded in advance in a recording unit or an external recording device.
  • the video information acquisition unit 10 outputs the acquired video information (t) to the video information reproduction unit 12, the ROI acquisition unit 13, and the object extraction unit 15.
  • the depth information acquisition unit 11 acquires depth information (t) of the video information (t) acquired by the video information acquisition unit 10.
  • the depth information (t) is information representing the distance from the imaging device to the imaging object for each pixel of the video information (t).
  • the depth information acquisition unit 11 outputs the acquired depth information (t) to the object extraction unit 15.
  • the video information reproduction unit 12 generates a video signal for controlling the output of each pixel at each time t of the video display unit 14 based on the video information (t) input from the video information acquisition unit 10.
  • the video information reproduction unit 12 displays the video on the video display unit 14 by outputting the generated video signal to the video display unit 14. That is, the video information reproduction unit 12 reproduces the video of the video information (t) and causes the video display unit 14 to display the reproduced video.
  • Based on the mask information (t) recorded by the mask information recording unit 16, the video information reproduction unit 12 superimposes the image of the object extracted by the object extraction unit 15 on the video of the video information (t) and displays the result. That is, the mask information (t) is associated with the video information (t) at time t. When the mask information recording unit 16 does not record mask information (t), the video of the video information (t) is reproduced as it is.
  • the video display unit 14 is a touch panel type display.
  • the video display unit 14 displays the video of the video information (t) by controlling the output based on the video signal input from the video information reproducing unit 12.
  • the video display unit 14 converts information indicating the touched position into information indicating the position of the video information (ts) in the image (ts) at a certain time ts.
  • When the user touches the display to specify an ROI in the image (ts) being displayed, the video display unit 14 detects ROI position information (referred to as user-specified ROI information (ts)), that is, information indicating the position and shape (circumscribed shape) of the ROI. The details of the process by which the video display unit 14 detects the user-specified ROI information (ts) will be described later.
  • the video display unit 14 outputs the detected user-specified ROI information (ts) to the ROI acquisition unit 13.
  • Based on the image within the range of the user-specified ROI information (ts) input from the video display unit 14, the ROI acquisition unit 13 detects, in the image of the video information (t) at each time t other than ts (the image of each frame; hereinafter referred to as the processed image (t)), an image that matches or is similar to the image of the user-specified ROI information (ts). The ROI acquisition unit 13 then extracts information indicating the position and shape of the matching or similar image as ROI information (t).
  • the ROI acquisition unit 13 calculates and records feature points (referred to as ROI feature points (ts)) from an image within the range of the user-specified ROI information (ts).
  • A feature point is a characteristic point in an image, for example, a point extracted as a part or a vertex of a subject based on a change in color or luminance between pixels; however, feature points are not limited to this.
  • the ROI acquisition unit 13 calculates an image feature point (t) of the processed image (t) at each time t.
  • the ROI acquisition unit 13 performs matching between the image feature point (t) and the ROI feature point (ts).
  • The ROI acquisition unit 13 sequentially multiplies the ROI feature points (ts) by transformation matrices, thereby moving (including rotating) and enlarging or reducing them, and counts the number of feature points that match the image feature points (t) (referred to as the number of matching feature points).
  • When the ROI acquisition unit 13 determines that the number of matching feature points is equal to or greater than a predetermined threshold, it records the transformation matrix at that time.
  • the ROI acquisition unit 13 sets the position information obtained by multiplying the user-specified ROI information (ts) by the transformation matrix as ROI information (t). That is, the ROI acquisition unit 13 determines which part in the image (t) the image within the range of the user-specified ROI information (ts) matches.
  • the ROI acquisition unit 13 outputs the extracted ROI information (t) (including user-specified ROI information (ts)) to the video display unit 14 and the object extraction unit 15.
  • the ROI acquisition unit 13 stores the extracted ROI information (t) in the ROI information storage unit 1583.
  • The object extraction unit 15 receives the video information (t) from the video information acquisition unit 10, the depth information (t) from the depth information acquisition unit 11, and the ROI information (t) from the ROI acquisition unit 13.
  • the object extraction unit 15 generates mask information (t) at each time t using the input video information (t), depth information (t), and ROI information (t). Details of processing performed by the object extraction unit 15 will be described later.
  • the object extraction unit 15 records the extracted mask information (t) in the mask information recording unit 16.
  • FIG. 2 is a schematic diagram illustrating an example of detection processing for user-specified ROI information (ts) according to the present embodiment.
  • a square with a symbol A is a touch panel display (video display unit 14).
  • symbol O represents the image of the target object (a person in FIG. 2) which a user wants to extract.
  • symbol U represents a user's hand.
  • FIG. 2 is a diagram when a rectangular (quadrangle) selection tool is used, and shows that the user has surrounded the target image to be extracted with the rectangular selection tool.
  • the position information of the frame (the circumscribed rectangle of the object O) labeled with the symbol r1 is user-specified ROI information (ts).
  • the user-specified ROI information (ts) is recorded, for example, as data in Table 1 below.
  • The user-specified ROI information (ts) includes: the time ts (or the frame number of the video frame); a flag indicating whether an extraction target image (target image) exists within the circumscribed rectangle (referred to as the extraction target flag); the starting point position (x0, y0) of the circumscribed rectangle; the horizontal width of the circumscribed rectangle (the length indicated by W1 in FIG. 2); and the vertical width of the circumscribed rectangle (the length indicated by L1 in FIG. 2).
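As a concrete illustration, the record described above could be held in a structure like the following minimal sketch; the field names are hypothetical and simply mirror the items listed here (frame time or number, extraction target flag, circumscribed-rectangle start point, horizontal width W1, and vertical width L1).

```python
from dataclasses import dataclass

@dataclass
class UserSpecifiedROI:
    """Hypothetical container mirroring the user-specified ROI items listed above."""
    frame_time: float   # time ts (or the frame number of the paused frame)
    has_target: bool    # extraction target flag: is a target image inside the rectangle?
    x0: int             # x coordinate of the circumscribed rectangle's start point
    y0: int             # y coordinate of the circumscribed rectangle's start point
    width: int          # horizontal width W1 of the circumscribed rectangle
    height: int         # vertical width L1 of the circumscribed rectangle

# Example: a 120 x 200 rectangle drawn at (45, 30) in the frame paused at ts = 3.2 s.
roi = UserSpecifiedROI(frame_time=3.2, has_target=True, x0=45, y0=30, width=120, height=200)
```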
  • FIG. 3 is a flowchart showing an example of the operation of the video processing apparatus 1 according to the present embodiment.
  • Step S11 The video information acquisition unit 10 acquires video information (t), and outputs the acquired video information (t) to the video information playback unit 12, the ROI acquisition unit 13, and the object extraction unit 15.
  • the depth information acquisition unit 11 acquires depth information (t), and outputs the acquired depth information (t) to the object extraction unit 15. Thereafter, the process proceeds to step S12.
  • Step S12 The video information reproducing unit 12 reproduces the video of the video information (t) input in step S11, and causes the video display unit 14 to display the reproduced video. Thereafter, the process proceeds to step S13.
  • Step S13 The user pauses the reproduction of the video reproduced in step S12 at a certain time ts and designates the ROI.
  • the video display unit 14 detects user-specified ROI information (ts) for the ROI specified by the user, and outputs it to the ROI acquisition unit 13. Then, it progresses to step S14.
  • Step S14 The ROI acquisition unit 13 extracts the ROI information (t) at each time t based on the user-specified ROI information (ts) detected in Step S13.
  • the ROI acquisition unit 13 outputs the extracted ROI information (t) to the video display unit 14 and the object extraction unit 15.
  • The video display unit 14 displays the circumscribed shape indicated by the ROI information (t) (the circumscribed rectangle in the case of Table 1, or the circumscribed circle in the case of Table 2) superimposed at that position on the image of the video information (t). The process then proceeds to step S2.
  • Step S2 The object extraction unit 15 performs object extraction using the video information (t) and depth information (t) acquired in step S11 and the ROI information (t) extracted in step S14, and generates mask information (t).
  • the object extraction unit 15 records the generated mask information (t) in the mask information recording unit 16. Thereafter, the process proceeds to step S15.
  • Step S15 Based on the mask information (t) recorded by the mask information recording unit 16, the video information reproduction unit 12 superimposes the object image extracted by the object extraction unit 15 on the video of the video information (t) and displays the result.
  • Note that the video processing apparatus 1 may perform the above processing on all of the input video information (t), or only on the video information (t) in a range specified by the user (t1 ≤ t ≤ t2).
  • As described above, because the video processing apparatus 1 detects the user-specified ROI information (ts) for the ROI specified by the user, the ROI can be extracted more reliably than when the ROI is extracted automatically.
  • In addition, because the video display unit 14 displays the circumscribed shape indicated by the ROI information (t) superimposed on the video of the video information (t), the user can confirm that the desired ROI has been detected.
  • FIG. 4 is a schematic block diagram illustrating the configuration of the object extraction unit 15 according to the present embodiment.
  • The object extraction unit 15 includes filter units 151a and 151b, a distribution model estimation unit 152, a clustering unit 153, a feature amount calculation unit 154, a foreground region extraction unit 155, a foreground region correction unit 156, a mask information generation unit 157, and a buffer unit 158.
  • the buffer unit 158 includes a video information storage unit 1581, a foreground region information storage unit 1582, an ROI information storage unit 1583, an ROI depth distribution information storage unit 1584, and a corrected foreground region information storage unit 1585.
  • parallelograms denoted by reference signs I1, D1, R1, and M indicate information, and are video information (t), depth information (t), ROI information (t), and mask information (t), respectively.
  • The filter unit 151a removes noise from the input video information (t) and performs smoothing. Specifically, the filter unit 151a applies, to the processed image (t) at each time t, a smoothing filter that preserves edges (contours) for each color component (hereinafter also referred to as an edge-preserving smoothing filter).
  • For example, the filter unit 151a uses a bilateral filter, expressed by equation (1), as the smoothing filter. In equation (1), the input image is f(x, y), the output image is g(x, y), W is the window size to which the filter is applied, σ1 is a parameter (the standard deviation of a Gaussian distribution) controlling the weighting coefficient related to the inter-pixel distance, and σ2 is a parameter (the standard deviation of a Gaussian distribution) controlling the weighting coefficient related to the difference in pixel values.
  • The filter unit 151a outputs the video information (t) smoothed by the smoothing filter to the clustering unit 153 and the feature amount calculation unit 154.
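The edge-preserving smoothing of equation (1) can be sketched as a per-channel bilateral filter in NumPy; this is an illustrative implementation under the definitions above (window W, spatial parameter σ1, range parameter σ2), not the patent's exact formulation.

```python
import numpy as np

def bilateral_filter(f, window=5, sigma1=3.0, sigma2=25.0):
    """Per-channel bilateral filter following the structure of equation (1).

    f: float image of shape (H, W) or (H, W, C); window: odd window size W;
    sigma1: std. dev. of the spatial (inter-pixel distance) Gaussian weight;
    sigma2: std. dev. of the range (pixel-value difference) Gaussian weight.
    """
    if f.ndim == 2:
        f = f[:, :, None]
    half = window // 2
    pad = np.pad(f, ((half, half), (half, half), (0, 0)), mode="edge")
    out = np.zeros_like(f, dtype=np.float64)
    norm = np.zeros_like(f, dtype=np.float64)
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            spatial_w = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma1 ** 2))
            shifted = pad[half + dy: half + dy + f.shape[0],
                          half + dx: half + dx + f.shape[1], :]
            range_w = np.exp(-((shifted - f) ** 2) / (2.0 * sigma2 ** 2))
            w = spatial_w * range_w
            out += w * shifted        # weighted sum of neighboring pixel values
            norm += w                 # normalization term of equation (1)
    return (out / norm).squeeze()
```

OpenCV's cv2.bilateralFilter offers an optimized equivalent when a library implementation is preferred.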
  • The filter unit 151b removes noise from the input depth information (t) and performs smoothing. Specifically, the filter unit 151b applies an edge-preserving smoothing filter, thereby removing the horizontal noise caused by occlusion (shielding).
  • The filter unit 151b outputs the depth information (t) smoothed by the smoothing filter to the feature amount calculation unit 154 and the distribution model estimation unit 152.
  • The distribution model estimation unit 152 estimates the parameters of a depth distribution model within the ROI. Specifically, the distribution model estimation unit 152 obtains the parameters of the distribution model by maximum likelihood estimation using equation (2), that is, a GMM (Gaussian Mixture Model) expressed as a mixture of Gaussian distributions.
  • The acquired parameters are referred to as ROI depth distribution information (t).
  • In equation (2), P(x) represents the probability that the vector x appears, w_i represents the weighting coefficient of the Gaussian distribution of class i, μ_i represents the mean vector of class i, Σ_i represents the covariance matrix of class i, and D represents the number of dimensions of the vector x. The Gaussian distribution of class i is expressed using the mean vector μ_i and the covariance matrix Σ_i.
  • the distribution model estimation unit 152 obtains each parameter of the distribution model using an EM (Expectation-Maximization) algorithm.
  • The distribution model estimation unit 152 determines the depth distribution of the extraction target region in the ROI to be the distribution of the class having the maximum weighting coefficient w_i. That is, the depth distribution is determined under the assumption that the extraction target occupies a large area within the ROI.
  • the distribution model estimation unit 152 outputs the estimated ROI depth distribution information (t) to the foreground region extraction unit 155 and the buffer unit 158.
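A minimal sketch of this estimation step using scikit-learn's GaussianMixture, which fits a GMM by the EM algorithm; the number of mixture components is an assumption, and selecting the component with the largest weight w_i follows the description above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_roi_depth_distribution(depth, roi_mask, n_components=3):
    """Fit a GMM to the depth values inside the ROI and return the parameters of the
    class with the largest weight as the extraction target's depth distribution."""
    samples = depth[roi_mask].reshape(-1, 1).astype(np.float64)   # D = 1 (depth only)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(samples)
    best = int(np.argmax(gmm.weights_))        # class i with the maximum weight w_i
    return {
        "weight": gmm.weights_[best],          # w_i
        "mean": gmm.means_[best].ravel(),      # mu_i
        "covariance": gmm.covariances_[best],  # Sigma_i
    }

# Usage: depth is an (H, W) float array, roi_mask an (H, W) boolean mask of the ROI.
```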
  • The clustering unit 153 performs clustering on the smoothed video information (t) input from the filter unit 151a for each processed image (t), thereby dividing the processed image (t) into a plurality of regions (also called superpixels). For example, the clustering unit 153 performs clustering in a feature amount space.
  • Clustering in a feature amount space means that each pixel in the image space is mapped to a feature amount space (for example, color, edge, motion vector) and clustered in that space by a technique such as the K-means method, the Mean-Shift method, or a K-nearest-neighbor search (approximate K-nearest-neighbor search) method.
  • the clustering unit 153 divides the processed image (t) into a set (region; class) of pixels having similar feature quantities (feature quantity values are within a predetermined range).
  • the clustering unit 153 replaces the pixel value in the original image space for the pixels in the class with the pixel value (for example, the average value) that is the representative value of each region after the clustering process in the feature amount space is completed.
  • the clustering unit 153 assigns a label for identifying the region to each pixel in each region, and outputs region information (t). Details of the clustering unit 153 will be described below.
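As an illustration of clustering in a feature amount space, the sketch below maps each pixel to a simple (x, y, R, G, B) feature vector, clusters with K-means, and replaces each pixel with the representative (mean) color of its region; the particular feature vector, spatial weighting, and cluster count are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_in_feature_space(image, n_clusters=64, spatial_weight=0.5):
    """Divide an (H, W, 3) image into regions by K-means over (x, y, color) features."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    colors = image.reshape(-1, 3).astype(np.float64)
    features = np.column_stack([xs.ravel() * spatial_weight,   # weighted pixel coordinates
                                ys.ravel() * spatial_weight,
                                colors])
    labels = KMeans(n_clusters=n_clusters, n_init=4).fit_predict(features)
    # Replace each pixel with the representative (mean) color of its region.
    out = np.zeros_like(colors)
    for k in range(n_clusters):
        members = labels == k
        if members.any():
            out[members] = colors[members].mean(axis=0)
    return labels.reshape(h, w), out.reshape(h, w, 3)
```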
  • FIG. 5 is a schematic block diagram illustrating the configuration of the clustering unit 153 according to the present embodiment.
  • the clustering unit 153 includes a feature amount detection unit 1531, a seed generation unit 1532, a region growth unit 1533, and a region integration unit 1534.
  • the clustering unit 153 performs clustering using feature values based on edges and colors, but the present invention is not limited to this feature value.
  • the feature amount detection unit 1531 receives the smoothed video information (t).
  • the feature amount detection unit 1531 calculates the feature amount of the pixel (x, y) (x and y are coordinates representing the position of the pixel in the image) in each processed image (t).
  • The feature amount detection unit 1531 applies a differential operator to each color component (for example, RGB (Red, Green, Blue)) in the x direction and the y direction, and calculates the edge strength E1(x, y | t).
  • The feature amount detection unit 1531 then calculates the edge strength E2(x, y | t) from E1(x, y | t) using equation (3), where TH_E(x, y | t) is a predetermined threshold for the coordinates (x, y) at time t. Equation (3) sets E2(x, y | t) to 0 when E1(x, y | t) is below TH_E(x, y | t), and to E1(x, y | t) otherwise.
  • The feature amount detection unit 1531 may also adjust the threshold TH_E(x, y | t).
  • The seed generation unit 1532 generates seed information for generating superpixels using the edge strength E2(x, y | t). Specifically, the seed generation unit 1532 sets the seed information S(x, y | t) to “1” at pixels where the edge strength E2(x, y | t) takes the minimum value within a window, and sets S(x, y | t) to “0” otherwise. Here, W1 represents the size of the window in the x direction and W2 represents the size of the window in the y direction.
  • The seed generation unit 1532 outputs the generated seed information S(x, y | t) to the region growing unit 1533.
  • The region growing unit 1533 applies a region growing method based on the seed information S(x, y | t), starting the processing from pixels whose seed information S(x, y | t) is “1”.
  • The region growing unit 1533 sets the grown regions as the superpixel group R1(t) and outputs information indicating the superpixel group R1(t) to the region integration unit 1534.
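A simplified sketch of the seed generation and region growing described above: pixels whose edge strength is the minimum within a local window become seeds, and labels are grown outward from the seeds. The window size and the 4-connected breadth-first growth rule are assumptions for the example.

```python
import numpy as np
from collections import deque

def generate_seeds(edge_strength, win=7):
    """S(x, y) = 1 where the edge strength is the minimum within a win x win window."""
    h, w = edge_strength.shape
    half = win // 2
    pad = np.pad(edge_strength, half, mode="edge")
    seeds = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            seeds[y, x] = edge_strength[y, x] <= pad[y:y + win, x:x + win].min()
    return seeds

def grow_regions(edge_strength, seeds):
    """Grow a label outward from every seed pixel (4-connected breadth-first flooding)."""
    h, w = edge_strength.shape
    labels = -np.ones((h, w), dtype=int)
    queue = deque()
    for label, (y, x) in enumerate(zip(*np.nonzero(seeds))):
        labels[y, x] = label
        queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] < 0:
                labels[ny, nx] = labels[y, x]
                queue.append((ny, nx))
    return labels   # superpixel group R1(t): one label per pixel
```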
  • The region integration unit 1534 performs region integration processing that merges superpixels with a small area in the superpixel group R1(t), indicated by the information input from the region growing unit 1533, into other regions. Specifically, the region integration unit 1534 takes one point of each superpixel in R1(t) as a vertex and computes a weighted undirected graph represented by the connection relationships between the vertices and the edges (weights) between connected vertices. Here, for an edge (weight) between vertices, for example, the distance in color space between the representative colors of the corresponding superpixels is used.
  • The region integration unit 1534 performs region integration on the weighted undirected graph so as to form a minimum spanning tree (MST) using a greedy method, and generates the superpixel group R2(t). For each superpixel, the region integration unit 1534 assigns a label identifying the superpixel to all the pixels in the superpixel, and outputs the labeling result as region information (t).
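The MST-based integration can be sketched as follows: treat each superpixel as a vertex, weight edges between adjacent superpixels by the color-space distance between their mean colors, build a minimum spanning tree with SciPy, and greedily merge across the cheapest tree edges while a small region is involved. The minimum-area threshold and the use of mean colors as representative colors are assumptions.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def integrate_small_regions(labels, image, min_area=50):
    """Merge small superpixels into neighbors along cheap (similar-color) MST edges."""
    n = labels.max() + 1
    flat_lab, flat_img = labels.ravel(), image.reshape(-1, 3).astype(np.float64)
    areas = np.bincount(flat_lab, minlength=n)
    colors = np.stack([np.bincount(flat_lab, weights=flat_img[:, c], minlength=n)
                       for c in range(3)], axis=1) / np.maximum(areas, 1)[:, None]
    # Undirected adjacency from horizontally / vertically neighboring pixels.
    pairs = set()
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        diff = a != b
        pairs.update(map(tuple, np.sort(np.stack([a[diff], b[diff]], axis=1), axis=1)))
    rows, cols = zip(*pairs)
    weights = np.linalg.norm(colors[list(rows)] - colors[list(cols)], axis=1) + 1e-9
    mst = minimum_spanning_tree(coo_matrix((weights, (rows, cols)), shape=(n, n)))
    # Greedily merge across MST edges, cheapest first, while a small region remains.
    parent = np.arange(n)
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for w, i, j in sorted(zip(mst.data, *mst.nonzero())):
        ri, rj = find(i), find(j)
        if ri != rj and (areas[ri] < min_area or areas[rj] < min_area):
            parent[rj] = ri
            areas[ri] += areas[rj]
    merged = np.array([find(i) for i in range(n)])
    return merged[labels]   # superpixel group R2(t)
```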
  • FIG. 6 is a schematic diagram illustrating an example of a processing result of the clustering unit 153 according to the present embodiment.
  • This figure is an image of a bear doll riding a train toy.
  • The bear doll also moves as the train toy moves along the tracks.
  • FIG. 6A (reference numeral 6A) is an image representing the edge strength E2 obtained from the input image.
  • FIG. 6B (reference numeral 6B) is an image representing the seed information S obtained from the edge strength E2.
  • FIG. 6C (reference numeral 6C) is an image showing the superpixel group R1(t).
  • FIG. 6D (reference numeral 6D) is an image showing the superpixel group R2(t).
  • In FIG. 6A, bright (white) portions represent regions of high edge strength and dark (black) portions represent regions of low edge strength. In FIG. 6B, bright (white) portions represent seeds, and dark (black) portions represent pixels whose region (class) is determined by the region growing method.
  • Comparing FIG. 6C and FIG. 6D, the superpixel group R1(t) in FIG. 6C contains many superpixels (regions) with a small area, whereas in the superpixel group R2(t) in FIG. 6D the number of small-area superpixels (regions) is reduced. As described above, the video processing apparatus 1 can obtain a more accurate clustering result by performing the clustering process with fewer small-area superpixels (regions).
  • The feature amount calculation unit 154 receives the region information (t) from the clustering unit 153, the smoothed video information (t) from the filter unit 151a, the smoothed depth information (t) from the filter unit 151b, and the ROI information (t).
  • the feature amount calculation unit 154 calculates a feature amount for each region (label) based on the input region information (t), video information (t), depth information (t), and ROI information (t). Specifically, the feature amount calculation unit 154 calculates the following feature amounts (1) to (7). Thereafter, feature amount information (t) indicating the calculated feature amount is output to the foreground region extraction unit 155.
  • FIG. 7 is a flowchart illustrating an example of the operation of the feature amount calculation unit 154.
  • FIG. 8 is a diagram illustrating an example of labeling (region information) in an 8 ⁇ 8 pixel block for explanation.
  • FIG. 9 is a diagram illustrating an example of a connection relation between regions in FIG. 8 expressed by an unweighted undirected graph and an adjacency matrix.
  • FIG. 10 is a diagram illustrating an example of the method of obtaining the perimeter of the region and the circumscribed rectangle of the region, taking the label 3 in FIG. 8A as an example.
  • Step S154-01 The feature amount calculation unit 154 receives the region information (t) from the clustering unit 153, the smoothed video information (t) from the filter unit 151a, the smoothed depth information (t) from the filter unit 151b, and the ROI information (t). Thereafter, the process proceeds to step S154-02.
  • Step S154-02 The feature amount calculation unit 154 sums the pixel coordinate values for the pixels in the ROI region based on the ROI information (t). Subsequently, the feature amount calculation unit 154 divides the total value by the number of pixels in the ROI area, and uses the calculation result as the center of gravity of the ROI area. Thereafter, the process proceeds to step S154-03.
  • Step S154-03 Based on the region information (t), the feature amount calculation unit 154 scans the processing target image line by line from the origin (raster scan) and obtains, for each label, the coordinates and number of all pixels belonging to the label, the position (start point) where a pixel belonging to the label first appears, and the adjacency relationships between labels (regions). The position (start point) where a pixel belonging to each label first appears is stored as the starting point of the contour tracking used to obtain the region perimeter in step S154-08. Thereafter, the process proceeds to step S154-04.
  • an example of a method for obtaining the detection result of the start position of the label and the adjacent relationship between the labels will be described with reference to FIGS.
  • the label 1 is adjacent to the labels 3 and 4.
  • the adjacency relationship between the labels regarding the region information in FIG. 8A can be finally expressed as an unweighted undirected graph shown in FIG.
  • In FIG. 9B, each node number corresponds to a label number, and an edge between nodes represents a connection relationship.
  • the structure of the graph of FIG. 9B is expressed by the adjacency matrix shown in FIG.
  • “1” is assigned as a value when there is an edge between nodes
  • “0” is assigned as a value when there is no edge between nodes.
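A small sketch of building such an adjacency matrix from a label map, assuming 4-connectivity (horizontally and vertically neighboring pixels); the tiny label map used in the example only imitates the style of FIG. 8 and is not the figure's actual data.

```python
import numpy as np

def label_adjacency_matrix(labels, n_labels):
    """Return an n x n 0/1 matrix whose entry (i, j) is 1 when labels i and j are adjacent."""
    adj = np.zeros((n_labels, n_labels), dtype=np.uint8)
    horiz = (labels[:, :-1], labels[:, 1:])   # horizontally neighboring pixel pairs
    vert = (labels[:-1, :], labels[1:, :])    # vertically neighboring pixel pairs
    for a, b in (horiz, vert):
        diff = a != b
        adj[a[diff], b[diff]] = 1
        adj[b[diff], a[diff]] = 1             # unweighted undirected graph: symmetric
    return adj

# Example with a small hypothetical label map (labels 1 to 4; row/column 0 unused):
lab = np.array([[1, 1, 3],
                [1, 4, 3],
                [4, 4, 3]])
print(label_adjacency_matrix(lab, n_labels=5))
```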
  • Step S154-04 The feature amount calculation unit 154 sums the coordinate values of the pixels belonging to each label Li (0 ≤ i < MaxLabel).
  • MaxLabel represents the total number of labels for identifying superpixels (regions) acquired from the region information (t).
  • the feature amount calculation unit 154 divides the total value by the number of pixels belonging to the label Li, and uses the result as the center of gravity of the label Li. Then, the distance (center of gravity distance) between the center of gravity of the ROI area obtained in step S154-02 and the center of gravity of the label Li is calculated. Thereafter, the process proceeds to step S154-05.
  • Step S154-05 From the smoothed video information (t), the feature amount calculation unit 154 calculates the average value, median value, variance, and standard deviation of each color component of each label Li (0 ≤ i < MaxLabel). Thereafter, the process proceeds to step S154-06.
  • Step S154-06 From the smoothed depth information (t), the feature amount calculation unit 154 calculates the average value, median value, variance, and standard deviation of the depth of each label Li (0 ≤ i < MaxLabel). Thereafter, the process proceeds to step S154-07.
  • Step S154-07 The feature amount calculation unit 154 sets the total number of pixels belonging to the label Li as a region area. Thereafter, the process proceeds to step S154-08.
  • Step S154-08 The feature amount calculation unit 154 calculates the area perimeter of the label Li. Thereafter, the process proceeds to step S154-09.
  • The region perimeter is the length of the path that traces the contour of the region clockwise (or counterclockwise) from the start point of the label in FIG. 10A. For 8-connected components, if C1 is the number of moves tracing up, down, left, or right and C2 is the number of moves tracing diagonally, the region perimeter (Perimeter) is calculated from C1 and C2 by equation (5) or equation (6).
  • Step S154-09 The feature amount calculation unit 154 calculates the minimum rectangle (circumscribed rectangle) circumscribing the area indicated by the label Li. Thereafter, the process proceeds to step S154-10.
  • A method of acquiring the circumscribed rectangle will be described with reference to FIG. 10.
  • In equation (7), the symbol Li represents a label number, the symbol R_Li represents the set of pixels belonging to the label Li, and the symbols x_j and y_j represent the x coordinate and y coordinate of a pixel j belonging to the set R_Li, respectively.
  • Step S154-10 When the calculation of the feature amounts of all the labels is completed (Yes in step S154-10), the feature amount calculation unit 154 outputs the feature amount information (t), which includes the feature amounts of each label (region), to the foreground region extraction unit 155. If there is an unprocessed label whose feature amounts have not been calculated (No in step S154-10), the process returns to step S154-04 to calculate the feature amounts of the next label.
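A sketch of the per-label feature amounts computed in steps S154-04 to S154-09 (center of gravity, color and depth statistics, region area, perimeter, and circumscribed rectangle). The perimeter here is approximated by counting boundary pixels instead of the contour-tracking count of equations (5) and (6), which is a simplification.

```python
import numpy as np

def region_features(labels, image, depth, label_id):
    """Compute several of the feature amounts described above for one label."""
    mask = labels == label_id
    ys, xs = np.nonzero(mask)
    centroid = (xs.mean(), ys.mean())            # center of gravity of the label
    area = int(mask.sum())                       # region area (number of pixels)
    # Circumscribed rectangle: start point plus horizontal and vertical width.
    bbox = (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))
    # Approximate perimeter: pixels of the region with at least one outside 4-neighbor.
    pad = np.pad(mask, 1, mode="constant")
    inner = pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:]
    perimeter = int((mask & ~inner).sum())
    color_stats = [(image[..., c][mask].mean(), np.median(image[..., c][mask]),
                    image[..., c][mask].var(), image[..., c][mask].std())
                   for c in range(image.shape[-1])]
    depth_stats = (depth[mask].mean(), np.median(depth[mask]),
                   depth[mask].var(), depth[mask].std())
    return dict(centroid=centroid, area=area, bbox=bbox, perimeter=perimeter,
                color=color_stats, depth=depth_stats)
```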
  • the feature quantities (1) to (7) can be calculated.
  • the operation of the feature amount calculation unit 154 has been described in the order of steps S154-01 to S154-10.
  • the present invention is not limited to this, and can be changed within a range in which the present invention can be implemented.
  • an adjacency matrix is used as an example of a data structure representing a connection relationship between regions, but the present invention is not limited to this, and an adjacency list may be used.
  • In the present embodiment, the feature amount calculation unit 154 calculates the feature amounts using RGB as the color space of the image; however, another color space such as YCbCr (YUV), CIE L*a*b*, or CIE L*u*v* may be used.
  • The foreground region extraction unit 155 receives the feature amount information (t) from the feature amount calculation unit 154, and the ROI depth distribution information (t) and the ROI information (t) from the distribution model estimation unit 152. Furthermore, the foreground region extraction unit 155 reads the video information (t) from the video information storage unit 1581 of the buffer unit 158.
  • The read video information (t) is the information stored by the video information acquisition unit 10 and has not been smoothed; however, the present invention is not limited to this, and the smoothed video information (t) may be used.
  • The foreground region extraction unit 155 extracts the foreground image region (t) to be extracted from the video information (t) based on the ROI information (t), the feature amount information (t), and the ROI depth distribution information (t).
  • the foreground area extraction unit 155 stores the foreground area information (t) indicating the extracted foreground image area (t) in the foreground area information storage unit 1582.
  • FIG. 11 is a flowchart showing an example of the operation of the foreground area extraction unit 155 according to the present embodiment.
  • the foreground area extraction unit 155 sets parameters for search conditions for searching for a basic foreground area, which is a core area for extracting the foreground area. Specifically, the foreground region extraction unit 155 sets predetermined values as the lower limit value and the upper limit value of each feature amount information (t).
  • For example, the foreground region extraction unit 155 sets as parameters the lower limit value (referred to as the minimum area) and upper limit value (referred to as the maximum area) of the region area, the lower limit value (referred to as the minimum perimeter) and maximum value (referred to as the maximum perimeter) of the region perimeter, and, as the initial value of the upper limit of the center-of-gravity distance, the maximum value (maximum distance) of the radius of the circle inscribed in the circumscribed rectangle of the ROI region.
  • By setting the search conditions for the basic foreground region in this way, a region that has its center of gravity within the ROI and has a large area can be detected. In addition, a region belonging to the background region can be prevented from being erroneously detected as the basic foreground region.
  • Step S155-02 The foreground region extraction unit 155 selects, from among the regions whose center-of-gravity distance lies between the lower limit value and the upper limit value, the region with the smallest center-of-gravity distance. Thereafter, the process proceeds to step S155-03.
  • Step S155-03 The foreground region extraction unit 155 determines whether or not the feature amount information (t) of the region selected in step S155-02 lies between the lower limit value and the upper limit value set in step S155-01. When it is determined that the feature amount information (t) lies between the lower limit value and the upper limit value (Yes), that is, when the region selected in step S155-02 is determined to be the basic foreground region, the process proceeds to step S155-05. On the other hand, when it is determined that the feature amount information (t) does not lie between the lower limit value and the upper limit value (No), the process proceeds to step S155-04.
  • Step S155-04 The foreground region extraction unit 155 updates the lower limit value and upper limit value of each piece of feature amount information (t) by subtracting a predetermined value from the lower limit value or adding a predetermined value to the upper limit value. Thereafter, the process returns to step S155-02.
  • The foreground region extraction unit 155 compares the average value (or median value) of the depth in the feature amount information (t) of the basic foreground region determined in step S155-03 with that of each region whose center of gravity lies within the ROI, and determines whether or not the difference between the pieces of feature amount information (t) is within (or less than) a predetermined threshold.
  • the foreground area extraction unit 155 integrates the area for which the difference in the feature amount information (t) is determined to be within a predetermined threshold and the basic foreground area, and determines the integrated area as the foreground area.
  • the foreground region extraction unit 155 determines a threshold value of the difference between the feature amount information (t) based on the ROI depth distribution information (t) acquired from the distribution model estimation unit 152. Specifically, the foreground region extraction unit 155 calculates a threshold value (TH_D1) of the difference between the feature amount information (t) using the following equation (8).
  • In equation (8), α_1 is a predetermined scaling constant.
  • σ_1 is the standard deviation when the depth distribution of the foreground region is assumed to be a Gaussian distribution.
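Equation (8) itself is not reproduced in this text, so the sketch below assumes it has the form TH_D1 = α_1 · σ_1 (the scaling constant times the foreground depth standard deviation) and merges every candidate region whose mean depth lies within TH_D1 of the basic foreground region's mean depth; both the threshold form and the function interface are assumptions.

```python
def integrate_by_depth(basic_label, candidate_labels, depth_mean, alpha_1, sigma_1):
    """Return the labels to be merged with the basic foreground region.

    depth_mean: dict mapping a region label to its average depth;
    alpha_1: predetermined scaling constant; sigma_1: std. dev. of the foreground depth.
    The threshold form TH_D1 = alpha_1 * sigma_1 is an assumption about equation (8).
    """
    th_d1 = alpha_1 * sigma_1
    base = depth_mean[basic_label]
    merged = [basic_label]
    for lab in candidate_labels:                   # regions with centroid inside the ROI
        if abs(depth_mean[lab] - base) <= th_d1:   # depth difference within the threshold
            merged.append(lab)
    return merged
```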
  • The foreground region correction unit 156 stores information indicating the corrected foreground image region (t0) (corrected foreground image region information) in the corrected foreground region information storage unit 1585 of the buffer unit 158.
  • The foreground region correction unit 156 also outputs the information indicating the corrected foreground image region (t0) (corrected foreground image region information) to the mask information generation unit 157.
  • Hereinafter, the foreground region correction unit 156 will be described in detail.
  • FIG. 12 is a schematic block diagram showing the configuration of the foreground area correction unit 156 according to the present embodiment.
  • the foreground region correction unit 156 includes a movement amount calculation unit 1561, a foreground region probability map generation unit 1562, a foreground region determination unit 1563, and a boundary region correction unit 1564.
  • The movement amount calculation unit 1561 calculates the movement amount (t0, t0-k) (also referred to as a motion vector), obtained by subtracting the position of the foreground image region in the processed image (t0-k) from its position in the processed image (t0). That is, the movement amount (t0, t0-k) represents how far the foreground image region (t0-k) has moved from time t0-k to time t0.
  • the movement amount calculation unit 1561 calculates the movement amount (t0, t0-k) by performing template matching (also referred to as motion search) processing shown in FIG.
  • FIG. 13 is an explanatory diagram for explaining template matching according to the present embodiment.
  • the horizontal axis is time t
  • the vertical axis is the y coordinate
  • the direction perpendicular to the horizontal axis and the vertical axis is the x coordinate.
  • the image with the symbol Ik represents the processed image at time (t0-k).
  • An image region denoted by reference symbol Ok represents a foreground image region at time (t0-k) and a circumscribed rectangle surrounding the foreground image region (object).
  • the coordinates with the reference symbol Ak represent the coordinates of the starting point position of the circumscribed rectangle surrounding the foreground area indicated by the reference symbol Ok.
  • the image with the symbol Mk is an image indicated by the mask information (t0-k) in the circumscribed rectangle at time (t0-k).
  • the mask information (t0-k) is information for identifying the foreground image region (white portion in FIG. 13) and the background image region (black portion in FIG. 13) within the circumscribed rectangle.
  • The region indicated by the mask information (t0-k) is set as the foreground image region (t0-k), and the other region within the circumscribed rectangle is set as the background image region (t0-k).
  • a symbol Vk is a vector from the coordinate Ak to the coordinate A0. This vector represents the amount of movement (t0, t0-k) of the foreground image area (t0-k).
  • The movement amount calculation unit 1561 uses the foreground image region Ok as a template, moves the template over the processed image (t0) (rotation and enlargement/reduction may also be applied), and detects the region with the highest similarity to the template (referred to as the estimated region). The movement amount calculation unit 1561 calculates the difference in coordinates between the detected estimated region and the foreground image region Ok as the movement amount (t0, t0-k).
  • the movement amount calculation unit 1561 calculates coordinates (x0, y0) (referred to as initial search coordinates) of the center of gravity of the ROI area indicated by the ROI information (t0).
  • the movement amount calculation unit 1561 performs a spiral search around the initial search coordinates (x0, y0) to detect the estimated region and calculates the movement amount (t0, t0-k).
  • The spiral search is a technique that searches for the estimated region by moving the candidate coordinates in a spiral order, as shown in the figure, starting from coordinates where the foreground image region (t0) is likely to be located (in this case, the initial search coordinates) and gradually expanding the search range.
  • When the movement amount calculation unit 1561 finds a candidate whose similarity is higher than a predetermined value, the spiral search may end there. This allows the movement amount calculation unit 1561 to reduce the amount of computation.
  • the movement amount calculation unit 1561 calculates the similarity R SAD using the following formula (9) using the coordinates selected in the spiral order (referred to as selected coordinates) as the center of gravity, and determines the region having the smallest value as the estimation region.
  • In equation (9), M × N (W1 × L1 in the example of FIG. 2) represents the size of the template, (i, j) represents the coordinates of a pixel in the template, and the corresponding term represents the pixel value of the template at coordinates (i, j) at time (t0-k). (dx, dy) is the value (offset value) obtained by subtracting the center of gravity of the ROI region indicated by the ROI information (t0-k) from the selected coordinates, and I(i + dx, j + dy | t0) represents the pixel value at coordinates (i + dx, j + dy). Equation (9) indicates that the absolute value of the difference is computed as the Manhattan distance (L1-distance, L1-norm) and summed over i and j.
  • Here, the pixel values in equation (9) are the values of each color component in the RGB space, and the similarity over the color components is expressed by the following equation (10).
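The SAD similarity of equation (9) and the spiral search can be sketched as follows. The square-ring visiting order and the optional early-exit threshold are assumptions; the per-channel absolute difference over RGB follows the description above.

```python
import numpy as np

def sad(template, image, top, left):
    """R_SAD of equation (9): sum of absolute RGB differences between the template and
    the image patch whose top-left corner is at (top, left)."""
    h, w = template.shape[:2]
    if top < 0 or left < 0 or top + h > image.shape[0] or left + w > image.shape[1]:
        return np.inf                              # template falls outside the image
    patch = image[top:top + h, left:left + w]
    return np.abs(patch.astype(np.int64) - template.astype(np.int64)).sum()

def spiral_offsets(max_radius):
    """Candidate offsets around (0, 0), visited ring by ring outward."""
    yield (0, 0)
    for r in range(1, max_radius + 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dx), abs(dy)) == r:     # only the outer ring of radius r
                    yield (dy, dx)

def spiral_search(template, image, center, max_radius=32, early_exit=None):
    """Find the patch most similar to the template, spiraling out from `center`."""
    best_score, best_pos = np.inf, None
    h, w = template.shape[:2]
    for dy, dx in spiral_offsets(max_radius):
        top, left = center[0] + dy - h // 2, center[1] + dx - w // 2
        score = sad(template, image, top, left)
        if score < best_score:
            best_score, best_pos = score, (top, left)
        if early_exit is not None and best_score <= early_exit:
            break                                  # similarity good enough: stop searching
    return best_pos, best_score
```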
  • the movement amount calculation unit 1561 outputs the calculated movement amount (t0, t0-k) to the foreground region probability map generation unit 1562.
  • In equation (11), w_k represents a weighting coefficient, and dx_k and dy_k represent the x component and the y component of the movement amount (t0, t0-k), respectively.
  • The foreground region information M(x, y | t0-k) is “1” when the pixel at the coordinates (x, y) of the processed image (t0-k) belongs to the foreground image region (t0-k), and “0” when it does not (that is, when it belongs to the background image region).
  • The weighting coefficient w_k may be set according to the time distance from t0, for example, as shown in equation (12). That is, the weighting coefficient is set to a small value for foreground region information at times far from time t0.
  • The set of probabilities P(x, y | t0) for all coordinates (x, y) of the processed image (t0) is referred to as the foreground region probability map P(t0), which is expressed by equation (13). Here, W represents the number of pixels in the horizontal direction of the processed image (t0), and H represents the number of pixels in the vertical direction of the processed image (t0).
  • The foreground region probability map generation unit 1562 applies equation (14) to the calculated foreground region probability map P(t0) and sets the foreground region information M(x, y | t0) of the pixel at the coordinates (x, y) of the processed image (t0) to “1” (foreground image region) or “0” (background image region) according to the probability P(x, y | t0).
  • The probability P(x, y | t0) takes a value from 0 to 1 and is expressed, for example, by equation (15).
  • The foreground region probability map generation unit 1562 outputs the calculated foreground region information M(x, y | t0).
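A sketch of the probability-map computation described by equations (11) to (14): each past binary mask M(· | t0-k) is shifted by its movement amount (dx_k, dy_k), the shifted masks are accumulated with weights w_k that decrease with temporal distance, and the result is binarized. The specific weight formula (a geometric decay) and the 0.5 threshold are assumptions for this example.

```python
import numpy as np

def foreground_probability_map(past_masks, motions, decay=0.8, threshold=0.5):
    """past_masks: list of (H, W) binary masks M(x, y | t0-k) for k = 1..K;
    motions: list of integer movement amounts (dx_k, dy_k) from time t0-k to t0.
    Returns (P(t0), M(t0)): the probability map and its binarized foreground information."""
    h, w = past_masks[0].shape
    prob = np.zeros((h, w), dtype=np.float64)
    weights = np.array([decay ** k for k in range(1, len(past_masks) + 1)])
    weights /= weights.sum()                        # w_k sum to 1 (an assumption)
    for wk, mask, (dx, dy) in zip(weights, past_masks, motions):
        shifted = np.zeros((h, w), dtype=np.float64)
        ys, xs = np.nonzero(mask)
        ys2, xs2 = ys + dy, xs + dx                 # motion-compensate the past mask
        ok = (ys2 >= 0) & (ys2 < h) & (xs2 >= 0) & (xs2 < w)
        shifted[ys2[ok], xs2[ok]] = 1.0
        prob += wk * shifted                        # weighted accumulation of shifted masks
    binary = (prob >= threshold).astype(np.uint8)   # binarization in the spirit of equation (14)
    return prob, binary
```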
  • The boundary region correction unit 1564 performs contour correction processing along the contour of the foreground image region indicated by the foreground region information M(x, y | t0).
  • FIG. 15 is a flowchart showing an example of the operation of the foreground area correction process according to the present embodiment.
  • The movement amount calculation unit 1561 reads, from the buffer unit 158, the information from time t0 to time t0-K (the video information (t0-k), video information (t0), foreground region information (t0-k), foreground region information (t0), ROI information (t0-k), and ROI information (t0)).
  • the foreground area probability map generation unit 1562 reads the foreground area information (t0-k) from time t0 to time t0-K from the buffer unit 158. Thereafter, the process proceeds to step S207-02.
  • Step S207-02 The movement amount calculation unit 1561 calculates the movement amount (t0, t0-k) of the foreground image area (t0-k) based on the information read out in step S207-01. Thereafter, the process proceeds to step S207-03.
  • Step S207-03 The movement amount calculation unit 1561 determines whether or not the movement amounts (t0, t0-k) from time t0-1 to time t0-K have been calculated (whether there is an unprocessed buffer). If it is determined that the movement amounts (t0, t0-k) from time t0-1 to time t0-K have been calculated (Yes), the process proceeds to step S207-04. On the other hand, if it is determined that there is a time t0-k for which the movement amount (t0, t0-k) has not been calculated (No), the value of k is changed, and the process returns to step S207-02.
  • Step S207-04 The foreground region probability map generation unit 1562 calculates the foreground region probability map P(t0) using the movement amounts (t0, t0-k) calculated in step S207-02 and the foreground region information (t0-k) read in step S207-01. Thereafter, the process proceeds to step S207-05.
  • Step S207-05 The foreground region probability map generation unit 1562 applies equation (14) to the foreground region probability map P(t0) calculated in step S207-04 to calculate the foreground region information M(x, y | t0).
  • The foreground region probability map generation unit 1562 extracts, as the foreground image region, the region whose foreground region information M(x, y | t0) is 1. Thereafter, the process proceeds to step S208-01.
  • Step S208-01 The boundary region correction unit 1564 performs contour correction processing along the contour of the foreground image region indicated by the foreground region information M(x, y | t0).
  • the mask information generation unit 157 generates a mask representing the corrected foreground image region (t) indicated by the information input from the foreground region correction unit 156. Note that the mask information generation unit 157 may generate a mask representing a background area that is an area other than the foreground image area. The mask information generation unit 157 outputs the generated mask information (t).
  • After the foreground region extraction processing for the video at time (t0) is completed, the buffer unit 158 discards the various data (the video information (t), the information indicating the foreground image region (t), the depth information (t), the ROI information (t), and so on) at the times (t) that satisfy the following condition A, and stores the various data (the video information (t0), the information indicating the foreground image region (t0), the depth information (t0), the ROI information (t0), and so on) at time (t0).
  • FIG. 16 is a flowchart illustrating an example of the operation of the buffer unit 158 according to the present embodiment.
  • Step S158-01 The buffer unit 158 searches for an empty buffer for storing the various information at time (t0). Thereafter, the process proceeds to step S158-02.
  • Step S158-02 The buffer unit 158 determines whether or not there is an empty buffer as a result of the search in step S158-01. If it is determined that there is an empty buffer (Yes), the process proceeds to step S158-05. On the other hand, if it is determined that there is no empty buffer (No), the process proceeds to step S158-03.
  • Step S158-03 The buffer unit 158 selects a buffer (referred to as a target buffer) in which information satisfying the condition A is stored. Thereafter, the process proceeds to step S158-04.
  • Step S158-04 The buffer unit 158 discards the various data stored in the target buffer selected in Step S158-03, thereby emptying the target buffer (clearing the storage area). Thereafter, the process proceeds to step S158-05.
  • Step S158-05 The buffer unit 158 stores various data at time (t) in the target buffer, and ends the buffer update control.
  • FIG. 17 is a flowchart illustrating an example of the operation of the object extraction unit 15 according to the present embodiment.
  • Step S201 The object extraction unit 15 reads various data (video information (t), depth information (t), ROI information (t)). Specifically, video information (t) is input to the filter unit 151a and the buffer unit 158, depth information (t) is input to the filter unit 151b, and ROI information (t) is input to the distribution model estimation unit 152 and the buffer unit 158. . Thereafter, the process proceeds to step S202.
  • Step S202 The object extraction unit 15 determines whether there is an extraction target image by determining whether the extraction target flag included in the ROI information (t) indicates presence or absence. If it is determined that there is an extraction target image (Yes), the process proceeds to step S203. On the other hand, when it determines with there being no extraction object image (No), it progresses to step S210.
  • Step S203 The filter unit 151a removes noise from the video information (t) input in step S201 and performs a smoothing process.
  • the filter unit 151b removes noise from the depth information (t) input in step S201 and performs a smoothing process. Thereafter, the process proceeds to step S204.
  • Step S204 The distribution model estimation unit 152, based on the depth information (t) smoothed in Step S203 and the ROI information (t) input in Step S201, the ROI depth distribution information (t) in the ROI. Is estimated. Thereafter, the process proceeds to step S205.
  • Step S205 The clustering unit 153 divides the processed image (t) into superpixels by performing clustering on the video information (t) smoothed in step S203.
  • the clustering unit 153 performs labeling for each super pixel to generate region information (t).
  • the process proceeds to step S206.
  • the feature amount calculation unit 154 includes the region information (t) generated in step S205, the video information (t) smoothed in step S203, the depth information (t) smoothed, and the ROI information. Based on (t), a feature value for each region (label) is calculated. Thereafter, the process proceeds to step S207.
  • Step S209 The mask information generation unit 157 generates mask information indicating a mask representing the foreground image region (t0) corrected in step S208. Thereafter, the process proceeds to step S210.
  • Step S210 The mask information generation unit 157 stores the mask information generated in step S209 in the mask information storage unit 16.
  • FIG. 18 is a schematic diagram illustrating an example of depth information (t) according to the present embodiment.
  • images D1, D2, and D3 indicate depth information (t1-2), depth information (t1-1), and depth information (t1), respectively.
  • portions having the same color indicate the same depth.
  • FIG. 18 shows that the bright (light) image portion of the color has a smaller depth (positioned forward) than the dark (dark) image portion of the color.
  • the depth information (t) in FIG. 18 is obtained by acquiring the amount of deviation of the parallax from the right-eye camera based on the left-eye camera with respect to the video shot by the stereo camera.
  • the left part of the image surrounded by the chain line (the part attached with the symbols U1 to U3) has a difference in parallax because the video seen from the left eye camera and the video seen from the right eye camera are different. This is an indefinite area where the amount cannot be determined.
  • FIG. 19 is a schematic diagram illustrating an example of the foreground area probability map P (t0) according to the present embodiment.
  • images P1, P2, and P3 respectively show a foreground area probability map P (t1-2), a foreground area probability map P (t1-1), and a foreground area probability map P (t1).
  • This foreground area probability map P (t) is calculated based on the depth information (t) in FIG.
  • FIG. 20 is an explanatory diagram of an example of the foreground image area (t) according to the present embodiment.
  • images M1a, M2a, and M3a are respectively a foreground image area (t1-2), a foreground image area (t1-1), and a foreground image area (after correction by the foreground area correction unit 156 according to the present embodiment).
  • Images M1b, M2b, and M3b are a foreground image area (t1-2), a foreground image area (t1-1), and a foreground image area (t1), respectively, according to the prior art.
  • the shape of the foreground image area is smoothed (stabilized), and compared with the images M1b, M2b, and M3b, the foreground image area has a missing portion and an erroneous extraction portion. Occurrence has been reduced. Thereby, in this embodiment, even when the images M1a, M2a, and M3a are reproduced along the time, it is possible to suppress the occurrence of flicker and flicker due to the discontinuity of the extracted shape.
  • the foreground area correction unit 156 converts the foreground image area (t0) indicated by the foreground area information (t0) at time t0 to the foreground area information (t0-k) at time t0-k. And video information (t0-k).
  • the video processing apparatus 1 can suppress the occurrence of flicker and flicker due to the discontinuity of the extraction shape, and can reliably extract the image of the object.
  • the movement amount calculation unit 1561 performs the video information (t0) and the ROI information (t0) at the time t0, and the video information (t0-k) and the foreground area information (at the time t0-k).
  • the foreground region probability map generation unit 1562 that calculates the amount of movement of the foreground image region (t0-k) from time t0-k to time t0 based on the t0-k) and ROI information Based on the movement amount calculated by the unit 1561 and the foreground image area (t0-k), a foreground area probability map P (t0) in which each coordinate in the video at time t0 is the foreground image area is calculated.
  • the boundary region correction unit 1564 extracts the foreground region information (t0) at time t0 based on the foreground region probability map P (t0) calculated by the foreground image probability map generation unit 1562, and the extracted foreground region information (t0) The foreground image area (t0) shown is corrected. Thereby, the video processing apparatus 1 can suppress the occurrence of flicker and flicker due to the discontinuity of the extraction shape, and can reliably extract the image of the object.
  • the filter unit 151a uses the bilateral filter of Expression (1), so that each image of the video information (t) is converted into a skeleton component in which an edge component is held, noise, and a pattern. Can be separated into texture components.
  • the filter unit 151b removes the noise of the depth information (t) and smoothes it.
  • the estimation accuracy of the mixed model of the depth distribution model in the distribution model estimation unit 152 can be improved.
  • the foreground area extraction unit 105 can accurately determine the threshold value relating to the depth used for the integration process of the basic foreground area and each area in the ROI.
  • the clustering unit 153 performs clustering on the skeleton component image after the edge holding smoothing filter.
  • the video processing apparatus 1 can obtain a superpixel group that is stable (robust) against noise and texture. Note that the super pixel represents a meaningful area having a certain large area.
  • the video processing apparatus 1 can obtain the depth distribution model of the foreground region with higher accuracy by obtaining the depth distribution model in the ROI using the ROI information by the mixed model.
  • the foreground area extraction unit 155 can accurately determine the threshold value relating to the depth used for the integration process of the basic foreground area and each area in the ROI.
  • the foreground region correction unit 156 corrects a missing portion of the extraction target image region at time (t0) or an erroneous extraction portion of the extraction target image region at time (t0) to extract the extracted image shape in the time direction ( Flickering and flickering due to discontinuity of (contour) can be suppressed.
  • the video processing apparatus 1 performs smoothing processing by removing noise in the video information (t). Thereby, in the video processing apparatus 1, it can suppress that the super pixel group with a small area
  • the movement amount calculation unit 1561 uses a motion search such as a spiral search, so that the amount of calculation required to obtain the movement amount (t0, t0-k) can be reduced. Further, the movement amount calculation unit 1561 may use only the white portion (foreground region) on the mask Mk when calculating the similarity (see FIG. 13). Thereby, in the video processing apparatus 1, it is possible to prevent a search error of the movement amount and to omit unnecessary calculation, compared to a case where template matching is performed including the background region.
  • the present invention is not limited to this, and other selection tools may be used.
  • a selection tool shown in FIGS. 21 and 22 may be used.
  • FIG. 21 is a schematic diagram illustrating another example of the user-specified ROI information (ts) detection process according to the present embodiment.
  • FIG. 21 is a diagram when an ellipse (circle) selection tool is used, and shows that the user has surrounded the target image to be extracted with the ellipse selection tool.
  • the position information of the frame reference circumscribed circle of the object O; the circumscribed circle includes an ellipse
  • the user-specified ROI information (ts) is recorded as data in Table 2 below, for example.
  • user-specified ROI information includes time ts (or frame number), presence / absence flag (extraction target flag) indicating whether or not an extraction target image exists in the circumscribed circle, and the center of the circumscribed circle
  • the position (x0, y0) (the coordinates of the point P2 in FIG. 21), the minor axis direction of the circumscribed circle (the direction of the vector labeled with the symbol D21 in FIG. 21), and the short side length (the length represented by the symbol W2 in FIG. ),
  • the major axis direction of the circumscribed circle (the direction of the vector labeled with D22 in FIG. 21) and the length of the long side (the length represented by the symbol L2 in FIG. 21).
  • FIG. 22 is a schematic diagram showing another example of detection processing of user-specified ROI information (ts) according to the present embodiment.
  • FIG. 22 is a diagram when a freehand selection tool is used, and represents that the user has surrounded the target image to be extracted.
  • the position information of the frame (the circumscribed shape of the object O) denoted by reference numeral r3 is user-specified ROI information (ts).
  • the user-specified ROI information (ts) is recorded, for example, as data in Table 3 below.
  • the user-specified ROI information (ts) includes time ts (or may be a frame number), a presence / absence flag (extraction target flag) indicating whether or not an extraction target image exists in the circumscribed shape, and the starting point of the circumscribed shape Position (x0, y0) (the position of the point at which the user started to input the circumscribed shape) (the coordinates of the point P3 in FIG. 22), on the edge of the circumscribed shape clockwise from the starting point position (or may be counterclockwise) It is represented by a chain code that represents a point.
  • the chain code is a numerical value of a position of a point B adjacent to a certain point A, and a numerical value of a position of a point C (a point other than the point A) adjacent to the adjacent point B. , And so on, and a line is represented by the combination of those numerical values.
  • the ROI information (t) obtained from the ROI acquisition unit 13 is used for each processing image unit, and the shape of the ROI surrounding the image area to be extracted is superimposed on the processing image (t). Then, the user may be notified that the image area to be extracted is selected. Further, the frame number and time information of the currently displayed processed image (t) may be presented to the user.
  • the foreground image area (t0-k) at the previous time t0-k is used for the foreground image area (t0) at the time t0 with the foreground area correction unit 156 has been described.
  • the invention is not limited to this.
  • only the foreground image region (t0 + k) at time t0 + k after a certain time t0 may be used, or only the foreground image region (t0 ⁇ k) at time t0 ⁇ k before and after a certain time t0 may be used. .
  • k 1 may be sufficient.
  • the depth information (t) may not be one piece of information for one pixel of the video information (t), but one piece of information for a plurality of adjacent pixels. Also good. That is, the resolution represented by the depth information (t) may be different from the resolution of the video information (t). Further, the depth information (t) is calculated by, for example, stereo matching in which a subject is imaged by a plurality of adjacent imaging devices, and a displacement such as the position of the subject is detected from a plurality of captured video information to calculate a depth. Information. However, the depth information (t) is not limited to information calculated by a passive stereo method such as stereo matching, but an active three-dimensional measuring instrument (range finder) using light such as TOF (Time-Of-Flight) method. It may be the information acquired by.
  • the video display unit 14 is a touch panel type display
  • the present invention is not limited to this, and other input means may be used, and the video processing apparatus 1 may display video.
  • an input unit for example, a pointing device such as a mouse
  • a pointing device such as a mouse
  • the ROI acquisition unit 13 may extract the ROI information (t) using, for example, any of the following methods (1) to (5).
  • ROI acquisition unit 13 extracts the ROI information (t) using the feature points.
  • the present invention is not limited to this, for example, the distribution of the color information of the user-specified area
  • ROI information (t) may be extracted by a particle filter or Mean-shift.
  • the ROI acquisition unit 13 may extract the ROI information (t) using a known motion search.
  • the filter units 151a and 151b use bilateral filters.
  • the filter units 151a and 151b may be other filters, for example, TV (Total Variation) filters. , K-nearest neighbor averaging filter, median filter, low pass filter may be used only for the flat part with small edge strength.
  • the filter parts 151a and 151b may perform an edge smoothing filter recursively.
  • the video processing device 1 may perform the edge holding smoothing filter processing on the video information (t) and the depth information (t) before inputting them to the object extraction unit 15.
  • the distribution model estimation unit 152 may set the number of classes Kc used for the mixed model to a predetermined value, or may determine a value as in the following example.
  • the distribution model estimation unit 152 sets a predetermined class number Kc ′ as the class number Kc, and performs clustering by the K-means method.
  • the distribution model estimation unit 152 performs a process of merging the class Ci and the class Cj into a new class Ck ′. Do.
  • the distribution model estimation unit 152 determines the number of classes Kc ( ⁇ Kc ′) by repeating this process until the number of classes converges to a constant value.
  • the method used by the distribution model estimation unit 152 to estimate the depth distribution model is not limited to a parametric estimation method such as a mixed model, and may be a non-parametric estimation method such as a Mean-shift method.
  • Clustering in the image space is a method for performing region division based on the similarity between pixels or pixel groups (regions) constituting the region in the original image space without mapping to the feature amount space. is there.
  • the clustering unit 153 may perform clustering in the image space using the following method.
  • (A) Pixel Combining Method For example, the clustering unit 153 represents the connection relationship between pixels as a weighted undirected graph, and performs region integration based on the strength of the edge representing the connection relationship so that the vertices form a global minimum tree.
  • Region growth method also referred to as region growing method
  • C Region division integration method (also referred to as Split & Merge method)
  • D A method combining any one of (a), (b), and (c) Note that the clustering unit 153 performs labeling after the clustering processing in the image space, and region information (label information) indicating the labeling result ( t).
  • the movement amount calculation unit 1561 calculates the similarity R SAD (Equation (9)) (SAD (Sum of Absolute Difference)), and determines the region having the smallest value as the estimation region.
  • R SAD Equivalent Binary Difference
  • SAD Sud of Absolute Difference
  • the movement amount calculation unit 1561 calculates the absolute value of the difference between the corresponding pixel values between the images using the Euclidean distance (L 2 ⁇ distance, L 2 ⁇ norm), and the sum R SDD (the following equation (16)) The region having the smallest value of is determined as the estimated region.
  • Expression (16) indicates that the absolute value is calculated by the Euclidean distance (L 2 -distance, L 2 -norm), and the sum of i and j is taken.
  • NCC Normalized Cross-Correlation
  • the movement amount calculation unit 1561 determines an area where the RNCC value of the following equation (17) is closest to 1 as an estimated area.
  • CCC Cross-Correlation Coefficient
  • the movement amount calculation unit 1561 determines an area where the value of R CCC in the following equation (18) is closest to 1 as an estimated area.
  • the eye bar (“-” (bar) on “I” (eye)) and tea bar (“-” (bar) on “T” (tea)) in formula (18) are respectively This represents an average vector of pixel values in the indicated area.
  • the operation amount of the movement amount calculation unit 1561 increases in the order of the equations (9), (16), (17), and (18).
  • the movement amount calculation unit 1561 may calculate the movement amount using a hierarchical search method (also referred to as multi-resolution method or coarse-to-fine search method) instead of the spiral search. Good.
  • the clustering unit 153 may perform clustering on the images in the ROI based on the ROI information (t) obtained from the ROI acquisition unit 13. Thereby, the clustering unit 153 can reduce the arithmetic operation. Further, the clustering unit 153 may perform clustering on an area wider than the target image area based on the ROI information. As a result, the clustering unit 153 can improve the accuracy of clustering as compared to the case where clustering is performed on images in the ROI.
  • the region integration unit 1534 determines whether or not a part of the region exceeds the ROI boundary with respect to the region having the center of gravity in the ROI when determining region integration. You may determine using a rectangle. Thereby, the video processing apparatus 1 can reduce erroneous extraction of the background area as the foreground area.
  • the area integration unit 1534 may determine the area integration using the adjacent relationship between the areas instead of using the feature amount of the basic foreground area. For example, the region integration unit 1534 may determine the region integration with the region adjacent to the region that has already been determined to be the foreground region, using the feature amount of the region that has already been determined to be the foreground region. As a result, the video processing apparatus 1 can extract the foreground region with higher accuracy.
  • a part of the video processing apparatus 1 in the above-described embodiment may be realized by a computer.
  • the program for realizing the control function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read by a computer system and executed.
  • the “computer system” is a computer system built in the video processing apparatus 1 and includes an OS and hardware such as peripheral devices.
  • the “computer-readable recording medium” refers to a storage device such as a flexible medium, a magneto-optical disk, a portable medium such as a ROM or a CD-ROM, and a hard disk incorporated in a computer system.
  • the “computer-readable recording medium” is a medium that dynamically holds a program for a short time, such as a communication line when transmitting a program via a network such as the Internet or a communication line such as a telephone line,
  • a volatile memory inside a computer system serving as a server or a client may be included and a program that holds a program for a certain period of time.
  • the program may be a program for realizing a part of the functions described above, and may be a program capable of realizing the functions described above in combination with a program already recorded in a computer system.
  • a part or all of the video processing device 1 in the above-described embodiment may be realized as an integrated circuit such as an LSI (Large Scale Integration).
  • Each functional block of the video processing apparatus 1 may be individually made into a processor, or a part or all of them may be integrated into a processor. Further, the method of circuit integration is not limited to LSI, and may be realized by a dedicated circuit or a general-purpose processor. Further, in the case where an integrated circuit technology that replaces LSI appears due to progress in semiconductor technology, an integrated circuit based on the technology may be used.
  • the present invention is suitable for use in a video processing apparatus.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

L'invention porte sur une unité de correction de région de premier plan qui est un dispositif de traitement vidéo qui extrait des informations de région de premier plan représentant une image d'un premier plan à partir d'informations vidéo représentant une vidéo, l'image de premier plan représentée par les informations de région de premier plan étant corrigées durant un premier temps à l'aide des informations de région de premier plan et des informations vidéo durant un second temps.
PCT/JP2011/073639 2010-10-14 2011-10-14 Dispositif de traitement vidéo, procédé de traitement vidéo et programme de traitement vidéo Ceased WO2012050185A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010-231928 2010-10-14
JP2010231928A JP5036084B2 (ja) 2010-10-14 2010-10-14 映像処理装置、映像処理方法、及びプログラム

Publications (1)

Publication Number Publication Date
WO2012050185A1 true WO2012050185A1 (fr) 2012-04-19

Family

ID=45938406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/073639 Ceased WO2012050185A1 (fr) 2010-10-14 2011-10-14 Dispositif de traitement vidéo, procédé de traitement vidéo et programme de traitement vidéo

Country Status (2)

Country Link
JP (1) JP5036084B2 (fr)
WO (1) WO2012050185A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709328A (zh) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 车辆跟踪方法、装置及电子设备
US20200402256A1 (en) * 2017-12-28 2020-12-24 Sony Corporation Control device and control method, program, and mobile object
US11087169B2 (en) * 2018-01-12 2021-08-10 Canon Kabushiki Kaisha Image processing apparatus that identifies object and method therefor

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2013248207A1 (en) * 2012-11-15 2014-05-29 Thomson Licensing Method for superpixel life cycle management
KR101896301B1 (ko) * 2013-01-03 2018-09-07 삼성전자주식회사 깊이 영상 처리 장치 및 방법
JP6174894B2 (ja) * 2013-04-17 2017-08-02 キヤノン株式会社 画像処理装置および画像処理方法
JP6341650B2 (ja) * 2013-11-20 2018-06-13 キヤノン株式会社 画像処理装置、画像処理方法及びプログラム
JP6445775B2 (ja) * 2014-04-01 2018-12-26 キヤノン株式会社 画像処理装置、画像処理方法
JP6546385B2 (ja) * 2014-10-02 2019-07-17 キヤノン株式会社 画像処理装置及びその制御方法、プログラム
JP6403207B2 (ja) * 2015-01-28 2018-10-10 Kddi株式会社 情報端末装置
JP6655513B2 (ja) * 2016-09-21 2020-02-26 株式会社日立製作所 姿勢推定システム、姿勢推定装置、及び距離画像カメラ
JP2020167441A (ja) * 2019-03-28 2020-10-08 ソニーセミコンダクタソリューションズ株式会社 固体撮像装置、及び電子機器
DE112021001882T5 (de) * 2020-03-26 2023-01-12 Sony Semiconductor Solutions Corporation Informationsverarbeitungseinrichtung, informationsverarbeitungsverfahren und programm
JP7467773B2 (ja) * 2021-05-24 2024-04-15 京セラ株式会社 教師データ生成装置、教師データ生成方法、及び画像処理装置
WO2023047643A1 (fr) * 2021-09-21 2023-03-30 ソニーグループ株式会社 Appareil de traitement d'informations, procédé de traitement d'image et programme
JP2023086370A (ja) * 2021-12-10 2023-06-22 オムロン株式会社 オブジェクト検出装置、オブジェクト検出方法、およびオブジェクト検出プログラム

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11112871A (ja) * 1997-09-30 1999-04-23 Sony Corp 画像抜き出し装置および画像抜き出し方法、画像符号化装置および画像符号化方法、画像復号装置および画像復号方法、画像記録装置および画像記録方法、画像再生装置および画像再生方法、並びに記録媒体
JP2001076161A (ja) * 1999-09-02 2001-03-23 Canon Inc 画像処理方法及び装置並びに記憶媒体
JP2008523454A (ja) * 2004-12-15 2008-07-03 ミツビシ・エレクトリック・リサーチ・ラボラトリーズ・インコーポレイテッド 背景領域および前景領域をモデリングする方法
JP2009526495A (ja) * 2006-02-07 2009-07-16 クゥアルコム・インコーポレイテッド モード間の関心領域画像オブジェクト区分

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002024834A (ja) * 2000-07-11 2002-01-25 Canon Inc 画像処理装置及びその方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11112871A (ja) * 1997-09-30 1999-04-23 Sony Corp 画像抜き出し装置および画像抜き出し方法、画像符号化装置および画像符号化方法、画像復号装置および画像復号方法、画像記録装置および画像記録方法、画像再生装置および画像再生方法、並びに記録媒体
JP2001076161A (ja) * 1999-09-02 2001-03-23 Canon Inc 画像処理方法及び装置並びに記憶媒体
JP2008523454A (ja) * 2004-12-15 2008-07-03 ミツビシ・エレクトリック・リサーチ・ラボラトリーズ・インコーポレイテッド 背景領域および前景領域をモデリングする方法
JP2009526495A (ja) * 2006-02-07 2009-07-16 クゥアルコム・インコーポレイテッド モード間の関心領域画像オブジェクト区分

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200402256A1 (en) * 2017-12-28 2020-12-24 Sony Corporation Control device and control method, program, and mobile object
US11822341B2 (en) * 2017-12-28 2023-11-21 Sony Corporation Control device, control method, and mobile object to estimate the mobile object's self-position
US11087169B2 (en) * 2018-01-12 2021-08-10 Canon Kabushiki Kaisha Image processing apparatus that identifies object and method therefor
CN111709328A (zh) * 2020-05-29 2020-09-25 北京百度网讯科技有限公司 车辆跟踪方法、装置及电子设备
CN111709328B (zh) * 2020-05-29 2023-08-04 北京百度网讯科技有限公司 车辆跟踪方法、装置及电子设备

Also Published As

Publication number Publication date
JP2012085233A (ja) 2012-04-26
JP5036084B2 (ja) 2012-09-26

Similar Documents

Publication Publication Date Title
JP5036084B2 (ja) 映像処理装置、映像処理方法、及びプログラム
EP3104332B1 (fr) Manipulation d'images numériques
KR102121707B1 (ko) 오브젝트 디지타이제이션 기법
CN105069808B (zh) 基于图像分割的视频图像深度估计方法
Crabb et al. Real-time foreground segmentation via range and color imaging
KR101670282B1 (ko) 전경-배경 제약 조건 전파를 기초로 하는 비디오 매팅
Gonçalves et al. HAIRIS: A method for automatic image registration through histogram-based image segmentation
US9542735B2 (en) Method and device to compose an image by eliminating one or more moving objects
CN105989604A (zh) 一种基于kinect的目标物体三维彩色点云生成方法
CN102034247B (zh) 一种基于背景建模对双目视觉图像的运动捕捉方法
US10249029B2 (en) Reconstruction of missing regions of images
Wang et al. Simultaneous matting and compositing
CN106296732B (zh) 一种复杂背景下的运动目标追踪方法
CN114913463B (zh) 一种图像识别方法、装置、电子设备及存储介质
CN105809673A (zh) 基于surf算法和合并最大相似区域的视频前景分割方法
JP6272071B2 (ja) 画像処理装置、画像処理方法及びプログラム
AU2014277855A1 (en) Method, system and apparatus for processing an image
JP2011517226A (ja) デジタルピクチャにおいて対象の鮮明度を高めるシステム及び方法
CN114600160A (zh) 生成三维(3d)模型的方法
Recky et al. Façade segmentation in a multi-view scenario
WO2022056875A1 (fr) Procédé et appareil de segmentation d'image de plaque signalétique et support de stockage lisible par ordinateur
Engels et al. Automatic occlusion removal from façades for 3D urban reconstruction
Finger et al. Video Matting from Depth Maps
JP2013120504A (ja) オブジェクト抽出装置、オブジェクト抽出方法、及びプログラム
Xiang et al. A modified joint trilateral filter based depth map refinement method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11832615

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11832615

Country of ref document: EP

Kind code of ref document: A1