
CN1466737A - Image conversion and coding technology - Google Patents

Image conversion and coding technology

Info

Publication number
CN1466737A
Authority
CN
China
Prior art keywords
depth
image
algorithm
pixel
sequence
Prior art date
Legal status
Pending
Application number
CNA018162142A
Other languages
Chinese (zh)
Inventor
P·V·哈曼
S·R·福克斯
M·R·多雷
J·C·福来克
Current Assignee
DYNAMIC DIGITAL RESEARCH Pty Ltd
Original Assignee
DYNAMIC DIGITAL RESEARCH Pty Ltd
Priority date
Filing date
Publication date
Priority claimed from AUPQ929200A0
Application filed by DYNAMIC DIGITAL RESEARCH Pty Ltd
Publication of CN1466737A

Landscapes

  • Image Analysis (AREA)

Abstract

A method of creating a depth map, comprising the steps of: assigning a depth to at least one pixel or portion of an image; determining the relative position and image characteristics of the at least one pixel or portion of the image; utilising the depths, image characteristics and respective relative positions to determine an algorithm that ascertains depth characteristics as a function of relative position and image characteristics; and utilising said algorithm to calculate a depth characteristic for each pixel or portion of the image, wherein the depth characteristics form a depth map of the image. In a second phase of processing, the depth maps form key frames that are used to generate depth maps for non-key frames using relative position, image characteristics and distance to the key frames.

Description

Image conversion and coding technology

Field of the Invention

The present invention is an improved technique for obtaining depth maps from one or more 2D images.

Background of the Invention

A number of image processing tasks require the depth of objects within an image to be known. Such tasks include the application of special effects to film and video sequences and the conversion of 2D images into stereoscopic 3D. Determining the depths of objects is known as the process of creating a depth map. In a depth map each object is coloured a shade of grey, such that the shade indicates the depth of the object from a fixed point. Typically, distant objects are shaded darker and nearby objects lighter. No standard convention for creating depth maps has been adopted; the shading may be inverted, or different colours may be used to represent different depths. For the purposes of explanation in this document, distant objects are coloured darker than nearer objects, and the colouring is a grey scale.

Traditionally, manual methods have been used to create depth maps from existing 2D images. It will be appreciated that, although an operator can discern objects and their associated depths, an image is merely a series of pixels to a computer.

The creation of a depth map has involved a system whereby each object of the image to be converted is manually outlined and assigned a depth. It will be appreciated that this process is slow, time consuming and expensive. The outlining step is typically performed using a software program in conjunction with a mouse. An example of a software program that may be used for this task is Adobe "After Effects". An operator using After Effects typically traces around the outline of each object that is to be assigned a depth, and then fills or "colours" the object with the shade of grey that determines its depth or distance from the viewer. This process is then repeated for each object in the image. Further, where a number of images are involved, as in a film, these steps must be performed for each image or frame of the film.

In traditional systems, the outline of an object is typically described in the form of a curve, such as a Bezier curve. The use of such curves enables the operator to alter the shape of the outline so that it can be accurately aligned with the object.

If depth is to be assigned to a series of images, as in a film or video, the process must be repeated for each frame in succession.

The size, position and/or depth of an object may change continuously. In such cases the operator must manually track the object from frame to frame, processing each frame by adjusting the curves and updating the object's depth by changing the shade of grey as required. It will be appreciated that this is a slow, tedious, time-consuming and expensive process.

Attempts have been made in the past to improve this process. The prior art describes techniques that aim to automatically track the outline of an object as it moves from frame to frame. An example of such a technique is the application of Active Contours (reference: Active Contours - Andrew Blake and Michael Isard - ISBN 3-540-76217-5). The main limitation of this approach is the need to provide the software using the technique with the expected motion of the tracked object. This is severely limiting where the expected motion is unknown or involves complex deformation, or where multiple objects with differing motion characteristics must be tracked simultaneously.

Point-based tracking methods have also been used to determine the motion of outlines. These are common in editing environments such as Commotion and After Effects. Their application is, however, very limited, since it is usually impossible to determine a single tracking point whose motion reflects the motion of the whole object. Point tracking is sometimes acceptable when an object undergoes a simple translation, but it cannot handle shape deformation, occlusion or a variety of other common problems.

An Israeli company, AutoMedia, has produced a software product called AutoMasker. It enables an operator to outline an object and track it from frame to frame. The product depends on tracking the colour of the object, and therefore fails when similarly coloured objects cross one another. The product also has difficulty tracking objects whose size changes over successive frames, for example when an object approaches the viewer or moves forward on the screen.

None of these methods satisfactorily assigns or tracks depth maps, and manual systems are therefore still used for the creation of depth maps.

Other techniques described in the prior art rely on reconstructing the motion of the camera originally used to record the 2D sequence. These techniques are limited by the requirement for camera motion in the original image sequence, and for well-defined features in each frame that can serve as tracking points.

Object of the Invention

At present an operator must manually create a depth map for each image frame in order to obtain acceptable results. It is an object of the present invention to reduce the number of frames for which depth must be created manually, and thereby reduce the time an operator requires to create the depth maps.

A set of frames will still require manually created depth maps. It is a further object of the invention to assist the manual process of depth map creation for these frames.

Summary of the Invention

With the above objects in mind, the present invention provides a method of creating a depth map, including the steps of:

assigning a depth to at least one pixel or portion of an image;

determining the relative position and image characteristics for each said at least one pixel or portion of said image;

utilising said depths, image characteristics and respective relative positions to determine the configuration of a first algorithm that ascertains depth characteristics as a function of relative position and image characteristics;

utilising said first algorithm to calculate a depth characteristic for each pixel or portion of said image;

wherein said depth characteristics form a depth map of said image.

In a further aspect, the present invention provides a method of creating a depth map, including the steps of:

assigning a depth to at least one pixel or portion of an image;

determining the x,y coordinates and image characteristics for each said at least one pixel or portion of said image;

utilising said depths, image characteristics and respective x,y coordinates to determine a first algorithm that ascertains depth characteristics as a function of x,y coordinates and image characteristics;

using said first algorithm to calculate a depth characteristic for each pixel or portion of said image;

wherein said depth characteristics form a depth map of said image.

In a further aspect, the present invention provides a method of creating a series of depth maps for an image sequence, including the steps of:

receiving a depth map for at least one frame of said image sequence;

utilising said depth map to determine the configuration of an algorithm that ascertains depth characteristics as a function of relative position and image characteristics;

utilising said algorithm to create a depth map for each frame of said image sequence.

In yet a further aspect, the present invention provides a method of creating a series of depth maps for an image sequence, including the steps of:

selecting at least one key frame from said image sequence;

for each said at least one key frame, assigning a depth to at least one pixel or portion of that frame;

determining the relative position (for example, the x,y coordinates) and image characteristics of each said at least one pixel or portion of each said frame;

utilising the depths, image characteristics and relative positions for each said at least one frame to determine the configuration of an algorithm for each said at least one frame that ascertains depth characteristics as a function of relative position and image characteristics;

utilising each said configuration of said algorithm to calculate the depth characteristics of each pixel or portion of each said at least one frame;

wherein said depth characteristics form a depth map of each said at least one frame;

utilising each depth map to determine a second configuration of a second algorithm that ascertains the depth characteristics of each frame as a function of relative position and image characteristics;

utilising said second algorithm to create a respective depth map for each frame of said image sequence.

It will be appreciated that the system, by way of the algorithm, may in effect create a number of different functions in order to create a depth map as a result of the relative positions and image characteristics. In the preferred system, the relative position will be a measurement of the x,y coordinates.

A system embodying the present invention may predetermine which frames of the sequence are to be treated as key frames, for example every fifth frame. Ideally, the algorithm may also take time as an input in order to further refine the processing.

Brief Description of the Invention

The present invention is intended to improve the process of generating depth maps for associated 2D images. The preferred embodiment involves two phases: the generation of key frame depth maps, and the generation of the remaining maps.

The first phase obtains a small amount of data from the user. This data indicates the basic structure of the scene. From the 2D image and its associated data an algorithm is derived that captures the relationship between the depth z assigned by the user to various image pixels, their x and y positions, and the image characteristics. The image characteristics include, but are not limited to, the RGB value of each pixel. In general terms, the algorithm solves the equation z = f(x, y, R, G, B) for each pixel the user has defined in the frame.

The algorithm then applies the captured relationship to the other pixels of the image to produce a depth map. If required, the user may refine their data to improve the accuracy of the depth map. It should be noted that the initial depth data need not be specified by a user - it may be determined by some other process, including, but not limited to, automatic structure-from-motion algorithms or depth estimates obtained from stereo images.

The second phase requires 2D images and associated depth maps at selected key frames. The depth maps at these key frames may be generated as previously disclosed by the applicant, or generated automatically using depth capture techniques including, but not limited to, laser range finders, i.e. LIDAR (Light Direction and Range) instruments, and depth-from-focus techniques.

From the 2D image and associated depth map of each key frame an algorithm is derived that captures the relationship between the depth z assigned to each pixel, its x and y position, and the image characteristics. The image characteristics include, but are not limited to, the RGB value of each pixel. In general terms, the algorithm solves the equation z = f(x, y, R, G, B) for each pixel of the key frame.

The algorithm is then presented with each successive frame between adjacent key frames, and a value of z is calculated for each pixel using the algorithm.

Brief Description of the Drawings

Figure 1 illustrates an embodiment of the phase one training process.

Figure 2 illustrates an embodiment of the phase one conversion process.

Figure 3 illustrates an embodiment of the phase two training process.

Figure 4 illustrates an embodiment of the phase two conversion process.

Figure 5 depicts how the learning process may partition the feature space.

Figure 6 illustrates an alternative process for the generation of depth maps in phase two.

Figure 7 illustrates an alternative method of determining the depths of individual pixels in phase two.

Figure 8 illustrates the process of searching for candidate training samples.

Figure 9 illustrates the process of calculating a depth from a number of candidate training samples.

Detailed Description of the Invention

The present invention provides an improved technique for deriving depth maps from one or more 2D images. The invention preferably comprises two phases, each of which ideally incorporates an automated learning process.

Phase 1

The first phase operates on a single image. The image is presented to a user, who uses a simple graphical interface to determine the approximate depths of various regions of the image. The graphical interface may provide tools that assist the user in assigning depths to pixels, including, but not limited to, pen and paintbrush tools, area fill tools and tools that assign depth based on pixel colour. The result of this process is that a depth is determined for a subset of the pixels in the image.
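
Purely as an illustration of the data this interface produces (a minimal Python sketch; the array layout, sizes and values are assumptions, not part of the disclosure), the sparse user annotations can be held in a depth array aligned with the image, with NaN marking pixels the user has not yet assigned:

```python
import numpy as np

h, w = 480, 640                          # illustrative image size
image = np.zeros((h, w, 3), np.uint8)    # the RGB image being annotated

# Sparse depth annotation: NaN marks "not yet assigned by the user".
user_depth = np.full((h, w), np.nan, dtype=np.float32)

# A brush tool might assign depths to small regions:
user_depth[100:120, 200:240] = 0.8       # near object (lighter shade)
user_depth[300:340, 50:90] = 0.2         # distant object (darker shade)

# The training set is the subset of pixels with a defined depth.
rows, cols = np.nonzero(~np.isnan(user_depth))
print(f"{len(rows)} labelled pixels out of {h * w}")
```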

This process is illustrated in Figure 1, in which a 2D image 1 is presented to the user. The user may then assign depths to various pixels of the image 2. In the example of Figure 1, the pixels marked "X" are those for which the user has not yet specified a depth. The system then correlates the 2D image 1 with the depth data 2 provided by the user, and uses a training algorithm 3 to assist in the creation of a mapping function 4 capable of determining the depth of every pixel of the image.

The information provided by the user constitutes the training data used with the learning process, described later, which relates a depth to each pixel of the single image. The process may be interactive, in that the user may determine approximate depths for only some regions. Based on the results of the learning process for those regions, the user may provide further depth estimates for regions in which the learning process has performed poorly. This interaction between the user and the learning process may be repeated a number of times; in effect, the user guides the learning process at this stage. It should again be noted that the original depths need not be specified by the user - they may be determined by some other process, as described above.

Creating the Mapping Function

Once the system has been provided with the image and some pixel depths, it analyses the pixels with determined depths in order to create a mapping function. The mapping function may be any procedure or function that takes as input measurements of a pixel or group of pixels from the image, and provides as output a depth value for that pixel or group of pixels.

Individual pixel measurements may include the red, green and blue values, other measurements such as luminance, chrominance and contrast, and spatial measurements such as the horizontal and vertical position within the image. Alternatively, the mapping function may operate on higher-level image features, such as larger groups of pixels, or on measurements over groups of pixels such as means and variances, or on edges, corners and the like (i.e. the responses of feature detectors). A larger group of pixels may, for example, represent a segment of the image, being a connected group of pixels forming a homogeneous region.

Purely for the purposes of description, a pixel may be represented in the form x, y, R, G, B, z, where x and y represent the relative position of the pixel as its x and y coordinates, R, G and B represent its red, green and blue values, and z represents its depth. The value of z is determined only where the user has specified one.

The mapping function is obtained by capturing the relationship between the image data and the depth data of the pixels specified by the user. The mapping function may take the form of any general processing unit that receives input data, processes it and provides an output. Preferably, the processing unit is subjected to a learning process in which its properties are determined by examining the user data and the associated image data.

Those working in the fields of artificial intelligence or machine learning will understand this process of learning the relationship between input data and a desired output, and that it may take many forms. It should be noted that such people do not generally work in the field of stereoscopic systems or the conversion of 2D images to 3D. In machine learning such mapping functions are well known, and include, but are not limited to, neural networks, decision trees, decision graphs, model trees and nearest-neighbour classifiers. The preferred embodiment of the learning algorithm seeks to devise a mapping function that minimises some measure of the mapping error and generalises satisfactorily to values outside the original data set.

The learning algorithm may attempt to determine the relationship between the 2D image information and depth both over the entire image and over smaller local spatial regions.

This relationship may then be applied to complete the depth maps for an entire sequence.

This is illustrated in Figure 2, in which data from the 2D image 1 is input to the created mapping function 4, thereby creating a depth map 5 of the 2D image 1.

Examples of successful learning algorithms are the backpropagation algorithm for learning neural networks, the C4.5 algorithm for learning decision trees, locally weighted linear regression, and the K-means algorithm for learning cluster-based classifiers.

Purely for the purposes of description, the learning algorithm may be considered to calculate the following relationship for each pixel of a frame of the 2D image sequence:

z_n = k_a·x_n + k_b·y_n + k_c·R_n + k_d·G_n + k_e·B_n

where

n is the nth pixel of the key frame image;

z_n is the depth value assigned to the pixel at x_n, y_n;

k_a to k_e are constants determined by the algorithm;

R_n is the value of the red component of the pixel at x_n, y_n;

G_n is the value of the green component of the pixel at x_n, y_n;

B_n is the value of the blue component of the pixel at x_n, y_n.

This process is illustrated in Figure 1.

The skilled person will appreciate that the above equation has been simplified for the purposes of description and would not work well in practice. In a practical application using, say, a neural network, and given the large number of pixels in an image, the network will learn a large equation containing many k values, multiplications and additions. Moreover, the k values may vary at different x,y positions within the image, to suit the local image characteristics.
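
To make the simplified equation above concrete, the following hedged Python sketch fits the constants k_a to k_e (plus a bias term added here purely for illustration) by ordinary least squares over a handful of labelled pixels. As the text notes, a practical system would use a richer learner such as a neural network; all names and sample values below are illustrative:

```python
import numpy as np

def fit_linear_depth(x, y, rgb, z):
    """Least-squares fit of z = ka*x + kb*y + kc*R + kd*G + ke*B + k0
    over the user-labelled pixels (k0 is a bias term added for illustration)."""
    features = np.column_stack([x, y, rgb, np.ones(len(z))])
    k, *_ = np.linalg.lstsq(features, z, rcond=None)
    return k  # ka, kb, kc, kd, ke, k0

# Illustrative labelled samples: columns are x, y, R, G, B, z.
# (In practice there would be far more labelled pixels than unknowns.)
samples = np.array([
    [ 10.0,  20.0, 200.0,  30.0,  30.0, 0.9],
    [300.0, 200.0,  30.0, 180.0,  40.0, 0.4],
    [600.0, 400.0,  20.0,  40.0, 200.0, 0.1],
    [320.0, 240.0, 120.0, 120.0, 120.0, 0.5],
])
k = fit_linear_depth(samples[:, 0], samples[:, 1], samples[:, 2:5], samples[:, 5])
print("fitted constants:", k)
```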

Applying the Mapping Function to the 2D Image

The present invention then takes the mapping function and applies it across an entire frame of the 2D image sequence. For a given pixel, the inputs to the mapping function are determined in a manner similar to that used when the mapping function was obtained during learning. For example, if the mapping function was obtained by taking measurements of individual pixels as input, the mapping function will require those same measurements as input. Given these inputs, the mapping function performs the task it has learnt and outputs a depth measurement. Again, in the example of an individual pixel, this depth measurement may be a simple depth value. In this example the mapping function is applied across the entire image, thereby completing the full set of depth data for the image. Alternatively, if larger groups of pixels were used to train the mapping function, such larger groups of pixels must be generated for the image. The higher-level measurements over these groups of pixels, such as means and variances, are computed in the same manner as during the learning process. With these inputs established, the mapping function generates the required depth measurement for each group of pixels.
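
Continuing the least-squares sketch above (again with assumed names and values, not the disclosed method), applying the fitted coefficients to every pixel of a frame yields the complete depth map:

```python
import numpy as np

def apply_mapping(image, k):
    """Apply the fitted mapping z = f(x, y, R, G, B) to every pixel of a
    frame, producing the frame's complete depth map."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]          # per-pixel y and x coordinates
    features = np.column_stack([
        xs.ravel(), ys.ravel(),
        image.reshape(-1, 3).astype(np.float64),
        np.ones(h * w),                  # bias column matching the fit
    ])
    return (features @ k).reshape(h, w)

frame = np.random.default_rng(0).integers(0, 256, (480, 640, 3)).astype(np.uint8)
k = np.array([1e-4, 1e-4, 1e-3, 1e-3, 1e-3, 0.0])   # e.g. from fit_linear_depth
depth_map = apply_mapping(frame, k)
print(depth_map.shape)   # (480, 640)
```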

This process is illustrated in Figure 2, and results in a complete depth map for the 2D image. If the resulting depth map contains regions with errors, the user data may be modified and the process repeated in order to correct those regions. The mapping function may also be applied to other frames to produce their depth maps.

Those familiar with the field of machine learning will appreciate that the training phase may be implicit in the general configuration of the algorithm. Such methods are referred to as instance-based learning, and include, but are not limited to, techniques such as locally weighted linear regression. In an alternative embodiment, the user may define a set of objects and assign pixels to those objects. In this embodiment, the process of generalising the user data to the other pixels of the image may segment the entire image into the groups of objects initially defined by the user. The mapping function defining the objects, or the objects themselves, may be the required output of this embodiment. Alternatively, functions may be applied to the objects to specify the depths of those objects, thereby forming the depth map of the image. Such functions may take the form of depth ramps and other means of defining the depth of an object, as defined in the prior application PCT/AU00/00700.

In a further alternative embodiment, the training algorithm may introduce a random component into the user information. This may be used with any learning algorithm to help overcome the problem of overtraining. Overtraining refers to the situation in which the learning algorithm merely memorises the training information - much as a child might learn to write out a multiplication table without any understanding of multiplication itself. The problem is well known in the field of machine learning, and one means of alleviating it is to introduce random noise into the training data. A good learning algorithm will be forced to discriminate between the noise and the quality information in the training data. In doing so it is encouraged to learn the nature of the data, rather than simply memorise it. One embodiment of this approach refers to the previous example, in which the training algorithm learns the following function:

z_n = k_a·x_n + k_b·y_n + k_c·R_n + k_d·G_n + k_e·B_n

When the values of z, x, y, R, G and B are input to the training algorithm, a small noise component may be added to these values. The noise component may be a small positive or negative random number. In the preferred embodiment, no noise is added to the z component.
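
A minimal sketch of this noise injection, assuming features normalised so that a small uniform perturbation is meaningful (the magnitude is not specified in the text and is chosen arbitrarily here):

```python
import numpy as np

def add_training_noise(features, scale=0.01, rng=None):
    """Add a small positive or negative random component to the input
    features (x, y, R, G, B); the depth z is deliberately left untouched."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return features + rng.uniform(-scale, scale, size=features.shape)

feats = np.array([[10.0, 20.0, 200.0, 30.0, 30.0]])   # one x, y, R, G, B sample
print(add_training_noise(feats))
```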

The Learning Process

In the preferred embodiment, the inputs to the learning process are:

1. a number of training samples, which have certain characteristics including depth; and

2. a number of "classification" samples, which have characteristics matching those of the training samples and whose depths are to be determined by the learning process.

The training samples comprise individual pixels whose characteristics include the position (x,y), colour (R,G,B) and depth (z) of the pixel. The aim of the learning process is to calculate the depth (z) of each classification pixel, whose characteristics include its position (x,y) and colour (R,G,B).

For each classification sample, the first stage of the learning algorithm involves identifying the group of training samples that share image characteristics "similar" to those of the classification pixel in question.

Searching for Candidate Training Samples

To identify training samples with characteristics similar to those of the current classification sample, we consider an n-dimensional feature space in which the samples occur. In the preferred embodiment this is a 5-dimensional space, each dimension representing one image characteristic: x, y, R, G and B. The axes of this space are normalised to account for differences in the range of each dimension, so that differences between samples can be expressed as relative percentages. For example, the R component of a given sample may differ by 10% (relative to the absolute range of the R component) from that of a second sample.

The distance between two samples in this space is a measure of their similarity. To detect training samples similar to the current classification sample, a search radius is defined. Any training sample whose distance from the classification sample is less than the search radius is considered similar to the classification sample and is used in the calculation of depth. A simple Euclidean metric is used to measure distances in the n-dimensional search space. For data that does not occupy a large proportion of the n-dimensional feature space, a Mahalanobis distance metric may be used to provide better results. Alternative means of extending the range of the data, such as histogram equalisation or principal component analysis of the RGB, YUV or HSV components, provide similar advantages.

The search radius is a critical parameter for the accurate estimation of depth, and is set relative to the characteristics of the data. For data exhibiting high spatial or temporal autocorrelation, the value of the radius is set smaller than for images with low spatial or temporal autocorrelation.

The search radius may differ for each dimension of the feature space. For example, the search radius along the x axis may differ from that along the axis representing red intensity. Furthermore, the learning process may adapt these parameters to the data within certain user-defined bounds. For example, if no suitable training samples are identified within a spatial radius of 5% and a colour radius of 10%, the spatial radius may be increased to 10%.

Figure 8 shows a typical example of the candidate search process. The figure depicts a 2-dimensional search space, plotting the variation in the spatial x coordinate of the plotted samples against the variation in red intensity. Within this space are a number of training samples 20. No training sample lies within the first search radius 21 of the target pixel 11. The learning process therefore extends its search to a second search radius 22 around the target pixel 11, and identifies three candidate training samples.
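
The radius search and its expansion might be sketched as follows in Python. The per-dimension radii, the expansion factor and the assumption that features are pre-normalised to [0, 1] are all illustrative choices, not taken from the disclosure:

```python
import numpy as np

def find_candidates(query, train_feats, radii, max_expansions=3):
    """Radius search in the normalised feature space (x, y, R, G, B).

    train_feats is assumed pre-scaled to [0, 1] per dimension, so the
    per-dimension radii read as relative percentages (0.05 = 5%). If no
    training sample falls within the radius, the radius is widened and
    the search repeated, as with radii 21 and 22 in Figure 8.
    """
    radii = np.asarray(radii, dtype=float)
    for _ in range(max_expansions):
        # Dividing each dimension by its radius turns the (possibly
        # anisotropic) search ellipsoid into a unit ball.
        dist = np.linalg.norm((train_feats - query) / radii, axis=1)
        inside = np.nonzero(dist <= 1.0)[0]
        if inside.size:
            return inside, dist[inside]
        radii = radii * 2.0          # expand the search radius and retry
    return np.array([], dtype=int), np.array([])

train = np.random.default_rng(1).random((100, 5))     # normalised training samples
idx, d = find_candidates(train[0] + 0.01, train, radii=[0.05, 0.05, 0.1, 0.1, 0.1])
print(idx.size, "candidate training samples found")
```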

Alternative search strategies may be used to identify suitable training candidates. With such strategies, the training data may be stored in structures such as hash trees, k-d trees or n-dimensional Voronoi diagrams. Although these strategies increase the speed with which candidate training samples are identified, they do not affect the nature of the invention.

Similarly, search strategies that exploit the proximity in feature space of subsequently classified samples, by caching training samples, may increase the speed with which candidate training samples are identified, but add nothing substantial to the invention.

Distance-Weighted Learning

To calculate the depth of any given classification sample we require one or more training samples that are considered similar to the classification sample, as described above. We refer to these as "candidate" training samples.

We calculate the depth of the classification sample as a weighted average of the depths of the candidate training samples. The weight of any candidate training sample is relative to its distance from the classification sample in the n-dimensional space. As described above, this distance is normalised, and the data may be corrected for bias using a Mahalanobis metric or principal-component-style analysis.

Figure 9 shows a simplified example of the depth calculation process. As in Figure 8, it depicts a 2-dimensional search space, plotting the variation in the spatial x coordinate of the plotted samples against the variation in red intensity. The candidate training samples 19 shown lie at different distances (labelled w1, w2 and w3) from the target pixel 11. The depth may be calculated as a weighted average of the candidate training samples using the following formula:

D = (D1/w1² + D2/w2² + D3/w3²) / (1/w1² + 1/w2² + 1/w3²)

where D1 is the depth of the training sample at distance w1 from the target pixel 11, D2 is the depth of the training sample at distance w2 from the target pixel, and D3 is the depth of the training sample at distance w3 from the target pixel 11.

In the preferred embodiment, the weights are inversely proportional to the square of the distance in the n-dimensional space.
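
A small sketch of this inverse-square distance weighting (names illustrative):

```python
import numpy as np

def weighted_depth(depths, distances):
    """Weighted average of the candidate depths, with weights inversely
    proportional to the square of the normalised distance."""
    depths = np.asarray(depths, dtype=float)
    w = 1.0 / np.maximum(np.asarray(distances, dtype=float) ** 2, 1e-12)
    return float(np.sum(w * depths) / np.sum(w))

# The three candidates of Figure 9, at distances w1, w2 and w3:
print(weighted_depth([0.2, 0.5, 0.8], [0.1, 0.2, 0.4]))
```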

Alternative Embodiments

In an alternative embodiment, the learning process analyses the entire available set of training data and infers rules that determine the relationship between the image characteristics and the depth of a sample.

In this process the n-dimensional feature space is segmented, or partitioned, into a set of regions. Figure 5 shows a simplified representation of this principle. In this instance the n-dimensional space is partitioned by decision boundaries 23 into a number of rectangular regions. A depth value is assigned to the target pixel 11 according to the region it occupies.

In operation, the M5 model tree algorithm is used to perform the partitioning of the feature space. The M5 algorithm improves on the basic example described above in two ways: the decision boundaries need not be perpendicular to the axes of the feature space, and the depth may vary within each region as a linear function of the image characteristics.
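
M5 model trees are not available in common Python libraries, so the sketch below substitutes a CART regression tree from scikit-learn as a stand-in: it likewise partitions the feature space into regions, although with a constant rather than a linear model in each region, which is precisely the respect in which M5 improves on it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.random((500, 5))      # normalised x, y, R, G, B training samples
z = 0.5 * X[:, 0] + 0.3 * X[:, 2] + 0.05 * rng.standard_normal(500)

# The tree partitions the feature space into regions (cf. Figure 5) and
# assigns a depth to a target pixel according to the region it occupies.
tree = DecisionTreeRegressor(max_depth=5).fit(X, z)
target_pixel = X[:1]
print("assigned depth:", tree.predict(target_pixel))
```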

Those skilled in machine learning will appreciate that many learning schemes may be used in place of the M5 model tree algorithm, including neural networks, decision trees, decision graphs and nearest-neighbour classifiers. The actual nature of the learning algorithm does not affect the novelty of the invention.

In the preferred embodiment, the learning process operates on the image characteristics x, y, R, G and B. Alternative embodiments may operate on higher-level image features, such as larger groups of pixels, or on measurements over groups of pixels such as means and variances, or on edges, corners and the like (i.e. the responses of feature detectors). A larger group of pixels may, for example, represent a segment of the image, being a connected group of pixels forming a homogeneous region.

Phase 2

The second phase operates on an image sequence in which at least one frame has been identified as a key frame. For each key frame it receives 3D stereoscopic data, usually in the form of a depth map. The depth map may result from any process, including, but not limited to, manual specification, the output of the first phase described above, depth determined from stereo images or depth captured directly using a range-finding system. Alternatively, the 3D stereoscopic information may take a form other than a depth map, for example disparity information obtained from key frames comprising stereo pairs.

For every other frame of the 2D image sequence, the present invention provides for the determination of a depth map based on the key frame information originally obtained. It is expected that the key frames will represent only a small proportion of the total number of frames. The invention thus provides a means of greatly reducing the number of depth maps that must be generated at the outset.

Creating the Mapping Function

Once the system has been provided with the key frames and their corresponding depth maps, it analyses the key frames and the corresponding initially obtained depth maps in order to create a mapping function. The mapping function may be any procedure or function that takes as input given measurements of any 2D image, and provides as output a depth map for that image. The mapping is learnt by capturing the relationship between the key frame image data and the depth map data of those images.

The mapping function may take the form of any general processing unit that receives input data, processes it and provides an output. Preferably, the processing unit is subjected to a learning process in which its properties are determined by examining the key frame data and the corresponding depth maps. Such mapping functions are well known in the field of machine learning, and include, but are not limited to, neural networks, decision trees, decision graphs, model trees and nearest-neighbour classifiers.

The system attempts to learn the relationship between the input data and the desired output data. During the learning process, the training algorithm is provided with information from the 2D key frame images. This information may be presented pixel by pixel, providing pixel measurements such as the red, green and blue values, other measurements such as luminance, chrominance and contrast, and spatial measurements such as the horizontal and vertical position in the image. Alternatively, the information may be presented in the form of higher-level image features, such as larger groups of pixels, with measurements over the groups of pixels such as means and variances, or edge and corner values and the like (i.e. the responses of feature detectors). A larger group of pixels may, for example, represent a segment of the image, being a connected group of pixels forming a homogeneous region.

Purely for the purposes of description, a 2D image may be represented in the form x, y, R, G, B, where x and y represent the x and y coordinates of each pixel, and R, G and B represent its red, green and blue values.

The training algorithm is then presented with the corresponding depth maps, from which it can derive the required mapping. The training algorithm is typically presented with individual pixels. However, if higher-level image features are being used, such as larger groups of pixels or segments, the depth map may comprise depth measurements for each group of pixels, such as a mean and variance.

Purely for the purposes of illustration, the depth map may be represented in the form z, x, y, where x and y represent the x and y coordinates of each pixel, and z represents the depth value assigned to the corresponding pixel.

Those working in the field of artificial intelligence will understand this process of learning the relationship between input data and a desired output, and that it may take many forms. The preferred embodiment of the learning algorithm devises a mapping function that minimises some measure of the mapping error.

The learning algorithm attempts to generalise the relationship between the 2D image information and the depth maps present in the key frame examples. This generalisation is then applied to complete the depth maps for the entire sequence. Examples of successful learning algorithms known in the art are backpropagation for learning neural networks, the C4.5 algorithm for learning decision trees, and the K-means algorithm for learning cluster-based classifiers.

Purely for the purposes of description, the learning algorithm may be assumed to calculate the following relationship for each pixel of a 2D image:

z_n = k_a·x_n + k_b·y_n + k_c·R_n + k_d·G_n + k_e·B_n

where

n is the nth pixel of the key frame image;

z_n is the depth value assigned to the pixel at x_n, y_n;

k_a to k_e are constants determined by the algorithm;

R_n is the value of the red component of the pixel at x_n, y_n;

G_n is the value of the green component of the pixel at x_n, y_n;

B_n is the value of the blue component of the pixel at x_n, y_n.

The skilled person will appreciate that the above equation has been simplified for ease of explanation and would not work in practice. In a practical application using, say, a neural network, and given the large number of pixels in an image, the network will learn a large equation containing many k values, multiplications and additions.

This process is illustrated in Figure 3; a similar process may be used with a different number of key frames.

Applying the Mapping Function

The present invention then takes the mapping function and applies it to the set of 2D images for which no depth maps have been obtained. For a given 2D image of the set, the inputs to the mapping function are determined in a manner similar to that used when the mapping function was learnt. For example, if the mapping function was learnt by taking measurements of individual pixels as input, the mapping function will require those same measurements for each pixel of the new image. Given these inputs, the mapping function performs the task it has learnt and outputs a depth measurement. Again, in the example of an individual pixel, the depth measurement may be a simple depth value. In this example the mapping function is applied across the entire image sequence, thereby completing the full set of depth data for the image sequence. Alternatively, if larger groups of pixels were used to train the mapping function, such larger groups of pixels must be generated for the new images. The higher-level measurements over these groups of pixels, such as means and variances, are computed in the same manner as during the learning process. With these inputs established, the mapping function generates the required depth measurement for each group of pixels.

For a 2D image sequence, the key frames having depth maps may be spaced across the sequence in any arbitrary manner. In the preferred embodiment, the mapping function is provided with a set of key frames, together with their corresponding depth maps, spanning a group of 2D images that have something in common. In the simplest case, two key frames are used to train the mapping function, which is then used to determine the depth maps of the 2D images lying between those two key frames. There is, however, no limit on the number of key frames that may be used to train a mapping function. Furthermore, there is no limit on the number of mapping functions used to complete an entire group of 2D images. In the preferred embodiment, two key frames separated by one or more intervening frames are defined as the input to this second phase of the process. The aim of this phase is to assign a depth map to each of the intervening frames. The preferred order for assigning depth maps to the intervening frames is to process first the frame closest in time to a key frame. A frame that has been processed then becomes a key frame for the generation of the next frame's depth map.

The addition of a time variable assists the training function in generalising the information provided by the key frames. In the absence of a time variable, the depth information of two key frames may conflict. This can occur when pixels of similar colour appear in the same spatial region of two key frames but belong to different objects. For example, in the first key frame a green car might be observed at the centre of the image, with depth characteristics placing it in the foreground. By the next key frame the car may have moved on, revealing behind it a green meadow whose depth characteristics define the middle ground. The training algorithm is thus presented with two key frames, each having green pixels at the centre of the image but with different depth characteristics. It is impossible to resolve this conflict, and the mapping function cannot be expected to perform well in such regions. By introducing a time variable, the algorithm is able to resolve the conflict by identifying that a green pixel at the centre of the image is in the foreground at times close to the first key frame of the image sequence. As time advances towards the second key frame, the training algorithm will increasingly identify a green pixel at the centre of the image as belonging to the middle-ground depth of the green meadow.
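
One way to realise this, sketched in Python with assumed data layouts, is simply to append the frame time t to each training sample drawn from the key frames, so the learner sees z = f(x, y, R, G, B, t):

```python
import numpy as np

def keyframe_training_set(keyframes):
    """Build a training set from several key frames, appending the frame
    time t to each sample so the learner sees z = f(x, y, R, G, B, t) and
    can resolve conflicts such as the green car / green meadow example.

    keyframes: iterable of (t, image, depth_map) tuples (assumed layout).
    """
    feats, targets = [], []
    for t, image, depth in keyframes:
        h, w, _ = image.shape
        ys, xs = np.mgrid[0:h, 0:w]
        feats.append(np.column_stack([
            xs.ravel(), ys.ravel(),
            image.reshape(-1, 3).astype(np.float64),
            np.full(h * w, float(t)),          # the time variable
        ]))
        targets.append(depth.ravel())
    return np.vstack(feats), np.concatenate(targets)

kf = [(0, np.zeros((4, 4, 3), np.uint8), np.zeros((4, 4))),
      (10, np.ones((4, 4, 3), np.uint8), np.ones((4, 4)))]
X, z = keyframe_training_set(kf)
print(X.shape, z.shape)   # (32, 6) (32,)
```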

This process is illustrated in the example of Figure 6. The boxes represent individual frames of the image sequence. The top row represents the source frames, numbered according to their relative positions in the image sequence. The bottom row represents the depth maps produced by this phase, the numbers indicating the order in which the depth maps are generated. It will be appreciated that depth frames 1 and 2 may be processed in the reverse order, as may depth frames 3 and 4, and so on. The key frames 7 are provided as the input to the process, as described above. The first depth map to be generated is associated with source frame 1, as shown. The first two depth maps produced are used in generating any subsequent depth maps.

Preferred Embodiment

For each pixel of a frame for which depth is to be determined, the image characteristics of the target pixel are used to determine the depth associated with that pixel. In the preferred embodiment two depth estimates are retrieved, one from each key frame. This process is illustrated in Figure 7, which shows how the target pixel 11 is compared against the closest source key frames 6 before and after the current frame in the image sequence (steps 12 and 13). As described previously, the learning process uses a search radius 14 to identify pixels with similar image characteristics, and uses the depths associated with those pixels (steps 15 and 16) to calculate the depth of the target pixel (steps 17 and 18). Each key frame yields an estimate of the depth of the target pixel; we define these estimates as D1 and D2.

To determine the final depth associated with the target pixel, D1 and D2 must be combined. In the preferred embodiment, a weighted average of these values is calculated using the positions of the key frames as the weighting parameter. If the distance from the current frame to the first key frame is T1, and the distance from the current frame to the second key frame is T2, then the weights are

w1 = 1/T1² and w2 = 1/T2²

and the depth of the target pixel is given by:

D = (w1·D1 + w2·D2) / (w1 + w2)

where D1 and D2 are the depths calculated from key frame 1 and key frame 2 respectively.

In certain cases the learning process is unable to determine a depth value for a given pixel. If, during the above calculation, one of the two key frame depth estimates cannot be determined, the target pixel is assigned the key frame depth estimate that is defined, without weighting. If neither estimate D1 nor D2 is defined, the search radius is enlarged and the process repeated.
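
A minimal sketch of this combination rule, including the unweighted fallback (None stands in for an undetermined estimate; names are illustrative):

```python
def combine_estimates(d1, d2, t1, t2):
    """Combine the two per-key-frame depth estimates with weights
    w = 1/T^2; None stands in for an estimate that could not be
    determined, triggering the unweighted fallback described above."""
    if d1 is None and d2 is None:
        return None              # caller should enlarge the search radius and retry
    if d1 is None:
        return d2                # use the defined estimate without weighting
    if d2 is None:
        return d1
    w1, w2 = 1.0 / t1 ** 2, 1.0 / t2 ** 2
    return (w1 * d1 + w2 * d2) / (w1 + w2)

# A frame lying 3 frames after key frame 1 and 7 frames before key frame 2:
print(combine_estimates(0.4, 0.6, 3, 7))
```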

It should be noted that only one key frame is required to generate the depth maps of any other frames. However, in situations where the depth of an object changes over the image sequence, two or more key frames weighted as described above will provide improved results.

It will be appreciated that the order in which the frames are processed, and the manner in which the results from multiple key frames are combined, may be varied without substantially affecting the nature of the invention.

As in the case of 2D images, it will be appreciated that the training phase may be implicit in an instance-based learning approach, whereby a depth estimate is determined at any pixel of any image in the sequence.

This process is illustrated in Figure 4.

要理解的是,可在阶段2实现与用于阶段1的过程相似的学习过程。两个该映射函数可使用6000字节被写为一文件,则对于该代价,我们得到了20帧的深度信息。这便有效地表示了大小为每帧6000/20=300字节的文件。在实际应用中,该有效的压缩将是重要的。It is to be understood that a learning process similar to that used for Phase 1 can be implemented in Phase 2 . Two of these mapping functions can be written as one file using 6000 bytes, then for that cost we get 20 frames of depth information. This effectively represents a file of size 6000/20 = 300 bytes per frame. In practical applications, this efficient compression will be important.

在另一应用中,以上的压缩可允许3D信息的有效传送,该3D信息嵌在2D图像源,即可与3D图像兼容的2D图像中。由于映射函数需要的文件长度通常是2D图像数据提供给3D信息的很小一部分,所以向2D图像序列添加3D图像可用非常小的系统开销来完成。In another application, the above compression may allow efficient transfer of 3D information embedded in a 2D image source, ie a 2D image compatible with the 3D image. Adding a 3D image to a 2D image sequence can be done with very little overhead since the file length required by the mapping function is usually a small fraction of the 3D information provided by the 2D image data.

In this case the 3D information is generated only at the viewing end, by applying the mapping function to each 2D image in the sequence before, or as, it is viewed. This is possible because the kinds of mapping functions found by machine learning are computationally very efficient to evaluate once they have been trained. The training process itself is generally slow and resource intensive, and is typically performed offline during the creation of the 3D image content. Once trained, the mapping function can be delivered to the viewing end and evaluated at the very high throughput required for real-time conversion of 2D images to 3D.
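A sketch of the viewing-end step, assuming the trained mapping is the linear function z_n = k_a·x_n + k_b·y_n + k_c·R_n + k_d·G_n + k_e·B_n + k_f·T given in the claims; evaluating it is a handful of vectorised multiply-adds per frame, which is why real-time conversion is plausible. The function name and NumPy layout are assumptions.

```python
import numpy as np

def depth_map_from_mapping(frame_rgb, t, k):
    """Evaluate a trained linear mapping function over a whole frame.

    frame_rgb: (H, W, 3) uint8 image
    t:         time index of this frame within the sequence
    k:         coefficients (ka, kb, kc, kd, ke, kf) delivered with the video
    """
    h, w, _ = frame_rgb.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float32)          # pixel coordinates
    r, g, b = frame_rgb.astype(np.float32).transpose(2, 0, 1)
    ka, kb, kc, kd, ke, kf = k
    return ka * x + kb * y + kc * r + kd * g + ke * b + kf * t
```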

The applicants' own earlier disclosures relate to techniques for converting 2D images into stereoscopic 3D images. The conversion process disclosed there involves the generation of a depth map associated with each 2D image. In one embodiment, the depth maps were created by hand, frame by frame. The improvement described in the present application requires depth maps to be created for only a small number of keyframes, with the intermediate depth maps computed. Since the keyframes represent a very small fraction of the total number of frames, this new technique represents a significant improvement in conversion efficiency, in both time and cost.

The particular teaching of the present disclosure is that the invention is applied to the creation of the depth maps themselves, rather than to the generation of the stereoscopic images.

Those skilled in the art will appreciate that depth maps are widely used in the special effects industry in a process known as rotoscoping. In order to composite live-action or computer-generated imagery into a 2D image sequence, a depth map or matte must normally be produced by hand for each 2D image frame. These mattes allow the additional imagery to be composited so that it appears to move with the approximate geometry of the original 2D images. The invention described above enables such mattes to be produced rapidly.
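As a small illustration of how a depth map serves as a compositing matte, the sketch below builds a binary matte that lets an inserted element be occluded by anything nearer than its assigned depth. It assumes, purely for this sketch, that larger depth values denote objects nearer the camera; the function name is hypothetical.

```python
import numpy as np

def matte_from_depth(depth_map, insert_depth):
    """Binary matte for compositing an element at a given depth.

    Returns 1.0 where the original scene is nearer than insert_depth
    (those pixels occlude the inserted element) and 0.0 elsewhere,
    assuming larger depth values mean nearer objects.
    """
    return (depth_map > insert_depth).astype(np.float32)

# Pixels where the matte is 1 keep the original scene (foreground
# occludes); the inserted element is drawn only where the matte is 0.
```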

It is also known to develop cameras capable of obtaining depth maps directly from live scenes. Typically these employ laser rangefinding technology and are commonly referred to as LIDAR devices. Capturing depth maps at full television frame rate requires an expensive and complex system. Application of the present invention allows simpler and relatively inexpensive LIDAR devices to be designed that need only capture depth maps at a fraction of the video field rate, or at other irregular intervals, with the missing depth maps generated by interpolation using the techniques described in this invention.
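A sketch of how sparse LIDAR captures might be filled in. For brevity it blends the two surrounding captures using only the inverse-square temporal weighting described earlier, whereas the full technique in the text also matches per-pixel image characteristics; the capture interval and function shape are assumptions.

```python
import numpy as np

def fill_missing_depth(captured, interval, n_frames):
    """Interpolate depth maps between sparse LIDAR captures.

    captured:  dict of frame index -> (H, W) depth array, populated only
               every `interval` frames by the LIDAR unit
    n_frames:  total frame count; assumed to end on a captured frame
    """
    filled = []
    for f in range(n_frames):
        if f in captured:
            filled.append(captured[f])       # measured directly by LIDAR
            continue
        f1 = (f // interval) * interval      # previous captured frame
        f2 = f1 + interval                   # next captured frame
        w1, w2 = 1.0 / (f - f1) ** 2, 1.0 / (f2 - f) ** 2
        filled.append((w1 * captured[f1] + w2 * captured[f2]) / (w1 + w2))
    return filled
```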


Claims (38)

1. A method of creating a depth map, including the steps of:
assigning a depth to at least one pixel or portion of an image;
determining the relative position and image characteristics of each said at least one pixel or portion of said image;
utilising said depth, image characteristics and respective relative positions to determine a configuration of a first algorithm to ascertain depth characteristics as a function of relative position and image characteristics;
utilising said first algorithm to calculate a depth characteristic for each pixel or portion of said image;
wherein said depth characteristics form a depth map of said image.

2. A method of creating a depth map, including the steps of:
assigning a depth to at least one pixel or portion of an image;
determining the x,y coordinates and image characteristics of each said at least one pixel or portion of said image;
utilising said depth, image characteristics and respective x,y coordinates to determine a first algorithm to ascertain depth characteristics as a function of x,y coordinates and image characteristics;
utilising said first algorithm to calculate a depth characteristic for each pixel or portion of said image;
wherein said depth characteristics form a depth map of said image.

3. A method as claimed in claim 1, wherein said image characteristics include RGB values.

4. A method as claimed in any preceding claim, further including the step of reassigning a depth to any pixel or portion of the image so as to correct any inconsistencies.

5. A method as claimed in any preceding claim, wherein said image characteristics include any one of luminance, chrominance, contrast or spatial measurements.

6. A method as claimed in any preceding claim, wherein said first algorithm can be represented by the equation:

$$z = f(x, y, R, G, B)$$

where x and y define the relative position of a sample.

7. A method as claimed in any preceding claim, wherein a learning algorithm is used to determine the configuration of said first algorithm.

8. A method as claimed in claim 7, wherein for each pixel of the image the learning algorithm calculates:

$$z_n = k_a x_n + k_b y_n + k_c R_n + k_d G_n + k_e B_n$$

where
n is the nth pixel of the keyframe image,
z_n is the depth value assigned to the pixel at x_n, y_n,
k_a to k_e are constants determined by the algorithm,
R_n is the red component value of the pixel at x_n, y_n,
G_n is the green component value of the pixel at x_n, y_n,
B_n is the blue component value of the pixel at x_n, y_n.

9. A method as claimed in claim 7 or 8, wherein a random component is introduced into the learning algorithm to reduce overtraining.

10. A method as claimed in claim 9, wherein said random component is a small positive or negative random number.

11. A method as claimed in any one of claims 7 to 10, wherein the learning algorithm initially identifies pixels with characteristics similar to those of known pixels.

12. A method as claimed in claim 11, wherein similar pixels are searched for within a search radius.

13. A method as claimed in claim 12, wherein said search radius may vary for each characteristic.

14. A method as claimed in any one of claims 11 to 13, wherein the depth of a pixel is determined by a weighted average based on the distances to similar pixels.

15. A method as claimed in claim 14, wherein the weighting is inversely proportional to the distance.

16. A method as claimed in claim 7, wherein each characteristic is segmented or divided into a set of regions, and a depth value is assigned according to the region occupied.

17. A method of creating a series of depth maps for an image sequence, including the steps of:
receiving a depth map for at least one frame of said image sequence;
utilising said at least one depth map to determine a second configuration of a second algorithm to ascertain depth characteristics as a function of relative position and image characteristics;
utilising said algorithm to create a depth map for each frame of said image sequence.

18. A method of creating a series of depth maps for an image sequence, including the steps of:
receiving a depth map for at least one frame of said image sequence;
utilising said at least one depth map to determine a second algorithm to ascertain depth characteristics as a function of x,y coordinates and image characteristics;
utilising said algorithm to create a depth map for each frame of said image sequence.

19. A method as claimed in claim 17 or claim 18, wherein at least two depth maps corresponding to at least two frames of said image sequence are received.

20. A method as claimed in any one of claims 17 to 19, wherein said image characteristics include RGB values.

21. A method as claimed in any one of claims 17 to 20, wherein said image characteristics include at least one of luminance, chrominance, contrast or spatial measurements.

22. A method as claimed in any one of claims 17 to 21, wherein a learning algorithm is used to determine the configuration of said second algorithm.

23. A method as claimed in claim 22, wherein said learning algorithm is one of a back-propagation algorithm, a C4.5 algorithm or a K-means algorithm.

24. A method as claimed in claim 22 or 23, wherein said second algorithm calculates:

$$z_n = k_a x_n + k_b y_n + k_c R_n + k_d G_n + k_e B_n$$

where
n is the nth pixel of the keyframe image,
z_n is the depth value assigned to the pixel at x_n, y_n,
k_a to k_e are constants determined by the algorithm,
R_n is the red component value of the pixel at x_n, y_n,
G_n is the green component value of the pixel at x_n, y_n,
B_n is the blue component value of the pixel at x_n, y_n.

25. A method as claimed in any one of claims 17 to 24, wherein an additional algorithm configuration is created for each pair of frames for which depth maps have been received.

26. A method of creating a series of depth maps for an image sequence, including the steps of:
receiving depth maps for at least two keyframes of said image sequence;
utilising said depth maps to determine a second algorithm to ascertain depth characteristics as a function of x,y coordinates and image characteristics;
utilising said algorithm to create a depth map for each frame of said image sequence, wherein frames adjacent to said keyframes are processed before non-adjacent frames.

27. A method as claimed in claim 26, wherein an adjacent frame, once processed, is treated as a keyframe for the creation of further depth maps.

28. A method as claimed in claim 22, 23, 26 or 27, wherein said second algorithm calculates:

$$z_n = k_a x_n + k_b y_n + k_c R_n + k_d G_n + k_e B_n + k_f T$$

where
n is the nth pixel of the image,
z_n is the depth value assigned to the pixel at x_n, y_n,
k_a to k_f are constants previously determined by the algorithm,
R_n is the red component value of the pixel at x_n, y_n,
G_n is the green component value of the pixel at x_n, y_n,
B_n is the blue component value of the pixel at x_n, y_n,
T is a time measurement of the particular frame within the sequence.

29. A method of creating a series of depth maps for an image sequence, including the steps of:
selecting at least one keyframe from said image sequence;
for each said at least one keyframe, assigning a depth to at least one pixel or portion of that frame;
determining the relative position and image characteristics of each said at least one pixel or portion of each said keyframe;
utilising said depths, image characteristics and respective relative positions of each said at least one keyframe to determine a first configuration of a first algorithm for each said at least one keyframe, to ascertain depth characteristics as a function of relative position and image characteristics;
utilising said first algorithm to calculate a depth characteristic for each pixel or portion of each said at least one keyframe;
wherein said depth characteristics form a depth map for each said at least one keyframe;
utilising each depth map to determine a second configuration of a second algorithm to ascertain depth characteristics for each frame as a function of relative position and image characteristics;
utilising said second algorithm to create a respective depth map for each frame of said image sequence.

30. A method as claimed in claim 29, wherein frames adjacent to said keyframes are processed before non-adjacent frames.

31. A method as claimed in claim 30, wherein an adjacent frame, once processed, is treated as a keyframe for further processing.

32. A method of encoding a series of frames, including transmitting at least one mapping function together with said frames, wherein said mapping function includes an algorithm for ascertaining depth characteristics as a function of relative position and image characteristics.

33. A method as claimed in claim 32, wherein said image characteristics include RGB values.

34. A method as claimed in claim 32 or 33, wherein said image characteristics include at least one of luminance, chrominance, contrast or spatial measurements.

35. A method as claimed in any one of claims 32 to 34, wherein a learning algorithm is used to determine said mapping function.

36. A method as claimed in claim 35, wherein said learning algorithm is one of a back-propagation algorithm, a C4.5 algorithm or a K-means algorithm.

37. A method as claimed in claim 35 or 36, wherein said mapping function calculates:

$$z_n = k_a x_n + k_b y_n + k_c R_n + k_d G_n + k_e B_n$$

where
n is the nth pixel of the keyframe image,
z_n is the depth value assigned to the pixel at x_n, y_n,
k_a to k_e are constants determined by the algorithm,
R_n is the red component value of the pixel at x_n, y_n,
G_n is the green component value of the pixel at x_n, y_n,
B_n is the blue component value of the pixel at x_n, y_n.

38. A method as claimed in any one of claims 32 to 37, wherein an additional algorithm is created for each pair of frames for which depth maps have been received.



