
TWI748426B - Method, system and computer program product for generating depth maps of monocular video frames - Google Patents

Method, system and computer program product for generating depth maps of monocular video frames

Info

Publication number
TWI748426B
Authority
TW
Taiwan
Prior art keywords
feature
depth map
converter
maps
pictures
Prior art date
Application number
TW109114069A
Other languages
Chinese (zh)
Other versions
TW202141973A (en)
Inventor
楊家輝
姚德威
Original Assignee
國立成功大學
Priority date
Filing date
Publication date
Application filed by 國立成功大學 filed Critical 國立成功大學
Priority to TW109114069A priority Critical patent/TWI748426B/en
Publication of TW202141973A publication Critical patent/TW202141973A/en
Application granted granted Critical
Publication of TWI748426B publication Critical patent/TWI748426B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The proposed invention targets an efficient 2D-to-3D conversion system that helps stereo video content providers convert 2D content into 3D content at a low labor cost. With an AI-based keyframe selection network and an AI-based depth map generation network for all 2D image frames, the proposed system generates the corresponding depth maps to achieve high-efficiency, high-quality and low-cost 2D-to-3D video conversion. Furthermore, the proposed system can be applied to all 2D videos, including animated and natural scenes, and to frames of any resolution and size.

Description

Single-view image depth map sequence generation method, system and computer program product

The present invention relates to a technique for converting two-dimensional (2D) images into three-dimensional (3D) images.

With the rapid development of 3D display technology in recent years, generating rich 3D image content has become an important issue. Abundant 2D video exists today, yet 3D content remains scarce, because producing 3D video requires shooting with multiple cameras: the cameras are expensive, the shooting is difficult and laborious, and complex calibration is required. Converting existing 2D video into 3D video is therefore the most practical solution. In the existing 2D-to-3D workflow, a human operator selects keyframes at positions where the RGB pixels change significantly between consecutive frames, uses segmentation-based post-production tools to create the depth maps of those keyframes, and then interpolates all depth maps between keyframes from the surrounding cues, based on block motion directions and block matching. However, keyframes selected from pixel differences, together with block-motion or block-matching prediction, are sensitive to large motion between frames: when the motion is too large, an excessive number of keyframes is selected and serious prediction errors occur. To avoid this problem, the prior art cannot allow too long an interval between the selected frames, so roughly 30% to 50% of all frames must be labeled manually (that is, assuming 30 frames per second, 9 to 15 frames must be labeled every second), which is an enormous labor cost. In addition, pixel-level matching and motion prediction ignore surrounding context and perform poorly, while block-based prediction tends to fail in regions with gradient changes or object deformation.

An embodiment of the present invention provides a method for generating a single-view image depth map sequence, which includes: obtaining a plurality of frames and converting each frame into a feature map with a feature converter; performing an unsupervised clustering algorithm on the feature maps to divide them into a plurality of groups, and taking the frames corresponding to the feature map cluster centers of the groups as keyframes; providing a user interface to obtain depth maps of the keyframes; initializing a depth map generation network according to the feature converter, and training the depth map generation network with the keyframes and their depth maps; and inputting the frames other than the keyframes into the depth map generation network to compute their corresponding depth maps.

In some embodiments, the method further includes training an autoencoder on the frames. The autoencoder includes the feature converter and a feature inverse converter, and the feature converter and the feature inverse converter each include a corresponding neural network.

In some embodiments, the method further includes: setting a plurality of candidate cluster numbers; for each candidate number, performing the unsupervised clustering algorithm on the feature maps and computing a silhouette coefficient; and setting the candidate number corresponding to the largest silhouette coefficient as the cluster number of the unsupervised clustering algorithm, which equals the number of keyframes.

In some embodiments, the aforementioned unsupervised clustering algorithm is the k-means algorithm.

In some embodiments, the depth map generation network includes a spatio-temporal similarity operation that receives a main feature map and a plurality of auxiliary feature maps temporally adjacent to the main feature map. For a main feature point in the main feature map, the spatio-temporal similarity operation takes a plurality of auxiliary feature points within a preset range of the auxiliary feature maps and computes the similarity between the main feature point and the auxiliary feature points to compensate the main feature point.

From another perspective, an embodiment of the present invention also provides a computer program product, which is loaded and executed by a computer system to perform the above method for generating a single-view image depth map sequence.

From another perspective, an embodiment of the present invention also provides a single-view image depth map sequence generation system, which includes a memory and a processor. The memory stores a plurality of instructions, and the processor executes these instructions to perform the above method for generating a single-view image depth map sequence.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

100: single-view image depth map sequence generation system
110: processor
120: memory
200: image sequence
201, 202: selected frames
310: keyframe extraction system
320, 330: steps
311: keyframes
312: frames
321: depth maps
340: depth map generation network
341: depth maps
410: feature converter
420: feature inverse converter
421: image sequence
510: feature maps
511: feature map cluster centers
520: step
610: feature converter
620: depth generator
701~703: input frames
710: feature extraction convolutional neural network
711~713: feature maps
714: feature point
715: preset range
720: spatio-temporal similarity operation
730: convolutional neural network

FIG. 1 is a schematic diagram of a single-view image depth map sequence generation system according to an embodiment.

FIG. 2 is a schematic diagram of multiple frames in an image sequence according to an embodiment.

FIG. 3 is a schematic flowchart of a method for generating a single-view image depth map sequence according to an embodiment.

FIG. 4 is a schematic diagram of an autoencoder according to an embodiment.

FIG. 5 is a schematic diagram of an unsupervised clustering algorithm according to an embodiment.

FIG. 6 is a schematic diagram of a depth map generation network according to an embodiment.

FIG. 7 is a schematic diagram of a spatio-temporal similarity operation according to an embodiment.

FIG. 8 is a schematic diagram of depth map generation results according to an embodiment.

FIG. 1 is a schematic diagram of a single-view image depth map sequence generation system according to an embodiment. Referring to FIG. 1, the single-view image depth map sequence generation system 100 may be a smart phone, a tablet computer, a personal computer, a notebook computer, a server, an industrial computer, or any electronic device with computing capability; the present disclosure is not limited in this respect. The system 100 includes a processor 110 and a memory 120, and the processor 110 is electrically connected to the memory 120. The processor 110 may be a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor, a microcontroller, a digital signal processor, an image processing chip, an application-specific integrated circuit, and so on; the memory 120 may be volatile or non-volatile memory. The memory 120 stores a plurality of instructions, and the processor 110 executes these instructions to perform a method for generating a single-view image depth map sequence, which produces the depth map sequence corresponding to the frames of an image sequence. For example, refer to FIG. 2, which is a schematic diagram of multiple frames in an image sequence according to an embodiment. In this embodiment the content of the image sequence 200 is an animation, but in other embodiments it may also be a natural scene; the present disclosure does not limit the content of the image sequence 200. In this embodiment every frame is in color, that is, it includes red, green and blue channels, but in other embodiments every frame may also be grayscale.

FIG. 3 is a schematic flowchart of a method for generating a single-view image depth map sequence according to an embodiment. Referring to FIG. 3, the keyframe extraction system 310 is a software module that obtains a small number of keyframes 311 from the image sequence 200; these keyframes 311 are the most characteristic frames of the image sequence 200, and how the keyframe extraction system 310 obtains them is explained in later paragraphs. In step 320, the user labels the depth values corresponding to these keyframes 311 through the provided user interface, thereby obtaining a depth map 321 for each keyframe. In some embodiments, the user interface lets the user decide which objects are in front and which are behind, or lets the user assign depth values directly; the present disclosure does not limit the content of the user interface. In step 330, a depth map generation network 340 is trained according to the keyframes 311 and their depth maps 321. The frames 312 of the image sequence 200 other than the keyframes 311 can then be input into the depth map generation network 340 to produce the corresponding depth maps 341. The keyframe extraction system 310 is described first below.

Referring to FIG. 4, the keyframe extraction system 310 includes a feature converter 410 and a corresponding feature inverse converter 420. The feature converter 410 compresses each frame of the image sequence 200 into a lower-dimensional feature map 510, and the feature map 510 can be decompressed by the feature inverse converter 420 to reconstruct an image sequence 421. The feature converter 410 and the feature inverse converter 420 each include their own neural network, and together they may be referred to as an autoencoder. In this embodiment, each frame of the image sequence 200 has height H and width W, where H and W are positive integers, and each frame includes three channels (red, green and blue). The feature converter 410 includes 7 convolution layers and 7 pooling layers; for example, "Conv.Block_1" in FIG. 4 represents one convolution layer followed by one pooling layer, and so on. FIG. 4 also shows the size of each intermediate feature map, namely its reduced height and width (expressed as fractions of H and W in the figure) and its number of channels (for example, 32 channels after the first block). After these convolution and pooling layers, a feature map 510 with height H/128, width W/128 and 512 channels is obtained and output to the feature inverse converter 420. The feature inverse converter 420 reconstructs, from these feature maps 510, a reconstructed image sequence 421 with height H and width W. The frames of the image sequence 421 should approximate the original frames of the image sequence 200, so the loss computed between each pair of corresponding frames is used to update the weights of the feature converter 410 and the feature inverse converter 420 of the autoencoder; training is complete when the frames of the image sequence 421 are very close or identical to those of the image sequence 200. Autoencoders are well understood by those of ordinary skill in the art and are not described in further detail here. It is worth noting that the autoencoder is trained on the frames of the image sequence 200, but this training is unsupervised and requires no user intervention. In addition, the 7-block convolutional architecture in FIG. 4 is only an example; convolutional neural networks with other numbers of layers and other architectures may be used in other embodiments.
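For illustration only, the autoencoder of FIG. 4 could be sketched as follows. This is a minimal sketch assuming a PyTorch implementation; the intermediate channel widths are illustrative assumptions, since the description above only fixes 7 convolution/pooling blocks and a final feature map of size H/128 x W/128 x 512.

```python
# Minimal sketch of the FIG. 4 autoencoder (feature converter 410 + inverse converter 420).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # one convolution layer followed by one pooling layer ("Conv.Block_i")
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                       # halves height and width
    )

class FeatureConverter(nn.Module):             # element 410
    def __init__(self):
        super().__init__()
        widths = [3, 32, 64, 64, 128, 256, 512, 512]   # 7 blocks -> H/128 x W/128 x 512
        self.blocks = nn.Sequential(*[conv_block(widths[i], widths[i + 1]) for i in range(7)])
    def forward(self, x):
        return self.blocks(x)

class FeatureInverseConverter(nn.Module):      # element 420
    def __init__(self):
        super().__init__()
        widths = [512, 512, 256, 128, 64, 64, 32, 3]
        layers = []
        for i in range(7):
            layers += [nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                       nn.Conv2d(widths[i], widths[i + 1], kernel_size=3, padding=1)]
            if i < 6:
                layers.append(nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*layers)
    def forward(self, z):
        return self.blocks(z)

# Unsupervised training on the frames of image sequence 200 (stand-in batch below).
encoder, decoder = FeatureConverter(), FeatureInverseConverter()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
frames = torch.rand(4, 3, 256, 256)            # placeholder for real H x W RGB frames
recon = decoder(encoder(frames))
loss = nn.functional.mse_loss(recon, frames)   # reconstruction loss between sequences 421 and 200
loss.backward()
optimizer.step()
```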

FIG. 5 is a schematic diagram of an unsupervised clustering algorithm according to an embodiment. Referring to FIG. 4 and FIG. 5, after the autoencoder has been trained, the feature converter 410 is taken from the autoencoder, and each frame of the image sequence 200 is converted into its corresponding feature map 510 by the feature converter 410. Next, in step 520, an unsupervised clustering algorithm is applied to the feature maps 510 to obtain the feature map cluster centers 511, and the frames corresponding to them are selected as the keyframes 311. For example, the unsupervised clustering algorithm is the k-means algorithm, which divides the feature maps 510 into a plurality of groups; in FIG. 5 every particle corresponds to the feature map of one frame, every group has a feature map cluster center 511, and the frames corresponding to these cluster centers 511 serve as the keyframes 311.
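The keyframe selection of step 520 can be sketched as below, assuming scikit-learn's KMeans. Flattening each feature map into a vector before clustering, and picking the frame whose feature map lies closest to each cluster center, are implementation assumptions rather than details fixed by the patent.

```python
# Sketch of step 520: cluster the feature maps 510 and take the frames nearest
# to the cluster centers 511 as keyframes 311.
import numpy as np
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def select_keyframes(frames, feature_converter, k):
    # frames: tensor of shape (N, 3, H, W); feature_converter: element 410
    feats = feature_converter(frames)                   # (N, 512, H/128, W/128)
    vectors = feats.flatten(start_dim=1).cpu().numpy()  # one vector per frame
    kmeans = KMeans(n_clusters=k, n_init=10).fit(vectors)
    keyframe_indices = []
    for center in kmeans.cluster_centers_:              # elements 511
        distances = np.linalg.norm(vectors - center, axis=1)
        keyframe_indices.append(int(distances.argmin()))
    return sorted(set(keyframe_indices)), kmeans.labels_
```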

In some embodiments, the number of clusters of the unsupervised clustering algorithm (i.e., the positive integer k of the k-means algorithm) can also be computed automatically. Specifically, a plurality of candidate cluster numbers are set first, for example from 2 to N, where N is any suitable positive integer. Each candidate number is used as the positive integer k to run the k-means algorithm, and a silhouette coefficient is computed from the clustering result. The silhouette coefficient is computed as in Equations 1 to 3 below.

[Equation 1] s(i) = (b(i) - a(i)) / max{a(i), b(i)}

[Equation 2] a(i) = (1 / (|C_i| - 1)) Σ_{j ∈ C_i, j ≠ i} d(i, j)

[Equation 3] b(i) = min_{C_k ≠ C_i} (1 / |C_k|) Σ_{j ∈ C_k} d(i, j)

Here i and j are positive integers, and d(i, j) denotes the distance between the i-th feature map and the j-th feature map, for example the Euclidean distance. C_i denotes the group to which the i-th feature map belongs, and C_k denotes the set of all feature points in a group different from that of i. Equation 2 expresses the intra-cluster similarity, i.e., the average distance a(i) between feature points within the same cluster; Equation 3 expresses the inter-cluster similarity, i.e., the average distance b(i) to the feature points of other clusters. Equations 1 to 3 are evaluated for every feature map, and the average of all s(i) is then taken as the silhouette coefficient. A larger silhouette coefficient indicates a better clustering, so the candidate cluster number corresponding to the largest silhouette coefficient is set as the value of k, also called the number of clusters of the unsupervised clustering algorithm, and this number equals the number of keyframes 311.

Referring to FIG. 2, the selected frames 201 and 202 are the keyframes. It can be seen that the image sequence 200 mainly contains two characters: the selected frame 201 is a scene with one character, and the selected frame 202 is a scene in which both characters appear. The selected frames 201 and 202 are sufficient to describe the depth relationships of the scenes.

Referring back to FIG. 3, after the keyframes 311 are obtained, the depth maps 321 of the keyframes can be obtained through the user interface. The depth map generation network 340 is described in detail below. FIG. 6 is a schematic diagram of a depth map generation network according to an embodiment. Referring to FIG. 6, the depth map generation network 340 includes a feature converter 610 and a depth generator 620. It is worth noting that the feature converter 610 is a part of the feature converter 410 of FIG. 4: the first five feature maps of the feature converter 410 have the same width, height and number of channels as the five feature maps of the feature converter 610. Therefore, in some embodiments the network parameters of the feature converter 610 can be initialized from the network parameters of the feature converter 410, that is, they are initially set to the same network parameters. After initialization, the depth map generation network 340 is trained according to the keyframes 311 and their depth maps 321.
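The initialization of the feature converter 610 from the pretrained feature converter 410 could look like the following sketch, assuming the PyTorch modules of the earlier autoencoder sketch; the `blocks` attribute is a placeholder name, not an identifier defined by the patent.

```python
# Sketch of initializing feature converter 610 from the first five blocks of the
# pretrained feature converter 410 (their widths, heights and channel counts match).
import copy

def init_depth_encoder_from_autoencoder(feature_converter_410, feature_converter_610):
    for i in range(5):                                   # first 5 Conv.Blocks match
        src = feature_converter_410.blocks[i].state_dict()
        feature_converter_610.blocks[i].load_state_dict(copy.deepcopy(src))
    return feature_converter_610
```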

It is worth noting that, in FIG. 6, labels such as "Conv_Block_1" represent a convolution layer and a pooling layer, whereas "Conv_Block_3 Spatial-Temporal Similarity Block" indicates that a spatio-temporal similarity operation is used in addition to the convolution and pooling layers. The spatio-temporal similarity operation compensates the current frame according to the similarity of temporally adjacent frames, so a total of T frames are input into the depth map generation network 340, where T is a positive integer, for example 3. The T frames all pass through the convolution and pooling layers to obtain T feature maps. In the spatio-temporal similarity operation, when a main feature point of one feature map is processed, auxiliary feature points within a preset range of the main feature map and the adjacent feature maps are taken to compensate the current feature point. The temporal context before and after the frame is thus taken into account, which makes the finally generated depth maps more consistent over time, increases accuracy, and removes flicker between frames. Specifically, refer to FIG. 7, which is a schematic diagram of the spatio-temporal similarity operation according to an embodiment. In both the training stage and the inference stage, the T frames 701-703 are temporally adjacent frames, the middle frame 702 is the main frame currently being trained (or predicted), and the frames 701 and 703 are adjacent auxiliary frames. These frames 701-703 pass through a feature extraction convolutional neural network 710, for example "Conv._Block_1" and "Conv._Block_2" in FIG. 6, after which the corresponding feature maps 711-713 are obtained; the feature map 712 is the main feature map currently being processed, and the feature maps 711 and 713 are called auxiliary feature maps. The spatio-temporal similarity operation 720 receives the feature maps 711-713; for a feature point 714 of the main feature map 712, it takes a plurality of auxiliary feature points within a preset range 715 of the main feature map 712 and the auxiliary feature maps 711 and 713, and computes the similarity between the feature point 714 and these feature points to compensate the feature point 714. The present disclosure does not limit the size of the preset range 715. The compensation can be expressed as Equations 4 to 7 below.

[Equation 4] z_i = h(y_i) + x_i

[Equation 5] y_i = (1 / C(x)) Σ_{x_j ∈ A} f(x_i, x_j) g(x_j)

[Equation 6] C(x) = Σ_{x_j ∈ A} f(x_i, x_j)

[Equation 7] f(x_i, x_j) = exp(x_i^T x_j)

Here x_i denotes the feature point 714, z_i is the feature map after the compensation has been added, and y_i is the similarity-weighted feature map. h(y_i) and g(x_j) can be regarded as neural-network non-linear transforms that respectively raise and reduce the dimensionality of the similarity-weighted feature map y_i and of x_j; they may also be general linear transforms of the feature maps, such as h(y_i) = W_z y_i and g(x_j) = W_g x_j with weight matrices W_z and W_g. The present invention does not limit them to non-linear neural networks or to linear transforms. f(x_i, x_j) is the similarity weight of the feature maps, and C(x) is the normalization value of the weights. Equation 7 computes the similarity between two feature points, but Equation 7 is only an example; other similarity measures may be used in other embodiments.

A denotes the set of all feature points within the preset range 715 of the feature maps 711, 712 and 713, and x_j denotes a feature point within the preset range 715. h(y_i) and g(x_j) are non-linear or linear dimensionality transforms that can be used to lower the complexity of the feature-point similarity computation; in some embodiments they may also be omitted. For a continuous image sequence, whose scene backgrounds are largely similar from frame to frame, this embodiment computes the similarity only for feature points within the preset range 715, which effectively reduces the interference of redundant background features; compared with the prior art, this not only reduces the amount of computation but also improves prediction accuracy and temporal consistency.
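A minimal sketch of the spatio-temporal similarity operation 720 is given below, assuming a PyTorch non-local-style implementation of Equations 4 to 7 with the dot-product-exponential similarity (which the text above presents only as an example), with g and h realized as 1x1 convolutions and the preset range 715 realized as a fixed window; the window size and channel reduction are assumptions.

```python
# Sketch of the spatio-temporal similarity operation 720 (Equations 4-7):
# each feature point of the main feature map 712 is compensated by similar
# feature points taken from a preset range 715 of maps 711, 712 and 713.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalSimilarity(nn.Module):
    def __init__(self, channels, reduced=None, window=7):
        super().__init__()
        reduced = reduced or channels // 2
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)   # g(x_j): dimension reduction
        self.h = nn.Conv2d(reduced, channels, kernel_size=1)   # h(y_i): dimension raising
        self.window = window                                   # preset range 715 (window x window)

    def forward(self, main, auxiliaries):
        # main: (B, C, H, W) feature map 712; auxiliaries: list of (B, C, H, W) maps 711, 713
        B, C, H, W = main.shape
        maps = [main] + list(auxiliaries)                      # search the main and auxiliary maps
        pad = self.window // 2
        q = main.flatten(2)                                    # (B, C, H*W), the points x_i
        sims, values = [], []
        for m in maps:
            g_m = self.g(m)
            k = F.unfold(m, self.window, padding=pad).view(B, C, -1, H * W)
            v = F.unfold(g_m, self.window, padding=pad).view(B, g_m.shape[1], -1, H * W)
            sims.append(torch.einsum("bcl,bcnl->bnl", q, k))   # x_i^T x_j for each neighbour
            values.append(v)
        sim = torch.cat(sims, dim=1)                           # (B, T*w*w, H*W)
        val = torch.cat(values, dim=2)                         # (B, C', T*w*w, H*W)
        weights = F.softmax(sim, dim=1)                        # f / C(x): normalised exp(x_i^T x_j)
        y = torch.einsum("bnl,bcnl->bcl", weights, val)        # Eq. 5: similarity-weighted features
        y = y.view(B, -1, H, W)
        return main + self.h(y)                                # Eq. 4: z_i = h(y_i) + x_i
```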

After every feature point of the main feature map 712 has been compensated, the compensated feature points are combined with the original, uncompensated features and passed to the subsequent convolutional neural network 730, for example the convolution and pooling layers of "Conv._Block_5" in FIG. 6. The manner of combining the features is not limited here; any combination such as addition, concatenation or multiplication may be used. The spatio-temporal similarity operation can be added to any convolution layer of the feature converter 610; in the embodiment of FIG. 6 two convolution layers use the spatio-temporal similarity operation, but in other embodiments it may be used in more or fewer convolution layers.

To speed up the above procedure, the feature maps of the auxiliary frames can be released after the spatio-temporal similarity operation; this is labeled "Discard Auxiliary Frames" in FIG. 6. Therefore, the convolution layer of "Conv.Block_5" and the subsequent convolution layers no longer need the feature maps of the auxiliary frames. The feature maps compensated by the spatio-temporal similarity operation are then converted by the depth generator 620 to generate the depth map of the main frame.

With the above method, the depth map generation network 340 can be trained and then used to generate the depth maps of the frames other than the keyframes; these results are shown in FIG. 8. In this embodiment the user only has to label the depth maps of the keyframes; in practical tests only about 1% to 3% of the frames need to be labeled, which greatly reduces the labor cost compared with the prior art, and the quality is also better than that of other known methods. The generated depth maps can be used to synthesize stereoscopic or multi-view 3D video, for example with a common depth-image-based rendering (DIBR) algorithm; the present disclosure does not limit how the stereoscopic or multi-view 3D video is synthesized.
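For illustration only, a DIBR-style synthesis step could be sketched as below: each pixel is shifted horizontally by a disparity derived from its depth value to form one view of a stereo pair. The depth convention (larger value = closer), the maximum disparity and the naive hole filling are simplified assumptions, not part of the patented method.

```python
# Toy sketch of depth-image-based rendering (DIBR) using a generated depth map.
import numpy as np

def dibr_right_view(rgb, depth, max_disparity=16):
    # rgb: (H, W, 3) uint8 frame; depth: (H, W) in [0, 1], larger = closer (assumption)
    H, W, _ = rgb.shape
    disparity = (depth * max_disparity).astype(np.int32)
    right = np.zeros_like(rgb)
    filled = np.zeros((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            xr = x - disparity[y, x]          # shift pixel horizontally by its disparity
            if 0 <= xr < W:
                right[y, xr] = rgb[y, x]
                filled[y, xr] = True
    for y in range(H):                        # naive hole filling from the left neighbour
        for x in range(1, W):
            if not filled[y, x]:
                right[y, x] = right[y, x - 1]
    return right
```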

One contribution of the present disclosure is the use of a convolutional neural network to search for keyframes, thereby obtaining from the image sequence the most representative frames covering all the objects that appear; the other frames can be regarded as slight deformations of the keyframes. To improve the robustness of the clustering, this technique first projects the frames, through the pre-trained feature converter, into a lower spatial dimension with a larger number of channel features, so that keyframe selection can be carried out more efficiently.

From another perspective, the present invention also provides a computer program product, which may be written in any programming language and/or for any platform; when the computer program product is loaded into a computer system and executed, it performs the above method for generating a single-view image depth map sequence.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary knowledge in the relevant technical field may make slight changes and refinements without departing from the spirit and scope of the present invention; the scope of protection of the present invention is therefore defined by the appended claims.

200: image sequence
310: keyframe extraction system
320, 330: steps
311: keyframes
312: frames
321, 341: depth maps
340: depth map generation network

Claims (6)

1. A method for generating a single-view image depth map sequence, comprising: obtaining a plurality of frames and converting each of the frames into a feature map according to a feature converter; training an autoencoder according to the frames, wherein the autoencoder comprises the feature converter and a feature inverse converter, and the feature converter and the feature inverse converter each comprise a corresponding neural network; performing an unsupervised clustering algorithm on the feature maps to divide the feature maps into a plurality of groups, and obtaining the frames corresponding to the feature map cluster centers of the groups as a plurality of keyframes; providing a user interface to obtain a plurality of depth maps of the keyframes; initializing a depth map generation network according to the feature converter, and training the depth map generation network according to the keyframes and the depth maps; and inputting the frames other than the keyframes into the depth map generation network to compute the corresponding depth maps.

2. The method for generating a single-view image depth map sequence of claim 1, further comprising: setting a plurality of candidate cluster numbers, and for each of the candidate cluster numbers, performing the unsupervised clustering algorithm on the feature maps to compute a silhouette coefficient; and setting the candidate cluster number corresponding to a largest one of the silhouette coefficients as the cluster number of the unsupervised clustering algorithm, wherein the cluster number equals the number of the keyframes.

3. The method for generating a single-view image depth map sequence of claim 2, wherein the unsupervised clustering algorithm is a k-means algorithm.

4. The method for generating a single-view image depth map sequence of claim 1, wherein the depth map generation network comprises a spatio-temporal similarity operation for receiving a main feature map and a plurality of auxiliary feature maps temporally adjacent to the main feature map, wherein for a main feature point in the main feature map, the spatio-temporal similarity operation obtains a plurality of auxiliary feature points within a preset range of the auxiliary feature maps, and computes the similarity between the main feature point and each of the auxiliary feature points to compensate the main feature point.

5. A computer program product, loaded and executed by a computer system to perform the method for generating a single-view image depth map sequence of claim 1.
6. A single-view image depth map sequence generation system, comprising: a memory for storing a plurality of instructions; and a processor for executing the instructions to perform a plurality of steps: obtaining a plurality of frames and converting each of the frames into a feature map according to a feature converter; training an autoencoder according to the frames, wherein the autoencoder comprises the feature converter and a feature inverse converter, and the feature converter and the feature inverse converter each comprise a corresponding neural network; performing an unsupervised clustering algorithm on the feature maps to divide the feature maps into a plurality of groups, and obtaining the frames corresponding to the feature map cluster centers of the groups as a plurality of keyframes; providing a user interface to obtain a plurality of depth maps of the keyframes; initializing a depth map generation network according to the feature converter, and training the depth map generation network according to the keyframes and the depth maps; and inputting the frames other than the keyframes into the depth map generation network to compute the corresponding depth maps.
TW109114069A 2020-04-27 2020-04-27 Method, system and computer program product for generating depth maps of monocular video frames TWI748426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109114069A TWI748426B (en) 2020-04-27 2020-04-27 Method, system and computer program product for generating depth maps of monocular video frames


Publications (2)

Publication Number Publication Date
TW202141973A TW202141973A (en) 2021-11-01
TWI748426B true TWI748426B (en) 2021-12-01

Family

ID=80680922


Country Status (1)

Country Link
TW (1) TWI748426B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101400001A (en) * 2008-11-03 2009-04-01 清华大学 Generation method and system for video frame depth chart
CN102196292A (en) * 2011-06-24 2011-09-21 清华大学 Human-computer-interaction-based video depth map sequence generation method and system
US20190332942A1 (en) * 2016-12-29 2019-10-31 Zhejiang Gongshang University Method for generating spatial-temporally consistent depth map sequences based on convolution neural networks
WO2020032354A1 (en) * 2018-08-06 2020-02-13 Samsung Electronics Co., Ltd. Method, storage medium and apparatus for converting 2d picture set to 3d model


Also Published As

Publication number Publication date
TW202141973A (en) 2021-11-01
