TWI846598B - 3d surface reconstruction method - Google Patents
- Publication number: TWI846598B (application TW112135412A)
- Authority: TW (Taiwan)
- Prior art keywords: feature, attention, surface reconstruction, reconstruction method, information
Landscapes
- Image Processing (AREA)
- Length Measuring Devices By Optical Means (AREA)
Description
The present disclosure relates to a three-dimensional (3D) surface reconstruction method built on the neural network architecture of the Transformer model.
Existing 3D surface reconstruction methods typically begin by capturing multi-view images with a camera and obtaining the camera poses from a simultaneous localization and mapping (SLAM) system. A depth map is then generated for each viewpoint by depth map estimation, and a stereo matching algorithm matches feature pixels across the images. Finally, the depth maps, the matched feature pixels, and the camera poses are fused into a 3D model representation such as a point cloud, a polygon mesh, or a truncated signed distance function (TSDF).
However, these methods typically face two major difficulties. First, the depth maps to be fused into the 3D model come from different viewpoints with different camera poses, so they inherently cannot be stitched and fused smoothly. Second, when the image texture is too smooth or repetitive, the stereo matching algorithm struggles to find the correct matching feature pixels. Both difficulties leave numerous reconstruction defects on the surface of the reconstructed 3D model.
The present disclosure provides a 3D surface reconstruction method comprising: extracting a plurality of visual features from multiple multi-view images of an object and obtaining a plurality of camera poses; converting the camera poses into a plurality of pose embeddings; adding the visual features and the pose embeddings to produce a plurality of inputs; feeding the inputs into an encoder of a Transformer model, which sequentially performs several layers of sequence-to-sequence first attention operations to output a plurality of volumetric features; feeding the volumetric features into a decoder of the Transformer model, which sequentially performs several layers of second attention operations to produce a feature prediction; mapping the operation dimension of the feature prediction to a one-dimensional feature dimension; and finally reconstructing a 3D model of the object from that feature dimension.
In summary, the present disclosure is a 3D surface reconstruction method that exploits the Transformer network architecture's strength at learning global long-range dependencies. It replaces the conventional approach of separately estimating a depth map for each viewpoint and instead updates the volumetric representation with reference to the images of all viewpoints, thereby improving the reconstruction of the object's surface details. The method can thus reconstruct object surfaces from different viewpoints and effectively reduce surface reconstruction defects.
Embodiments of the present disclosure are described below with reference to the accompanying drawings. Some components or structures are omitted from the drawings to show the technical features of the disclosure clearly. In the drawings, identical reference numerals denote identical or similar components or circuits. It should be understood that although the terms "first", "second", and so on may be used herein to describe various components, parts, regions, or functions, these components, parts, regions, and/or functions are not limited by such terms, which serve only to distinguish one component, part, region, or function from another.
Referring to FIG. 1, the 3D surface reconstruction method mainly comprises steps S10 to S22. First, in step S10, a convolutional neural network (CNN) feature extractor extracts a plurality of visual features from multiple multi-view images of an object, and a plurality of camera poses is obtained, where each camera pose comprises a camera position coordinate and a camera shooting angle. In one embodiment, the feature extractor may be any deep-learning-based extractor such as VGGNet, ResNet, EfficientNet, MobileNet, Vision Transformer, or Swin Transformer.
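As a rough illustration (the patent provides no code), the sketch below shows what step S10 might look like in PyTorch with a ResNet-50 backbone, one of the extractors the embodiment lists. The image sizes, the number of views, and the 6-D pose layout (position plus angles) are assumptions made for the example.

```python
import torch
import torchvision

# Eight hypothetical multi-view images of the object (batch, channels, H, W).
views = torch.randn(8, 3, 224, 224)

# ResNet-50 backbone with its classification head removed, so each image
# yields a global-pooled feature vector.
backbone = torchvision.models.resnet50(weights=None)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

with torch.no_grad():
    visual_features = extractor(views).flatten(1)   # shape (8, 2048)

# Assumed camera pose per view: position (x, y, z) plus shooting angles
# (yaw, pitch, roll); the patent says only "position coordinate and shooting angle".
camera_poses = torch.randn(8, 6)
```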
In step S12, the obtained camera poses are transformed by a multi-layer perceptron (MLP), which maps each camera pose to dimension N and thereby converts it into a pose embedding, where N equals the dimension of the visual features of the multi-view images.
In step S14, the visual features and the pose embeddings are added together to perform feature fusion, producing a plurality of inputs that serve as the input of a Transformer model, as sketched below.
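Continuing the sketch, steps S12 and S14 might be realized as follows; the MLP's hidden width is an assumption, while N matches the visual feature dimension as the text requires.

```python
import torch.nn as nn

N = 2048  # dimension of the visual features produced in the sketch above

# Step S12: an MLP maps each 6-D camera pose to an N-D pose embedding.
pose_mlp = nn.Sequential(
    nn.Linear(6, 256),
    nn.ReLU(),
    nn.Linear(256, N),
)
pose_embeddings = pose_mlp(camera_poses)        # shape (8, N)

# Step S14: element-wise addition fuses features and embeddings into the
# Transformer input.
transformer_input = visual_features + pose_embeddings
```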
Referring to FIG. 1 and FIG. 2 together, in step S16 the inputs are fed into an encoder 12 of the Transformer model 10, which sequentially performs several layers of sequence-to-sequence first attention operations to output a plurality of volumetric features. In one embodiment, the encoder 12 is composed of a plurality of encoder blocks 14, each further comprising a first multi-head self-attention (MSA) module 16 and a first feed-forward network (FFN) module 18, where the first MSA module 16 computes internally with the attention mechanism and the first FFN module 18 continually strengthens the detail-expressing capability of the features. Accordingly, the first MSA module 16 converts each input into three feature vectors (a Q vector, a K vector, and a V vector) and performs the first attention operation, and the first FFN module 18 then refines the details to output the volumetric features, which encode the object's depth, texture, and spatial information. With N encoder blocks 14, the first MSA module 16 in each block generates the three feature vectors from the previous layer's output and performs the first attention operation; layer by layer, the attention mechanism refines the object's image features and improves feature quality, so that the object's stereoscopic features at every angle are progressively extracted and then strengthened by the first FFN module 18, as sketched below.
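A minimal sketch of one encoder block 14 follows; the layer normalizations, residual connections, head count, block count, and FFN width are assumptions, as the patent does not specify them.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """First MSA module followed by first FFN module (one encoder block 14)."""
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # Self-attention: Q, K, and V are all derived from the same input x.
        attended, _ = self.msa(x, x, x)
        x = self.norm1(x + attended)
        return self.norm2(x + self.ffn(x))

# N stacked encoder blocks turn the fused inputs into volumetric features.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
volume_features = encoder(transformer_input.unsqueeze(0))   # shape (1, 8, 2048)
```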
Referring to FIG. 1 and FIG. 2 together, in step S18 the volumetric features are fed into a decoder 20 of the Transformer model 10, which sequentially performs several layers of second attention operations to produce a feature prediction. In one embodiment, the decoder 20 is composed of a plurality of decoder blocks 22, each further comprising a second multi-head self-attention (MSA) module 24, an encoder-decoder attention (EDA) module 26, and a second feed-forward network (FFN) module 28, where both the second MSA module 24 and the EDA module 26 compute internally with the attention mechanism and the second FFN module 28 continually strengthens the detail-expressing capability of the features. Accordingly, the second MSA module 24 converts the feature queries output by the previous layer into three feature vectors, namely a Q (Query) vector, a K (Key) vector, and a V (Value) vector, and performs the second attention operation, so that its output serves as a first feature vector (V vector); the EDA module 26 takes the volumetric features output by the encoder blocks 14 as input to generate a second feature vector (Q vector) and a third feature vector (K vector), performs the second attention operation on the first (V), second (Q), and third (K) feature vectors, and the second FFN module 28 then refines the details to output the feature prediction. With N decoder blocks 22, the second MSA module 24 in each block generates three feature vectors from the previous layer's output and performs the first of these second attention operations; the difference is that the EDA module 26 generates the second (Q) and third (K) feature vectors from the encoder's volumetric features and uses the second MSA module 24's output as the first feature vector (V) for the other second attention operation, after which the second FFN module 28 strengthens the detail expression, as sketched below.
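A corresponding sketch of one decoder block 22 is shown below. Note the wiring the embodiment describes: in the EDA module, Q and K are generated from the encoder's volumetric features while V is the self-attention output. For the attention matrix shapes to agree, this sketch assumes the query sequence and the volumetric-feature sequence have the same length.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Second MSA module, EDA module, then second FFN module (one decoder block 22)."""
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.eda = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, queries, volume_features):
        # Second MSA: self-attention over the TSDF queries; its output serves
        # as the first feature vector (V) in the EDA module below.
        v, _ = self.msa(queries, queries, queries)
        v = self.norm1(queries + v)
        # EDA: Q and K are generated from the encoder's volumetric features,
        # V is the MSA output, per the embodiment's description.
        fused, _ = self.eda(volume_features, volume_features, v)
        fused = self.norm2(fused + v)
        return self.norm3(fused + self.ffn(fused))

# TSDF queries initialized as zeros here; learnable embeddings in practice.
tsdf_queries = torch.zeros_like(volume_features)   # volume_features: encoder output above
prediction = tsdf_queries
for block in [DecoderBlock() for _ in range(6)]:   # N stacked decoder blocks
    prediction = block(prediction, volume_features)
```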
Within the decoder block 22, the second multi-head self-attention module 24 likewise refines the quality of the feature queries (TSDF queries), while the cross encoder-decoder attention performed by the EDA module 26 uses the expressive power of the object's volumetric features to guide, layer by layer, the generation of the feature prediction (e.g., TSDF features), at the same time mapping the operation dimension from the high-order feature dimension down to the spatial dimension of the object reconstruction model used herein.
In one embodiment, both the first attention operation used in the encoder 12 and the second attention operation used in the decoder 20 of the Transformer model 10 are computed with equation (1) below, where Q denotes the Q vector (second feature vector), K denotes the K vector (third feature vector), V denotes the V vector (first feature vector), the superscript T denotes the transpose of a matrix, and d_k is the dimension of the K vector:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$
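For concreteness, equation (1) translates directly into a few lines of PyTorch; this is a generic sketch of scaled dot-product attention, not code taken from the patent.

```python
import torch
import torch.nn.functional as F

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)                                   # dimension of the K vector
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5      # scaled dot products
    return F.softmax(scores, dim=-1) @ V               # attention-weighted sum of V
```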
In step S20, the operation dimension of the feature prediction is mapped by a truncated signed distance function projector (TSDF Projector) back to the original one-dimensional feature dimension. Finally, in step S22, a 3D model of the object is reconstructed from that feature dimension; the result is a 3D model expressed in the truncated signed distance function (TSDF) format.
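One plausible reading of steps S20 and S22 is sketched below: a linear TSDF projector collapses each predicted feature to a single signed-distance value, from which a surface could then be extracted. Both the linear projector and the marching-cubes extraction are assumptions; the patent states only that the result is expressed in TSDF format.

```python
import torch.nn as nn

# Step S20: a linear projector maps each 2048-D predicted feature to a single
# truncated signed distance value (an assumed realization of the TSDF Projector).
tsdf_projector = nn.Linear(2048, 1)
tsdf_values = tsdf_projector(prediction).squeeze(-1)   # prediction: decoder output above

# Step S22: reshaped into a voxel grid, the TSDF volume could be converted to a
# surface mesh with, e.g., skimage.measure.marching_cubes -- an assumption, as
# the patent does not name the surface extraction method.
```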
Referring to FIG. 1 and FIG. 3 together, the entire 3D surface reconstruction method is executed by a 3D surface reconstruction system 30. The 3D surface reconstruction system 30 comprises an image capture device 32, a central processing unit (CPU) 34, and a graphics processing unit (GPU) 36. The image capture device 32 is electrically connected to the CPU 34, and the CPU 34 is electrically connected to the GPU 36. The image capture device 32 photographs the object from multiple viewpoints to obtain the multi-view images and transmits them to the CPU 34. The CPU 34 has a built-in machine learning algorithm that can execute steps S10 to S22 of FIG. 1, so upon receiving the multi-view images the CPU 34 performs the 3D surface reconstruction method of steps S10 to S22. During the process, the image-processing portions may be handed off by the CPU 34 to the GPU 36 to shorten the image processing time; after the GPU 36 finishes, the results are returned to the CPU 34, which finally outputs the reconstructed 3D model.
In the above embodiment, the CPU 34 and the GPU 36 are built into an electronic device 38, which may be a personal computer, a notebook computer, a tablet computer, or the like, though the present disclosure is not limited thereto. In one embodiment, the electronic device 38 performs the computation with the CPU 34; in other embodiments, an embedded controller (EC), a microprocessor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a system on a chip (SoC), or other similar components or combinations thereof may be used instead, and the present disclosure is not limited thereto.
In one embodiment, the image capture device 32 may be a monochrome or color camera, a stereo camera, a digital camera, a digital video camera, a depth camera, or any other electronic device capable of capturing images.
In summary, the present disclosure is a 3D surface reconstruction method that exploits the Transformer network architecture's strength at learning global long-range dependencies. It replaces the conventional approach of separately estimating a depth map for each viewpoint and instead updates the volumetric representation with reference to the images of all viewpoints, thereby improving the reconstruction of the object's surface details. The method can thus reconstruct object surfaces from different viewpoints and effectively overcome the surface reconstruction defects of conventional methods.
The embodiments described above merely illustrate the technical ideas and features of the present disclosure, their purpose being to enable those skilled in the art to understand and practice its content; they shall not limit the patent scope of the present disclosure. That is, all equivalent changes or modifications made in accordance with the spirit disclosed herein shall remain covered by the scope of the claims of the present disclosure.
10: Transformer model
12: Encoder
14: Encoder block
16: First multi-head self-attention module
18: First feed-forward network module
20: Decoder
22: Decoder block
24: Second multi-head self-attention module
26: Encoder-decoder attention module
28: Second feed-forward network module
30: 3D surface reconstruction system
32: Image capture device
34: Central processing unit
36: Graphics processing unit
38: Electronic device
S10~S22: Steps
FIG. 1 is a flow chart of a 3D surface reconstruction method according to an embodiment of the present disclosure. FIG. 2 is a schematic diagram of the architecture of the Transformer model used in the 3D surface reconstruction method according to an embodiment of the present disclosure. FIG. 3 is a block diagram of a 3D surface reconstruction system according to an embodiment of the present disclosure.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112135412A TWI846598B (en) | 2023-09-15 | 2023-09-15 | 3d surface reconstruction method |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TWI846598B true TWI846598B (en) | 2024-06-21 |
| TW202514540A TW202514540A (en) | 2025-04-01 |
Family
ID=92541863
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112135412A TWI846598B (en) | 2023-09-15 | 2023-09-15 | 3d surface reconstruction method |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI846598B (en) |
- 2023-09-15 TW TW112135412A patent/TWI846598B/en active
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| TW202213997A (en) * | 2020-09-16 | 2022-04-01 | 美商高通公司 | End-to-end neural network based video coding |
| TWI773458B (en) * | 2020-11-25 | 2022-08-01 | 大陸商北京市商湯科技開發有限公司 | Method, device, computer equipment and storage medium for reconstruction of human face |
| CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Training neural radiation field model and face generation method, device and server |
| US11521379B1 (en) * | 2021-09-16 | 2022-12-06 | Nanjing University Of Information Sci. & Tech. | Method for flood disaster monitoring and disaster analysis based on vision transformer |
| CN116681859A (en) * | 2023-03-16 | 2023-09-01 | 珠海剑心互动娱乐有限公司 | Multi-view shape reconstruction method, system, electronic device and storage medium |
| CN116563856A (en) * | 2023-06-08 | 2023-08-08 | 浙江大学 | A named entity recognition method for image text, electronic equipment, and medium |
Also Published As
| Publication number | Publication date |
|---|---|
| TW202514540A (en) | 2025-04-01 |