WO2023038369A1 - Semantic three-dimensional (3D) building augmentation - Google Patents
- Publication number
- WO2023038369A1 (PCT/KR2022/013187, KR2022013187W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- building
- labels
- semantic
- electronic device
- buildings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/05—Geographic models
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Geometry (AREA)
- Multimedia (AREA)
- Computer Graphics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Remote Sensing (AREA)
- Processing Or Creating Images (AREA)
Abstract
The subject invention discloses a method to semantically label 3D models of buildings from the shapefile of an area and street view images taken in that area. The invention can further semantically segment images into building parts, including occluded regions. Moreover, the invention can project the 2D semantic segmentation labels onto the 3D models.
Description
The disclosure generally relates to image processing systems and, more specifically, to a system for performing semantic three-dimensional (3D) building augmentation.
With the advent of 5G (fifth-generation) networks, there is a need for 3D models of buildings with semantic information for wave propagation modeling. The disclosure may provide a method that creates semantically labelled 3D models of entire cities from geo-located street view images and geolocation-related building data with building elevation. These data may include the geolocation and the general shape of each building (e.g., its height and building-footprint shapefiles).
Related art uses 3D LIDAR data and point cloud data as input to create semantically labelled 3D models; such data are not as readily available as street view images and shapefiles. Related art also does not segment microstructures in the building, such as windows and doors, which are important for creating 3D models accurate enough for simulation.
Lastly, no related art addresses the issue that arises when objects and structures such as trees and posts occlude the building.
Creating segmented 3D buildings of entire cities can be used not only for telecommunications planning but also for urban planning, autonomous vehicle navigation, indoor robot navigation, noise propagation simulation, solar radiation calculation, real estate trends, construction supplies demand estimation, and enforcing building standards.
One related approach, semantic 3D reconstruction with learned multi-view stereo and 2D segmentation of aerial images, outlines a pipeline for constructing a 3D point cloud from a set of 2D images. From the input images, the following are acquired: 2D segmentation of the images, an estimation of the camera source location, and an estimation of the depths of objects of interest in the images. An initial point cloud is generated by combining the 2D segmentation results and depth maps. Labels are assigned on the point cloud through multi-view consistency. To remove noise from the point cloud, post-processing uses a graph-based method to establish connected points. The inputs are drone-captured images containing vegetation, buildings, roads, vehicles, and background. Occlusions are handled by comparing each image's depth map with nearby depth maps and their corresponding 2D segmentation results.
Another related approach, deep projective 3D semantic segmentation, segments a 3D point cloud by generating images from the point cloud input, each image corresponding to a different view of the point cloud. A point-splatting method is used to create these images. The 2D representations (images) are then segmented, and the segmentation labels are reprojected onto the 3D point cloud. The flow is thus 3D point cloud to 2D images for segmentation, then back to a 3D point cloud with segmentation labels.
US8284190B2 discloses a registration of street-level imagery to 3D building models that corrects the origin point (camera coordinates) of a 2D street view image by optimizing a cost function based on the alignment of the edges of projected 2D buildings to their 3D model counterparts. It involves the extraction of building features from 2D street view images and their projection onto their respective 3D models. Specifically, 3D LIDAR data along with a LIDAR edge detection method is used to identify the building edges and skyline in the 2D image. After projecting the 2D features to the 3D model, the distance error between edges is used in the cost function to regress and correct the camera coordinates. Its inputs are custom street view images, 3D building models, and 3D LIDAR data; the extracted features are the building edges and the skyline (the separation between the top of the building and the sky).
US10643380B2 discloses generating multi-dimensional building models from ground-level images, wherein a 3D point cloud is created from ground-level images covering multiple building views. Vertices that correspond to building edges are manually or semi-automatically labelled in the 3D point cloud, such that these vertices form the edges of a planar surface. Surfaces are used to create simple facade geometry and are textured. Non-edge points are correlated to planar surfaces, and the surfaces are adjusted to fit the correlated points. The surfaces are used to reconstruct a textured 3D building model. Its inputs are orthogonal, ground-level images, with manual or semi-automatic selection of edges and planar surfaces.
US 2001/0038718 A1 discloses a method and apparatus for performing geo-spatial registration of imagery, i.e., a system and method for accurately mapping between image coordinates and geo-coordinates. The system utilizes the imagery and terrain information contained in the geo-spatial database to precisely align geodetically calibrated reference imagery with an input image, e.g., dynamically generated video images, and thus achieve high-accuracy identification of locations within the scene. When a sensor, such as a video camera, images a scene contained in the geo-spatial database, the system recalls a reference image pertaining to the imaged scene. This reference image is aligned very accurately with the sensor's images using a parametric transformation. Thereafter, other information associated with the reference image can easily be overlaid upon or otherwise associated with the sensor imagery. However, US 2001/0038718 A1 fails to disclose automated semantic segmentation of building features and microstructures, e.g., pillars, stairs, and doors that are part of the building.
Related art is unable to semantically label 3D building models through 2D-to-3D projection of street view images, and it lacks focus on semantically labelling microstructures on buildings, such as windows or doors. Other related work uses 3D LIDAR data and 3D point clouds, which are not as readily available as 2D street view images and shapefiles: obtaining LIDAR data requires aerial drones and LIDAR sensors, while street view images of buildings can be taken by a street-level, geolocated camera. Further, related art uses heuristics-guided methods for extracting the building from the 2D street view images, primarily using building edges and skylines, and does not address problem cases where occluders, such as trees or electricity posts, obscure the view of the buildings in the image.
It is therefore a principal object of the subject invention to overcome the aforementioned drawbacks of the cited related art by providing a method that labels, among other things, defined microstructures present in the building and unlabelled or initially unseen faces of the initial model, in order to provide a more complete set of semantic labels for 3D models.
The subject invention discloses the use of street view images and 3D building geodata from map vendors; automated semantic segmentation of building features and microstructures, with particular interest in microstructures, e.g., pillars, stairs, and doors that are part of the building; removal of occlusions by way of inpainting; and segmentation of the 2D images and building map followed by projection to a 3D mesh with segmentation labels.
The subject invention can semantically segment images into building parts including occluded regions. It can also project the 2D semantic segmentation labels to the 3D models and post-process initial semantic label projection, which includes, but is not limited to, defined microstructures present in the building, and unlabelled or initially unseen faces of the initial model, in order to provide a more complete set of semantic labels for 3D models.
The subject invention can be used to create semantically labelled 3D models for 5G wave propagation modeling and other telecommunications planning tasks. Based on initial results, the method is able to project semantic labels from street view images to 3D models of buildings using only shapefiles with building elevation.
The labelled 3D models output by the subject invention can further be used to estimate the construction supplies cities will need in the future, as building parts such as windows and doors have finite lifetimes.
The subject invention can be used for real estate trend analysis, wherein architectural trends and the age of buildings can be inferred from the labelled 3D models.
The labelled 3D models created by the subject invention can further be used to enforce building standards in different cities.
According to an aspect of the disclosure, a method of creating semantic 3D building augmentation is provided. The method may include acquiring a shapefile of an area and street view images taken in that area. The method may include converting the shapefile to a triangular mesh and computing camera parameters. The method may include extracting from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The method may include projecting the 2D semantic segmentation labels to 3D models. The method may include post-processing the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
According to an aspect of the disclosure, an electronic device including at least one memory configured to store instructions and at least one processor is provided. The at least one processor may be configured, when executing the instructions, to acquire a shapefile of an area and street view images taken in that area. The at least one processor may be configured to convert the shapefile to a triangular mesh and compute camera parameters. The at least one processor may be configured to extract from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The at least one processor may be configured to project the 2D semantic segmentation labels to 3D models. The at least one processor may be configured to post-process the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
According to an aspect of the disclosure, a machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device to perform operations is provided. The instructions may cause the at least one processor to acquire a shapefile of an area and street view images taken in that area. The instructions may cause the at least one processor to convert the shapefile to a triangular mesh and compute camera parameters. The instructions may cause the at least one processor to extract from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The instructions may cause the at least one processor to project the 2D semantic segmentation labels to 3D models. The instructions may cause the at least one processor to post-process the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
Other objects and advantages of the present invention will become apparent upon reading the detailed description taken together with the accompanying drawings.
Figure 1 is a flow diagram illustrating an embodiment of a method of creating semantic 3D building augmentation.
Figure 2 is a block diagram with illustrative views of the 2D semantic segmentation.
Figure 3 is an example block diagram of the method of 2D to 3D projection.
Figure 4 is an example block diagram of the method of the post-processing.
Figure 5 is an illustrative sample of 2D semantic segmentation.
Figure 6 is an illustrative sample of house detection.
Figure 7 illustrates sample results of shapefile splitting and matching of street view images.
Figure 8 illustrates sample results of 2D projection of texture and semantic label pixels.
Figure 9 illustrates sample results of post-processing.
Figure 10 is a block diagram of an electronic device according to embodiments; and
Figure 11 is a flowchart illustrating a method of creating semantic 3D building augmentation according to an embodiment of the disclosure.
The following detailed description should be read with reference to the appended drawings, in which like elements in different drawings are numbered identically. It will be understood that the embodiments shown in the drawings and described herein are merely for illustrative purposes and are not intended to limit the application to any embodiment. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the scope of the application as defined by the appended claims.
As shown in Figure 1, the method of creating semantic 3D building augmentation 100 comprises acquiring input data for an area, comprising a shapefile of building elevation 101, the camera location and field of view 102, and street view images 103. The shapefile of building elevation 101 is converted to a triangular mesh 104, while the camera parameters, including the camera intrinsics and extrinsics 105, are computed from the camera location and field of view 102. Using automated 2D semantic segmentation 106, the pixelwise location of the building and its features, including regions occluded by artifacts such as trees, people, and cars, is extracted from the street view images 103. The triangular mesh 104 and the labels extracted from the street view images 103 are combined by projecting the 2D semantic segmentation labels to 3D models 107, which then undergo post-processing 108 of the initial semantic label projection to provide a more complete set of semantic labels for the 3D models 109.
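By way of illustration, the conversion of the building-elevation shapefile 101 into the triangular mesh 104 can be sketched as a simple polygon extrusion. The patent does not specify an implementation; the libraries, function name, and sample footprint below are assumptions.

```python
# Illustrative sketch only, not the patent's implementation: extrude a building
# footprint from the shapefile 101 into the triangular mesh 104.
from shapely.geometry import Polygon
import trimesh

def footprint_to_mesh(footprint_xy, elevation_m):
    """Extrude a 2D building footprint (local metric coordinates) by its elevation."""
    polygon = Polygon(footprint_xy)  # footprint ring read from a shapefile record
    # trimesh triangulates the footprint and extrudes it into a closed mesh
    # whose walls, roof, and floor are triangles.
    return trimesh.creation.extrude_polygon(polygon, height=elevation_m)

# Hypothetical 12 m x 8 m footprint with a 9 m building elevation.
mesh = footprint_to_mesh([(0, 0), (12, 0), (12, 8), (0, 8)], elevation_m=9.0)
print(len(mesh.vertices), "vertices,", len(mesh.faces), "triangles")
```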
As shown in Figure 2, the 2D semantic segmentation 106 is used to extract from the street view images the pixelwise location of the building and its features. Ideally, this could be done by training a network directly on the dataset and using the output of the network as the labelled image. However, in most cases, buildings and houses are occluded in the image by artifacts such as trees, people, cars, etc. This results in a loss of information relevant to the final 3D model output, which is the whole house and its microstructures projected to 3D without the occlusions.
The 2D semantic segmentation 106 further comprises the following steps. First, a mask and a masked image of the occluded regions 111 are generated by predicting walls and windows as parts of the buildings, together with the occluded regions, from the semantic segmentation of the base image 110 using PSPNet, as disclosed by Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017), "Pyramid scene parsing network," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881-2890, https://arxiv.org/abs/1612.01105. Thereafter, inpainting 112 recovers the possible parts of the building that were blocked by the occluded regions using DeepFill, as disclosed by Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., & Huang, T. S. (2019), "Free-form image inpainting with gated convolution," Proceedings of the IEEE International Conference on Computer Vision, pp. 4471-4480, https://arxiv.org/abs/1806.03589. The inpainted image 112 is then subjected to semantic segmentation 113, and the bounding box for house detection is generated 114, wherein the detected houses and buildings isolate the building of interest from the label image using Faster R-CNN, as disclosed by Ren, S., He, K., Girshick, R., & Sun, J. (2015), "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, pp. 91-99, https://arxiv.org/abs/1506.01497.
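The following is a minimal sketch of the segment-mask-inpaint-detect flow 110-114, with off-the-shelf stand-ins for the cited networks: a torchvision segmentation model in place of PSPNet, OpenCV's classical inpainting in place of DeepFill, and torchvision's Faster R-CNN for house detection. The class indices, detection threshold, and function names are assumptions for illustration, not values from the patent.

```python
# Hedged sketch of the occlusion-aware segmentation flow 110-114 using
# off-the-shelf stand-ins (torchvision segmentation for PSPNet, cv2.inpaint
# for DeepFill, torchvision Faster R-CNN for house detection 114).
import cv2
import numpy as np
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.transforms.functional import normalize

seg_net = deeplabv3_resnet50(weights="DEFAULT").eval()       # stand-in for PSPNet 110/113
det_net = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # stand-in for Faster R-CNN 114
OCCLUDER_IDS = {7, 15}  # assumed ids for "car" and "person" in the stand-in's label map

def segment(image_rgb):
    """Return a pixelwise class map for an HxWx3 uint8 RGB image."""
    x = torch.from_numpy(image_rgb).float().permute(2, 0, 1) / 255.0
    x = normalize(x, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        logits = seg_net(x.unsqueeze(0))["out"][0]
    return logits.argmax(0).numpy()

def occlusion_aware_labels(image_rgb):
    labels = segment(image_rgb)                                  # base segmentation 110
    mask = np.isin(labels, list(OCCLUDER_IDS)).astype(np.uint8)  # occluder mask 111
    inpainted = cv2.inpaint(image_rgb, mask, 5, cv2.INPAINT_TELEA)  # inpainting 112
    labels = segment(inpainted)                                  # re-segmentation 113
    x = torch.from_numpy(inpainted).float().permute(2, 0, 1) / 255.0
    with torch.no_grad():
        det = det_net([x])[0]                                    # house detection 114
    boxes = det["boxes"][det["scores"] > 0.7]                    # keep confident detections
    return labels, boxes
```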
As shown in Figure 3, the projection 107 of the 2D semantic segmentation labels to 3D models comprises the steps of matching building triangular meshes to their corresponding images 115, projecting the 2D semantic labels onto the building triangular mesh 116 using a pinhole camera model, and processing pose correction 117, adapted to handle errors using the building bounding box.
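As a sketch of the pinhole projection 116, one could derive the intrinsic matrix from the field of view 102 and project mesh face centroids into the label image. All variable names are illustrative, and a real implementation would also need visibility (z-buffer) testing, which is omitted here for brevity.

```python
# Sketch of label projection 116 with a pinhole camera model. The intrinsics
# are derived from the horizontal field of view 102; (R, t) correspond to the
# camera extrinsics 105. Visibility testing is omitted for brevity.
import numpy as np

def intrinsics_from_fov(width, height, hfov_rad):
    """Pinhole intrinsic matrix K for a camera with the given horizontal FOV."""
    fx = (width / 2.0) / np.tan(hfov_rad / 2.0)
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fx, height / 2.0],  # assume square pixels: fy = fx
                     [0.0, 0.0, 1.0]])

def project_labels(vertices, faces, label_image, K, R, t):
    """Assign each face the 2D semantic label under its projected centroid."""
    centroids = vertices[faces].mean(axis=1)  # (F, 3) world-space face centroids
    cam = R @ centroids.T + t.reshape(3, 1)   # world -> camera coordinates
    uvw = K @ cam                             # pinhole projection
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    h, w = label_image.shape
    face_labels = np.full(len(faces), -1)     # -1 marks faces left for post-processing 108
    ok = (uvw[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    face_labels[ok] = label_image[v[ok].astype(int), u[ok].astype(int)]
    return face_labels
```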
The post-processing augments the initial labelled 3D model produced by the 2D-to-3D projection, which may be incompletely labelled. The post-processing block uses heuristics based on assumptions about buildings to complete the labels on the 3D model. As shown in Figure 4, the post-processing 108 comprises capturing mesh views 118, where each side of the initially labelled 3D model is rendered to an image, and 2D post-processing 119, where the labels are completed. The 2D post-processing is further divided into the following processes: pre-processing 120, which involves extracting information about the microstructures, such as but not limited to windows and doors, and correcting rendering errors; and view processing 121, which involves label completion for each generated view. In view processing, the horizontal boundary between the wall and the roof is found, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls.
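A minimal sketch of the wall-propagation heuristic follows. The per-column boundary search and the label codes are assumptions, since the patent only states that unlabelled pixels below the wall/roof boundary are labelled as walls.

```python
# Sketch of the wall-propagation step in view processing 121: in each pixel
# column of a rendered view, find the lowest roof pixel (the wall/roof
# boundary) and relabel unlabelled pixels below it as wall.
import numpy as np

UNLABELLED, WALL, ROOF = 0, 1, 2  # assumed label codes

def propagate_walls(view_labels):
    """Complete wall labels below the wall/roof boundary of one rendered view."""
    out = view_labels.copy()
    for col in range(out.shape[1]):
        roof_rows = np.flatnonzero(out[:, col] == ROOF)
        if roof_rows.size == 0:
            continue                     # no roof in this column; nothing to propagate
        boundary = roof_rows.max()       # lowest roof pixel (image rows grow downward)
        below = out[boundary + 1:, col]  # a view into `out`, so edits apply in place
        below[below == UNLABELLED] = WALL
    return out
```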
The final processing 122 completes the labels using all views as a whole. This includes, but is not limited to, asserting wall continuity across views, after which the label reprojection 123 reprojects the final 2D labels back onto the 3D model.
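One simple way to realize this cross-view completion is a per-face majority vote over the labels observed in each rendered view, sketched below. The voting scheme is an assumption; the patent names the consistency heuristic but not its mechanics.

```python
# Sketch of final processing 122 / label reprojection 123: fuse the per-view
# face labels by majority vote so that labels (e.g., walls) stay consistent
# across views before being written back onto the 3D model.
from collections import Counter

def fuse_view_labels(per_view_face_labels, num_faces, unlabelled=-1):
    """per_view_face_labels: one face-label sequence per rendered mesh view 118."""
    fused = [unlabelled] * num_faces
    for face in range(num_faces):
        votes = [labels[face] for labels in per_view_face_labels
                 if labels[face] != unlabelled]
        if votes:
            fused[face] = Counter(votes).most_common(1)[0][0]
    return fused
```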
In the samples provided in Figure 5, the input image 200 is processed using automated 2D semantic segmentation 106, rendering a segmented view 201. In Figure 6, the input image 202 generates a bounding box for house detection 203, also using automated 2D semantic segmentation 106. In Figure 7, the separated 3D object 204 illustrates shapefile splitting and matching of street view images 205. In Figure 8, the 2D projections 206a, 206b render texture and semantic label pixels 207a, 207b, respectively, through the 2D-to-3D projection process 107. Lastly, referring to Figure 9, the post-processing 108 renders the following: the projected output 208, which is the input to the post-processing; the output after capturing 209 each side of the mesh; the output after pre-processing 210; the output after propagating the walls 211; and, lastly, the re-projected final output 212.
The re-projected final output 212 can be used to create semantically labelled 3D models for 5G wave propagation modeling and other telecommunications planning tasks. Based on initial results, the method is able to project semantic labels from street view images to 3D models of buildings using only shapefiles with building elevation. The output can be used to estimate the supplies cities will need in the future, as building parts such as windows and doors have finite lifetimes. The architectural trends and age of buildings can be inferred from the labelled 3D models, which can then be used for real estate trend analysis. Lastly, the labelled 3D models can be used to enforce building standards in different cities.
Figure 10 is a block diagram of an electronic device 1000 according to embodiments of the disclosure.
Figure 10 is for illustration only, and other embodiments of the electronic device 1000 could be used without departing from the scope of this disclosure. For example, the electronic device 1000 may omit some of the illustrated components (e.g., the interface 1040 or the display 1050) or may additionally include other components.
The electronic device 1000 includes a bus 1010, a processor 1020, a memory 1030, an interface 1040, and a display 1050.
The bus 1010 includes a circuit for connecting the components 1020 to 1050 with one another. The bus 1010 functions as a communication system for transferring data between the components 1020 to 1050 or between electronic devices.
The processor 1020 includes one or more of a central processing unit (CPU), a graphics processor unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC), a field-programmable gate array (FPGA), or a digital signal processor (DSP). The processor 1020 is able to perform control of any one or any combination of the other components of the electronic device 1000, and/or perform an operation or data processing relating to communication. The processor 1020 executes one or more programs stored in the memory 1030.
The memory 1030 may include a volatile and/or non-volatile memory. The memory 1030 stores information, such as one or more of commands, data, programs (one or more instructions), applications 1034, etc., which are related to at least one other component of the electronic device 1000 and for driving and controlling the electronic device 1000. For example, commands and/or data may formulate an operating system (OS) 1032. Information stored in the memory 1030 may be executed by the processor 1020.
The applications 1034 include the above-discussed embodiments. These functions can be performed by a single application or by multiple applications that each carry out one or more of these functions.
The display 1050 includes, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 1050 can also be a depth-aware display, such as a multi-focal display. The display 1050 is able to present, for example, various contents, such as text, images, videos, icons, and symbols.
The interface 1040 may include input/output (I/O) interface 1042, communication interface 1044, and/or one or more sensors 1046. The I/O interface 1042 serves as an interface that can, for example, transfer commands and/or data between a user and/or other external devices and other component(s) of the electronic device 1000.
The sensor(s) 1046 can meter a physical quantity or detect an activation state of the electronic device 1000 and convert metered or detected information into an electrical signal. For example, the sensor(s) 1046 can include one or more cameras or other imaging sensors for capturing images of scenes. The sensor(s) 1046 can also include any one or any combination of a microphone, a keyboard, a mouse, one or more buttons for touch input, a gyroscope or gyro sensor, an air pressure sensor, a magnetic sensor or magnetometer, an acceleration sensor or accelerometer, a grip sensor, a proximity sensor, a color sensor (such as a red green blue (RGB) sensor), a bio-physical sensor, a temperature sensor, a humidity sensor, an illumination sensor, an ultraviolet (UV) sensor, an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an ultrasound sensor, an iris sensor, and a fingerprint sensor. The sensor(s) 1046 can further include an inertial measurement unit. In addition, the sensor(s) 1046 can include a control circuit for controlling at least one of the sensors included herein. Any of these sensor(s) 1046 can be located within or coupled to the electronic device 1000. The sensors 1046 may be used to detect touch input, gesture input, and hovering input, using an electronic pen or a body portion of a user, etc.
The communication interface 1044, for example, is able to set up communication between the electronic device 1000 and an external electronic device or a server. The communication interface 1044 can be connected with a network through a wireless or wired communication architecture to communicate with the external electronic device. The communication interface 1044 can be a wired or wireless transceiver or any other component for transmitting and receiving signals.
Figure 11 is a flowchart illustrating a method of creating semantic 3D building augmentation according to an embodiment of the disclosure.
The method 1100 may be performed by at least one processor of the electronic device 1000 of Figure 10.
As shown in Figure 11, in operation 1110, the method 1100 includes acquiring a shapefile of an area and street view images taken in that area.
In operation 1120, the method 1100 includes converting the shapefile to a triangular mesh and computing camera parameters.
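As an illustrative sketch of operation 1120 (not part of the disclosed embodiments), the building footprint polygon from the shapefile may be extruded to the known building height and fan-triangulated, with camera intrinsics derived from an assumed field of view. All identifiers, footprint coordinates, and numeric values below are hypothetical.

```python
import numpy as np

def extrude_footprint(footprint, height):
    """Extrude a 2D building footprint (assumed convex, listed
    counter-clockwise) into a triangular mesh: side walls plus a
    fan-triangulated flat roof."""
    n = len(footprint)
    base = np.hstack([footprint, np.zeros((n, 1))])        # ground ring, z = 0
    top = np.hstack([footprint, np.full((n, 1), height)])  # roof ring, z = height
    verts = np.vstack([base, top])
    tris = []
    for i in range(n):                                     # two triangles per wall quad
        j = (i + 1) % n
        tris += [[i, j, n + j], [i, n + j, n + i]]
    for i in range(1, n - 1):                              # roof triangle fan
        tris.append([n, n + i, n + i + 1])
    return verts, np.array(tris)

def pinhole_intrinsics(width, height, fov_deg):
    """Intrinsic matrix for a pinhole camera with an assumed horizontal FOV."""
    fx = width / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fx, height / 2.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical 10 m x 6 m rectangular footprint, 9 m tall, 90-degree camera.
footprint = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 6.0], [0.0, 6.0]])
verts, tris = extrude_footprint(footprint, 9.0)
K = pinhole_intrinsics(640, 480, 90.0)
```

The fan triangulation is only valid for convex footprints; concave footprints would need a general polygon triangulation, which is omitted here for brevity.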
In operation 1130, the method 1100 includes extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The occluded regions are recovered by inpainting the occlusions. The automated 2D semantic segmentation includes predicting walls and windows that are part of buildings, predicting occluders, such as trees and cars, that obstruct parts of buildings, inpainting the parts of the building blocked by the occluders, and detecting houses and buildings in order to isolate the building of interest from the label image.
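The following is a minimal sketch of the inpainting step, assuming the segmentation yields an integer label map and a binary occluder mask. The iterative majority-vote fill is an illustrative stand-in for whatever learned inpainting model an embodiment may use, and the class ids are assumptions.

```python
import numpy as np

WALL, WINDOW, UNKNOWN = 1, 2, 0   # hypothetical class ids

def inpaint_occluded_labels(labels, occluder_mask, max_iters=256):
    """Grow wall/window labels into pixels hidden by occluders, filling each
    unknown pixel with the majority label among its labelled 4-neighbours.
    Edge wrap-around from np.roll is ignored for brevity."""
    classes = (WALL, WINDOW)
    out = labels.copy()
    out[occluder_mask] = UNKNOWN                  # occluded pixels become unknown
    for _ in range(max_iters):
        unknown = out == UNKNOWN
        if not unknown.any():
            break
        votes = np.zeros((len(classes),) + out.shape, dtype=np.int32)
        for k, cls in enumerate(classes):
            hit = (out == cls).astype(np.int32)
            votes[k] = (np.roll(hit, 1, 0) + np.roll(hit, -1, 0) +
                        np.roll(hit, 1, 1) + np.roll(hit, -1, 1))
        fill = unknown & (votes.sum(axis=0) > 0)  # unknowns touching a label
        if not fill.any():
            break                                 # isolated region; stop growing
        out[fill] = np.asarray(classes)[votes[:, fill].argmax(axis=0)]
    return out

# Hypothetical tiny facade: one column of wall pixels hidden by a "tree".
lbl = np.full((4, 4), WALL); lbl[:, 2] = UNKNOWN
mask = np.zeros((4, 4), dtype=bool); mask[:, 2] = True
filled = inpaint_occluded_labels(lbl, mask)       # column 2 becomes WALL again
```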
In operation 1140, the method 1100 includes projecting the 2D semantic segmentation labels onto the 3D models. The projection includes matching each building triangular mesh to its corresponding images, projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model, and performing pose correction, using the building bounding box, to handle pose errors.
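A minimal sketch of the pinhole projection follows, assuming a known world-to-camera rotation R and translation t together with the intrinsic matrix K from the earlier sketch; self-occlusion handling and the bounding-box pose correction are omitted, and the function name and unknown class id are assumptions.

```python
import numpy as np

def project_labels(verts, K, R, t, label_img, unknown=0):
    """Project mesh vertices through a pinhole camera (world -> camera via
    rotation R and translation t, then intrinsics K) and read the semantic
    label at each projected pixel; off-image or behind-camera vertices
    remain unknown."""
    cam = (R @ verts.T + t.reshape(3, 1)).T              # camera-frame coordinates
    z = np.maximum(cam[:, 2:3], 1e-6)                    # guard the perspective divide
    uv = (K @ cam.T).T[:, :2] / z
    u = np.round(uv[:, 0]).astype(np.int64)
    v = np.round(uv[:, 1]).astype(np.int64)
    h, w = label_img.shape
    visible = (cam[:, 2] > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    vert_labels = np.full(len(verts), unknown, dtype=label_img.dtype)
    vert_labels[visible] = label_img[v[visible], u[visible]]
    return vert_labels
```

This sketch labels vertices directly; an embodiment could equally sample at triangle centroids or rasterize per-face, which trades resolution against cost.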
In operation 1150, the method 1100 includes post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models. The post-processing includes capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image, performing 2D post-processing to complete the labels, and reprojecting the final 2D labels back onto the 3D model. The 2D post-processing includes extracting information about the microstructures and correcting rendering errors; view processing, comprising label completion of each generated view, in which the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and completing the labels using all views, which includes asserting wall continuity across views.
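A minimal sketch of the wall-label propagation within a single rendered view follows, assuming the wall/roof boundary in each image column is taken to be the topmost wall pixel in that column; the class ids reuse the assumptions from the earlier sketches.

```python
import numpy as np

WALL, UNKNOWN = 1, 0   # hypothetical class ids, as above

def propagate_wall_labels(view):
    """Within one rendered facade view, take each column's topmost wall
    pixel as the wall/roof boundary and relabel the unknown pixels below
    it as wall."""
    out = view.copy()
    for col in range(out.shape[1]):
        rows = np.flatnonzero(out[:, col] == WALL)
        if rows.size == 0:
            continue                       # no wall evidence in this column
        boundary = rows[0]                 # row 0 is the top of the image
        column = out[boundary:, col]       # a slice view: writes go through to `out`
        column[column == UNKNOWN] = WALL
    return out
```

The cross-view completion step would then compare the columns shared by adjacent facade views and keep wall runs consistent across them; that multi-view merge is not shown here.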
According to an aspect of the disclosure, a method of creating semantic 3D building augmentation is provided. The method may include acquiring a shapefile of an area and street view images taken in that area. The method may include converting the shapefile to a triangular mesh and computing camera parameters. The method may include extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The method may include projecting the 2D semantic segmentation labels onto 3D models. The method may include post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment of the disclosure, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment of the disclosure, the method may include predicting walls and windows that are part of buildings. The method may include predicting occluders, such as trees and cars, that obstruct parts of buildings. The method may include inpainting the parts of the building possibly blocked by the occluders. The method may include detecting houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment of the disclosure, the method may include matching each building triangular mesh to its corresponding images. The method may include projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The method may include performing pose correction, using the building bounding box, to handle pose errors.
According to an embodiment of the disclosure, the method may include capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image. The method may include 2D post-processing for completing the labels. The method may include reprojecting the final 2D labels back onto the 3D model.
According to an embodiment of the disclosure, the method may include extracting information about the microstructures and correcting rendering errors. The method may include view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The method may include completing the labels using all views, which includes asserting wall continuity across views.
According to an aspect of the disclosure, an electronic device including at least one memory configured to store instructions and at least one processor is provided. The at least one processor may be configured, when executing the instructions, to acquire a shapefile of an area and street view images taken in that area. The at least one processor may be configured to convert the shapefile to a triangular mesh and compute camera parameters. The at least one processor may be configured to extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The at least one processor may be configured to project the 2D semantic segmentation labels onto 3D models. The at least one processor may be configured to post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment, the at least one processor may be configured to predict walls and windows that are part of buildings. The at least one processor may be configured to predict occluders, such as trees and cars, that obstruct parts of buildings. The at least one processor may be configured to inpaint the parts of the building possibly blocked by the occluders. The at least one processor may be configured to detect houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment, the at least one processor may be configured to match each building triangular mesh to its corresponding images. The at least one processor may be configured to project the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The at least one processor may be configured to perform pose correction, using the building bounding box, to handle pose errors.
According to an embodiment, the at least one processor may be configured to capture mesh views, in which each side of the initially labelled 3D model is rendered to an image. The at least one processor may be configured to perform 2D post-processing for completing the labels. The at least one processor may be configured to reproject the final 2D labels back onto the 3D model.
According to an embodiment, the at least one processor may be configured to extract information about the microstructures and correct rendering errors. The at least one processor may be configured to perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The at least one processor may be configured to complete the labels using all views, which includes asserting wall continuity across views.
According to an aspect of the disclosure, a machine-readable medium containing instructions executable by at least one processor of an electronic device is provided. The machine-readable medium may cause the at least one processor to acquire a shapefile of an area and street view images taken in that area. The machine-readable medium may cause the at least one processor to convert the shapefile to a triangular mesh and compute camera parameters. The machine-readable medium may cause the at least one processor to extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The machine-readable medium may cause the at least one processor to project the 2D semantic segmentation labels onto 3D models. The machine-readable medium may cause the at least one processor to post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment of the disclosure, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment, the machine-readable medium may cause the at least one processor to predict walls and windows that are part of buildings. The machine-readable medium may cause the at least one processor to predict occluders, such as trees and cars, that obstruct parts of buildings. The machine-readable medium may cause the at least one processor to inpaint the parts of the building possibly blocked by the occluders. The machine-readable medium may cause the at least one processor to detect houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment, the machine-readable medium may cause the at least one processor to match each building triangular mesh to its corresponding images. The machine-readable medium may cause the at least one processor to project the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The machine-readable medium may cause the at least one processor to perform pose correction, using the building bounding box, to handle pose errors.
According to an embodiment, the machine-readable medium may cause the at least one processor to capture mesh views, in which each side of the initially labelled 3D model is rendered to an image. The machine-readable medium may cause the at least one processor to perform 2D post-processing for completing the labels. The machine-readable medium may cause the at least one processor to reproject the final 2D labels back onto the 3D model.
According to an embodiment, the machine-readable medium may cause the at least one processor to extract information about the microstructures and correct rendering errors. The machine-readable medium may cause the at least one processor to perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The machine-readable medium may cause the at least one processor to complete the labels using all views, which includes asserting wall continuity across views.
Claims (15)
- A method of creating semantic 3D building augmentation, the method comprising:
acquiring a shapefile of an area and street view images taken in that area;
converting the shapefile to a triangular mesh and computing camera parameters;
extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
projecting the 2D semantic segmentation labels onto 3D models; and
post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The method according to claim 1, wherein the occluded regions are extracted from the street view images by inpainting the occlusions.
- The method according to any one of claims 1 to 2, wherein the automated 2D semantic segmentation comprises:
predicting walls and windows that are part of buildings;
predicting occluders, such as trees and cars, that obstruct parts of buildings;
inpainting the parts of the building possibly blocked by the occluders; and
detecting houses and buildings in order to isolate the building of interest from the label image.
- The method according to any one of claims 1 to 3, wherein the projecting of the 2D semantic segmentation labels onto the 3D models comprises:
matching each building triangular mesh to its corresponding images;
projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
performing pose correction, using the building bounding box, to handle pose errors.
- The method according to any one of claims 1 to 4, wherein said post-processing comprises:
capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image;
2D post-processing for completing the labels; and
reprojecting the final 2D labels back onto the 3D model.
- The method according to claim 5, wherein said 2D post-processing further comprises:
extracting information about the microstructures and correcting rendering errors;
view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and
completing the labels using all views, including asserting wall continuity across views.
- An electronic device comprising:
at least one memory configured to store instructions; and
at least one processor configured, when executing the instructions, to:
acquire a shapefile of an area and street view images taken in that area;
convert the shapefile to a triangular mesh and compute camera parameters;
extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
project the 2D semantic segmentation labels onto 3D models; and
post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The electronic device of claim 7, wherein the occluded regions are extracted from the street view images by inpainting the occlusions.
- The electronic device of any one of claims 7 to 8, wherein the at least one processor is configured to:
predict walls and windows that are part of buildings;
predict occluders, such as trees and cars, that obstruct parts of buildings;
inpaint the parts of the building possibly blocked by the occluders; and
detect houses and buildings in order to isolate the building of interest from the label image.
- The electronic device of any one of claims 7 to 9, wherein the at least one processor is configured to:
match each building triangular mesh to its corresponding images;
project the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
perform pose correction, using the building bounding box, to handle pose errors.
- The electronic device of any one of claims 7 to 10, wherein the at least one processor is configured to:
capture mesh views, in which each side of the initially labelled 3D model is rendered to an image;
perform 2D post-processing for completing the labels; and
reproject the final 2D labels back onto the 3D model.
- The electronic device of claim 11, wherein the at least one processor is configured to:
extract information about the microstructures and correct rendering errors;
perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and
complete the labels using all views, including asserting wall continuity across views.
- A machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device to:
acquire a shapefile of an area and street view images taken in that area;
convert the shapefile to a triangular mesh and compute camera parameters;
extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
project the 2D semantic segmentation labels onto 3D models; and
post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The machine-readable medium of claim 13, wherein the instructions cause the at least one processor of the electronic device to:
predict walls and windows that are part of buildings;
predict occluders, such as trees and cars, that obstruct parts of buildings;
inpaint the parts of the building possibly blocked by the occluders; and
detect houses and buildings in order to isolate the building of interest from the label image.
- The machine-readable medium of any one of claims 13 to 14, wherein the instructions cause the at least one processor of the electronic device to:
match each building triangular mesh to its corresponding images;
project the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
perform pose correction, using the building bounding box, to handle pose errors.