WO2023038369A1 - Semantic three-dimensional (3D) building augmentation - Google Patents
- Publication number
- WO2023038369A1 (PCT/KR2022/013187, KR2022013187W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- building
- labels
- semantic
- electronic device
- buildings
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/05—Geographic models
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Geometry (AREA)
- Multimedia (AREA)
- Computer Graphics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Remote Sensing (AREA)
- Processing Or Creating Images (AREA)
Abstract
The subject invention discloses a method to semantically label 3D models of buildings from the shapefile of an area and street view images taken in that area. The invention can further semantically segment images into building parts, including occluded regions. Moreover, the invention can project the 2D semantic segmentation labels onto the 3D models.
Description
The disclosure generally relates to image processing systems and, more specifically, to a system for performing semantic three-dimensional (3D) building augmentation.
With the advent of 5G (fifth-generation) networks, there is a need for 3D models of buildings with semantic information for wave propagation modeling. The disclosure may provide a method that creates semantically labelled 3D models of entire cities from geo-located street view images and geolocation-related building data with building elevation. These data may include the geolocation and the general shape of each building (e.g., its height and building-footprint shapefiles).
Related art uses 3D LIDAR data and point cloud data as input to create semantically labelled 3D models; such data are not as readily available as street view images and shapefiles. Related art also does not segment microstructures in the building, such as windows and doors, which are important for creating 3D models accurate enough for simulation.
Lastly, no related art addresses the issue that arises when objects and structures such as trees and posts occlude the building.
Creating segmented 3D buildings of entire cities can be used not only for telecommunications planning but also for urban planning, autonomous vehicle navigation, indoor robot navigation, noise propagation simulation, solar radiation calculation, real estate trends, construction supplies demand estimation, and enforcing building standards.
One related approach, semantic 3D reconstruction with learned multi-view stereo and 2D segmentation of aerial images, outlines a pipeline for constructing a 3D point cloud from a set of 2D images. From the input images, the following are acquired: 2D segmentation of the images, an estimation of the camera source location, and an estimation of the depths of objects of interest in the images. An initial point cloud is generated by combining the 2D segmentation results and depth maps. Labels are assigned on the point cloud through multi-view consistency. To remove noise from the point cloud, post-processing uses a graph-based method to establish connected points. The inputs are drone-captured images containing vegetation, buildings, roads, vehicles, and background. Occlusions are handled by comparing each image's depth map with nearby depth maps and their corresponding 2D segmentation results.
Another related approach, deep projective 3D semantic segmentation, segments a 3D point cloud by generating images from the point cloud input, each image corresponding to a different view of the point cloud. A point-splatting method is used to create these images. The 2D representations (images) are then segmented, and the segmentation labels are reprojected onto the 3D point cloud. The flow is thus 3D point cloud to 2D images for segmentation, then back to a 3D point cloud with segmentation labels.
US8284190B2 discloses a registration of street-level imagery to 3D building models that corrects the origin point (camera coordinates) of a 2D street view image by optimizing a cost function based on the alignment of the edges of projected 2D buildings to their 3D model counterparts. It involves the extraction of building features from 2D street view images and their projection onto their respective 3D models. Specifically, 3D LIDAR data along with a LIDAR edge detection method is used to identify the building edges and skyline in the 2D image. After projecting the 2D features to the 3D model, the distance error between edges is used in the cost function to regress and correct the camera coordinates. Its inputs are custom street view images, 3D building models, and 3D LIDAR data; the extracted features are the building edges and the skyline (the separation between the top of the building and the sky).
US10643380B2 discloses generating multi-dimensional building models from ground-level images, wherein a 3D point cloud is created from ground-level images covering multiple building views. Vertices that correspond to building edges are manually or semi-automatically labelled in the 3D point cloud, such that these vertices form the edges of a planar surface. Surfaces are used to create simple facade geometry and are textured. Non-edge points are correlated to planar surfaces, and the surfaces are adjusted to fit the correlated points. The surfaces are used to reconstruct a textured 3D building model. Its inputs are orthogonal, ground-level images, with manual or semi-automatic selection of edges and planar surfaces.
US 2001/0038718 A1 discloses a method and apparatus for performing geo-spatial registration of imagery, i.e., a system and method for accurately mapping between image coordinates and geo-coordinates. The system utilizes the imagery and terrain information contained in the geo-spatial database to precisely align geodetically calibrated reference imagery with an input image, e.g., dynamically generated video images, and thus achieve high-accuracy identification of locations within the scene. When a sensor, such as a video camera, images a scene contained in the geo-spatial database, the system recalls a reference image pertaining to the imaged scene. This reference image is aligned very accurately with the sensor's images using a parametric transformation. Thereafter, other information associated with the reference image can easily be overlaid upon or otherwise associated with the sensor imagery. However, US 2001/0038718 A1 fails to disclose automated semantic segmentation of building features and microstructures, e.g., pillars, stairs, and doors that are part of the building.
Related art is unable to semantically label 3D building models through 2D-to-3D projection of street view images, and it lacks focus on semantically labelling microstructures on buildings, such as windows or doors. Other related work uses 3D LIDAR data and 3D point clouds, which are not as readily available as 2D street view images and shapefiles: obtaining LIDAR data requires aerial drones and LIDAR sensors, while street view images of buildings can be taken by a street-level, geolocated camera. Further, related art uses heuristics-guided methods for extracting the building from the 2D street view images, primarily using building edges and skylines, and does not address problem cases where occluders, such as trees or electricity posts, obscure the view of the buildings in the image.
It is therefore a principal object of the subject invention to overcome the aforementioned drawbacks of the cited related art by providing a method that labels, among other things, defined microstructures present in the building and unlabelled or initially unseen faces of the initial model, in order to provide a more complete set of semantic labels for 3D models.
The subject invention discloses the use of street view images and 3D building geodata from map vendors; automated semantic segmentation of building features and microstructures, with particular interest in microstructures, e.g., pillars, stairs, and doors that are part of the building; removal of occlusions by way of inpainting; and segmentation of the 2D images and building map followed by projection to a 3D mesh with segmentation labels.
The subject invention can semantically segment images into building parts including occluded regions. It can also project the 2D semantic segmentation labels to the 3D models and post-process initial semantic label projection, which includes, but is not limited to, defined microstructures present in the building, and unlabelled or initially unseen faces of the initial model, in order to provide a more complete set of semantic labels for 3D models.
The subject invention can be used to create semantically labelled 3D models for 5G wave propagation modeling and other telecommunications planning tasks. Based on initial results, the method is able to project semantic labels from street view images to 3D models of buildings using only shapefiles with building elevation.
The labelled 3D models output by the subject invention can further be used to estimate the construction supplies cities will need in the future, as building parts such as windows and doors have finite lifetimes.
The subject invention can be used for real estate trend analysis, wherein architectural trends and the age of buildings can be inferred from the labelled 3D models.
The labelled 3D models created by the subject invention can further be used to enforce building standards in different cities.
According to an aspect of the disclosure, a method of creating semantic 3D building augmentation is provided. The method may include acquiring a shapefile of an area and street view images taken in that area. The method may include converting the shapefile to a triangular mesh and computing camera parameters. The method may include extracting from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The method may include projecting the 2D semantic segmentation labels to 3D models. The method may include post-processing the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
According to an aspect of the disclosure, an electronic device including at least one memory configured to store instructions and at least one processor is provided. The at least one processor may be configured, when executing the instructions, to acquire a shapefile of an area and street view images taken in that area. The at least one processor may be configured to convert the shapefile to a triangular mesh and compute camera parameters. The at least one processor may be configured to extract from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The at least one processor may be configured to project the 2D semantic segmentation labels to 3D models. The at least one processor may be configured to post-process the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
According to an aspect of the disclosure, a machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device to perform operations is provided. The instructions may cause the at least one processor to acquire a shapefile of an area and street view images taken in that area. The instructions may cause the at least one processor to convert the shapefile to a triangular mesh and compute camera parameters. The instructions may cause the at least one processor to extract from the street view images the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The instructions may cause the at least one processor to project the 2D semantic segmentation labels to 3D models. The instructions may cause the at least one processor to post-process the initial semantic label projection, adapted to provide a more complete set of semantic labels for the 3D models.
Other objects and advantages of the present invention will become apparent upon reading the detailed description taken together with the accompanying drawings.
Figure 1 is a flow diagram illustrating an embodiment of a method of creating semantic 3D building augmentation.
Figure 2 is a block diagram with illustrative views of the 2D semantic segmentation.
Figure 3 is an example block diagram of the method of 2D to 3D projection.
Figure 4 is an example block diagram of the method of the post-processing.
Figure 5 is an illustrative sample of 2D semantic segmentation.
Figure 6 is an illustrative sample of house detection.
Figure 7 illustrates sample results of shapefile splitting and matching of street view images.
Figure 8 illustrates sample results of 2D projection of texture and semantic label pixels.
Figure 9 illustrates sample results of post-processing.
Figure 10 is a block diagram of an electronic device according to embodiments; and
Figure 11 is a flowchart illustrating a method of creating semantic 3D building augmentation according to an embodiment of the disclosure.
The following detailed description should be read with reference to the appended drawings, in which like elements in different drawings are numbered identically. It will be understood that the embodiments shown in the drawings and described herein are merely for illustrative purposes and are not intended to limit the application to any embodiment. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the scope of the application as defined by the appended claims.
As shown in Figure 1, the method of creating semantic 3D building augmentation 100 comprises acquiring input data for an area, comprising a shapefile of building elevation 101, the camera location and field of view 102, and street view images 103. The shapefile of building elevation 101 is converted to a triangular mesh 104, while the camera parameters, including the camera intrinsics and extrinsics 105, are computed from the camera location and field of view 102. Using automated 2D semantic segmentation 106, the pixelwise location of the building and its features, including regions occluded by artifacts such as trees, people, and cars, is extracted from the street view images 103. The triangular mesh 104 and the labels extracted from the street view images 103 are combined by projecting the 2D semantic segmentation labels to 3D models 107, which then undergo post-processing 108 of the initial semantic label projection to provide a more complete set of semantic labels for the 3D models 109.
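By way of illustration, the conversion of the building-elevation shapefile 101 into the triangular mesh 104 can be sketched as a simple polygon extrusion. The patent does not specify an implementation; the libraries, function name, and sample footprint below are assumptions.

```python
# Illustrative sketch only, not the patent's implementation: extrude a building
# footprint from the shapefile 101 into the triangular mesh 104.
from shapely.geometry import Polygon
import trimesh

def footprint_to_mesh(footprint_xy, elevation_m):
    """Extrude a 2D building footprint (local metric coordinates) by its elevation."""
    polygon = Polygon(footprint_xy)  # footprint ring read from a shapefile record
    # trimesh triangulates the footprint and extrudes it into a closed mesh
    # whose walls, roof, and floor are triangles.
    return trimesh.creation.extrude_polygon(polygon, height=elevation_m)

# Hypothetical 12 m x 8 m footprint with a 9 m building elevation.
mesh = footprint_to_mesh([(0, 0), (12, 0), (12, 8), (0, 8)], elevation_m=9.0)
print(len(mesh.vertices), "vertices,", len(mesh.faces), "triangles")
```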
As shown in Figure 2, the 2D semantic segmentation 106 is used to extract from the street view images the pixelwise location of the building and its features. Ideally, this could be done by training a network directly on the dataset and using the output of the network as the labelled image. However, in most cases, buildings and houses are occluded in the image by artifacts such as trees, people, cars, etc. This results in a loss of information relevant to the final 3D model output, which is the whole house and its microstructures projected to 3D without the occlusions.
The 2D semantic segmentation 106 further comprises the following steps. First, a mask and a masked image of the occluded regions 111 are generated by predicting walls and windows as parts of the buildings, together with the occluded regions, from the semantic segmentation of the base image 110 using PSPNet, as disclosed by Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017), "Pyramid scene parsing network," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881-2890, https://arxiv.org/abs/1612.01105. Thereafter, inpainting 112 recovers the possible parts of the building that were blocked by the occluded regions using DeepFill, as disclosed by Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., & Huang, T. S. (2019), "Free-form image inpainting with gated convolution," Proceedings of the IEEE International Conference on Computer Vision, pp. 4471-4480, https://arxiv.org/abs/1806.03589. The inpainted image 112 is then subjected to semantic segmentation 113, and the bounding box for house detection is generated 114, wherein the detected houses and buildings isolate the building of interest from the label image using Faster R-CNN, as disclosed by Ren, S., He, K., Girshick, R., & Sun, J. (2015), "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, pp. 91-99, https://arxiv.org/abs/1506.01497.
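The following is a minimal sketch of the segment-mask-inpaint-detect flow 110-114, with off-the-shelf stand-ins for the cited networks: a torchvision segmentation model in place of PSPNet, OpenCV's classical inpainting in place of DeepFill, and torchvision's Faster R-CNN for house detection. The class indices, detection threshold, and function names are assumptions for illustration, not values from the patent.

```python
# Hedged sketch of the occlusion-aware segmentation flow 110-114 using
# off-the-shelf stand-ins (torchvision segmentation for PSPNet, cv2.inpaint
# for DeepFill, torchvision Faster R-CNN for house detection 114).
import cv2
import numpy as np
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.transforms.functional import normalize

seg_net = deeplabv3_resnet50(weights="DEFAULT").eval()       # stand-in for PSPNet 110/113
det_net = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # stand-in for Faster R-CNN 114
OCCLUDER_IDS = {7, 15}  # assumed ids for "car" and "person" in the stand-in's label map

def segment(image_rgb):
    """Return a pixelwise class map for an HxWx3 uint8 RGB image."""
    x = torch.from_numpy(image_rgb).float().permute(2, 0, 1) / 255.0
    x = normalize(x, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        logits = seg_net(x.unsqueeze(0))["out"][0]
    return logits.argmax(0).numpy()

def occlusion_aware_labels(image_rgb):
    labels = segment(image_rgb)                                  # base segmentation 110
    mask = np.isin(labels, list(OCCLUDER_IDS)).astype(np.uint8)  # occluder mask 111
    inpainted = cv2.inpaint(image_rgb, mask, 5, cv2.INPAINT_TELEA)  # inpainting 112
    labels = segment(inpainted)                                  # re-segmentation 113
    x = torch.from_numpy(inpainted).float().permute(2, 0, 1) / 255.0
    with torch.no_grad():
        det = det_net([x])[0]                                    # house detection 114
    boxes = det["boxes"][det["scores"] > 0.7]                    # keep confident detections
    return labels, boxes
```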
As shown in Figure 3, the projection 107 of the 2D semantic segmentation labels to 3D models comprises the steps of matching building triangular meshes to their corresponding images 115, projecting the 2D semantic labels onto the building triangular mesh 116 using a pinhole camera model, and processing pose correction 117, adapted to handle errors using the building bounding box.
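As a sketch of the pinhole projection 116, one could derive the intrinsic matrix from the field of view 102 and project mesh face centroids into the label image. All variable names are illustrative, and a real implementation would also need visibility (z-buffer) testing, which is omitted here for brevity.

```python
# Sketch of label projection 116 with a pinhole camera model. The intrinsics
# are derived from the horizontal field of view 102; (R, t) correspond to the
# camera extrinsics 105. Visibility testing is omitted for brevity.
import numpy as np

def intrinsics_from_fov(width, height, hfov_rad):
    """Pinhole intrinsic matrix K for a camera with the given horizontal FOV."""
    fx = (width / 2.0) / np.tan(hfov_rad / 2.0)
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fx, height / 2.0],  # assume square pixels: fy = fx
                     [0.0, 0.0, 1.0]])

def project_labels(vertices, faces, label_image, K, R, t):
    """Assign each face the 2D semantic label under its projected centroid."""
    centroids = vertices[faces].mean(axis=1)  # (F, 3) world-space face centroids
    cam = R @ centroids.T + t.reshape(3, 1)   # world -> camera coordinates
    uvw = K @ cam                             # pinhole projection
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    h, w = label_image.shape
    face_labels = np.full(len(faces), -1)     # -1 marks faces left for post-processing 108
    ok = (uvw[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    face_labels[ok] = label_image[v[ok].astype(int), u[ok].astype(int)]
    return face_labels
```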
The post-processing augments the initial labelled 3D model produced by the 2D-to-3D projection, which may be incompletely labelled. The post-processing block uses heuristics based on assumptions about buildings to complete the labels on the 3D model. As shown in Figure 4, the post-processing 108 comprises capturing mesh views 118, where each side of the initially labelled 3D model is rendered to an image, and 2D post-processing 119, where the labels are completed. The 2D post-processing is further divided into the following processes: pre-processing 120, which involves extracting information about the microstructures, such as but not limited to windows and doors, and correcting rendering errors; and view processing 121, which involves label completion for each generated view. In view processing, the horizontal boundary between the wall and the roof is found, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls.
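A minimal sketch of the wall-propagation heuristic follows. The per-column boundary search and the label codes are assumptions, since the patent only states that unlabelled pixels below the wall/roof boundary are labelled as walls.

```python
# Sketch of the wall-propagation step in view processing 121: in each pixel
# column of a rendered view, find the lowest roof pixel (the wall/roof
# boundary) and relabel unlabelled pixels below it as wall.
import numpy as np

UNLABELLED, WALL, ROOF = 0, 1, 2  # assumed label codes

def propagate_walls(view_labels):
    """Complete wall labels below the wall/roof boundary of one rendered view."""
    out = view_labels.copy()
    for col in range(out.shape[1]):
        roof_rows = np.flatnonzero(out[:, col] == ROOF)
        if roof_rows.size == 0:
            continue                     # no roof in this column; nothing to propagate
        boundary = roof_rows.max()       # lowest roof pixel (image rows grow downward)
        below = out[boundary + 1:, col]  # a view into `out`, so edits apply in place
        below[below == UNLABELLED] = WALL
    return out
```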
The final processing 122 completes the labels using all views as a whole. This includes, but is not limited to, asserting wall continuity across views, after which the label reprojection 123 reprojects the final 2D labels back onto the 3D model.
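One simple way to realize this cross-view completion is a per-face majority vote over the labels observed in each rendered view, sketched below. The voting scheme is an assumption; the patent names the consistency heuristic but not its mechanics.

```python
# Sketch of final processing 122 / label reprojection 123: fuse the per-view
# face labels by majority vote so that labels (e.g., walls) stay consistent
# across views before being written back onto the 3D model.
from collections import Counter

def fuse_view_labels(per_view_face_labels, num_faces, unlabelled=-1):
    """per_view_face_labels: one face-label sequence per rendered mesh view 118."""
    fused = [unlabelled] * num_faces
    for face in range(num_faces):
        votes = [labels[face] for labels in per_view_face_labels
                 if labels[face] != unlabelled]
        if votes:
            fused[face] = Counter(votes).most_common(1)[0][0]
    return fused
```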
In the samples provided in Figure 5, the input image 200 is processed using automated 2D semantic segmentation 106, rendering a segmented view 201. In Figure 6, the input image 202 generates a bounding box for house detection 203, also using automated 2D semantic segmentation 106. In Figure 7, the separated 3D object 204 illustrates shapefile splitting and matching of street view images 205. In Figure 8, the 2D projections 206a, 206b render texture and semantic label pixels 207a, 207b, respectively, through the 2D-to-3D projection process 107. Lastly, referring to Figure 9, the post-processing 108 renders the following: the projected output 208, which is the input to the post-processing; the output after capturing 209 each side of the mesh; the output after pre-processing 210; the output after propagating the walls 211; and, lastly, the re-projected final output 212.
The re-projected final output 212 can be used to create semantically labelled 3D models for 5G wave propagation modeling and other telecommunications planning tasks. Based on initial results, the method is able to project semantic labels from street view images to 3D models of buildings using only shapefiles with building elevation. The output can be used to estimate the supplies cities will need in the future, as building parts such as windows and doors have finite lifetimes. The architectural trends and age of buildings can be inferred from the labelled 3D models, which can then be used for real estate trend analysis. Lastly, the labelled 3D models can be used to enforce building standards in different cities.
Figure 10 is a block diagram of an electronic device 1000 according to embodiments of the disclosure.
Figure 10 is for illustration only, and other embodiments of the electronic device 1000 could be used without departing from the scope of this disclosure. For example, the electronic device 1000 may omit some of the illustrated components (e.g., the interface 1040 or the display 1050) or may additionally include other components.
The electronic device 1000 includes a bus 1010, a processor 1020, a memory 1030, an interface 1040, and a display 1050.
The bus 1010 includes a circuit for connecting the components 1020 to 1050 with one another. The bus 1010 functions as a communication system for transferring data between the components 1020 to 1050 or between electronic devices.
The processor 1020 includes one or more of a central processing unit (CPU), a graphics processor unit (GPU), an accelerated processing unit (APU), a many integrated core (MIC), a field-programmable gate array (FPGA), or a digital signal processor (DSP). The processor 1020 is able to perform control of any one or any combination of the other components of the electronic device 1000, and/or perform an operation or data processing relating to communication. The processor 1020 executes one or more programs stored in the memory 1030.
The memory 1030 may include a volatile and/or non-volatile memory. The memory 1030 stores information, such as one or more of commands, data, programs (one or more instructions), applications 1034, etc., which are related to at least one other component of the electronic device 1000 and for driving and controlling the electronic device 1000. For example, commands and/or data may formulate an operating system (OS) 1032. Information stored in the memory 1030 may be executed by the processor 1020.
The applications 1034 include the above-discussed embodiments. These functions can be performed by a single application or by multiple applications that each carry out one or more of these functions.
The display 1050 includes, for example, a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a quantum-dot light emitting diode (QLED) display, a microelectromechanical systems (MEMS) display, or an electronic paper display. The display 1050 can also be a depth-aware display, such as a multi-focal display. The display 1050 is able to present, for example, various contents, such as text, images, videos, icons, and symbols.
The interface 1040 may include input/output (I/O) interface 1042, communication interface 1044, and/or one or more sensors 1046. The I/O interface 1042 serves as an interface that can, for example, transfer commands and/or data between a user and/or other external devices and other component(s) of the electronic device 1000.
The sensor(s) 1046 can meter a physical quantity or detect an activation state of the electronic device 1000 and convert metered or detected information into an electrical signal. For example, the sensor(s) 1046 can include one or more cameras or other imaging sensors for capturing images of scenes. The sensor(s) 1046 can also include any one or any combination of a microphone, a keyboard, a mouse, one or more buttons for touch input, a gyroscope or gyro sensor, an air pressure sensor, a magnetic sensor or magnetometer, an acceleration sensor or accelerometer, a grip sensor, a proximity sensor, a color sensor (such as a red green blue (RGB) sensor), a bio-physical sensor, a temperature sensor, a humidity sensor, an illumination sensor, an ultraviolet (UV) sensor, an electromyography (EMG) sensor, an electroencephalogram (EEG) sensor, an electrocardiogram (ECG) sensor, an infrared (IR) sensor, an ultrasound sensor, an iris sensor, and a fingerprint sensor. The sensor(s) 1046 can further include an inertial measurement unit. In addition, the sensor(s) 1046 can include a control circuit for controlling at least one of the sensors included herein. Any of these sensor(s) 1046 can be located within or coupled to the electronic device 1000. The sensors 1046 may be used to detect touch input, gesture input, and hovering input, using an electronic pen or a body portion of a user, etc.
The communication interface 1044, for example, is able to set up communication between the electronic device 1000 and an external electronic device or a server. The communication interface 1044 can be connected with a network through a wireless or wired communication architecture to communicate with the external electronic device. The communication interface 1044 can be a wired or wireless transceiver or any other component for transmitting and receiving signals.
Figure 11 is a flowchart illustrating a method of creating semantic 3D building augmentation according to an embodiment of the disclosure.
The method 1100 may be performed by at least one processor of the electronic device 1000 of Figure 10.
As shown in Figure 11, in operation 1110, the method 1100 includes acquiring a shapefile of an area and street view images taken in that area.
In operation 1120, the method 1100 includes converting the shapefile to a triangular mesh and computing camera parameters.
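As an illustrative sketch of operation 1120 (not part of the disclosed embodiments), the building footprint polygon from the shapefile may be extruded to the known building height and fan-triangulated, with camera intrinsics derived from an assumed field of view. All identifiers, footprint coordinates, and numeric values below are hypothetical.

```python
import numpy as np

def extrude_footprint(footprint, height):
    """Extrude a 2D building footprint (assumed convex, listed
    counter-clockwise) into a triangular mesh: side walls plus a
    fan-triangulated flat roof."""
    n = len(footprint)
    base = np.hstack([footprint, np.zeros((n, 1))])        # ground ring, z = 0
    top = np.hstack([footprint, np.full((n, 1), height)])  # roof ring, z = height
    verts = np.vstack([base, top])
    tris = []
    for i in range(n):                                     # two triangles per wall quad
        j = (i + 1) % n
        tris += [[i, j, n + j], [i, n + j, n + i]]
    for i in range(1, n - 1):                              # roof triangle fan
        tris.append([n, n + i, n + i + 1])
    return verts, np.array(tris)

def pinhole_intrinsics(width, height, fov_deg):
    """Intrinsic matrix for a pinhole camera with an assumed horizontal FOV."""
    fx = width / (2.0 * np.tan(np.radians(fov_deg) / 2.0))
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fx, height / 2.0],
                     [0.0, 0.0, 1.0]])

# Hypothetical 10 m x 6 m rectangular footprint, 9 m tall, 90-degree camera.
footprint = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 6.0], [0.0, 6.0]])
verts, tris = extrude_footprint(footprint, 9.0)
K = pinhole_intrinsics(640, 480, 90.0)
```

The fan triangulation is only valid for convex footprints; concave footprints would need a general polygon triangulation, which is omitted here for brevity.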
In operation 1130, the method 1100 includes extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The occluded regions are recovered by inpainting the occlusions. The automated 2D semantic segmentation includes predicting walls and windows that are part of buildings, predicting occluders, such as trees and cars, that obstruct parts of buildings, inpainting the parts of the building blocked by the occluders, and detecting houses and buildings in order to isolate the building of interest from the label image.
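The following is a minimal sketch of the inpainting step, assuming the segmentation yields an integer label map and a binary occluder mask. The iterative majority-vote fill is an illustrative stand-in for whatever learned inpainting model an embodiment may use, and the class ids are assumptions.

```python
import numpy as np

WALL, WINDOW, UNKNOWN = 1, 2, 0   # hypothetical class ids

def inpaint_occluded_labels(labels, occluder_mask, max_iters=256):
    """Grow wall/window labels into pixels hidden by occluders, filling each
    unknown pixel with the majority label among its labelled 4-neighbours.
    Edge wrap-around from np.roll is ignored for brevity."""
    classes = (WALL, WINDOW)
    out = labels.copy()
    out[occluder_mask] = UNKNOWN                  # occluded pixels become unknown
    for _ in range(max_iters):
        unknown = out == UNKNOWN
        if not unknown.any():
            break
        votes = np.zeros((len(classes),) + out.shape, dtype=np.int32)
        for k, cls in enumerate(classes):
            hit = (out == cls).astype(np.int32)
            votes[k] = (np.roll(hit, 1, 0) + np.roll(hit, -1, 0) +
                        np.roll(hit, 1, 1) + np.roll(hit, -1, 1))
        fill = unknown & (votes.sum(axis=0) > 0)  # unknowns touching a label
        if not fill.any():
            break                                 # isolated region; stop growing
        out[fill] = np.asarray(classes)[votes[:, fill].argmax(axis=0)]
    return out

# Hypothetical tiny facade: one column of wall pixels hidden by a "tree".
lbl = np.full((4, 4), WALL); lbl[:, 2] = UNKNOWN
mask = np.zeros((4, 4), dtype=bool); mask[:, 2] = True
filled = inpaint_occluded_labels(lbl, mask)       # column 2 becomes WALL again
```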
In operation 1140, the method 1100 includes projecting the 2D semantic segmentation labels onto the 3D models. The projection includes matching each building triangular mesh to its corresponding images, projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model, and performing pose correction, using the building bounding box, to handle pose errors.
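A minimal sketch of the pinhole projection follows, assuming a known world-to-camera rotation R and translation t together with the intrinsic matrix K from the earlier sketch; self-occlusion handling and the bounding-box pose correction are omitted, and the function name and unknown class id are assumptions.

```python
import numpy as np

def project_labels(verts, K, R, t, label_img, unknown=0):
    """Project mesh vertices through a pinhole camera (world -> camera via
    rotation R and translation t, then intrinsics K) and read the semantic
    label at each projected pixel; off-image or behind-camera vertices
    remain unknown."""
    cam = (R @ verts.T + t.reshape(3, 1)).T              # camera-frame coordinates
    z = np.maximum(cam[:, 2:3], 1e-6)                    # guard the perspective divide
    uv = (K @ cam.T).T[:, :2] / z
    u = np.round(uv[:, 0]).astype(np.int64)
    v = np.round(uv[:, 1]).astype(np.int64)
    h, w = label_img.shape
    visible = (cam[:, 2] > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    vert_labels = np.full(len(verts), unknown, dtype=label_img.dtype)
    vert_labels[visible] = label_img[v[visible], u[visible]]
    return vert_labels
```

This sketch labels vertices directly; an embodiment could equally sample at triangle centroids or rasterize per-face, which trades resolution against cost.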
In operation 1150, the method 1100 includes post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models. The post-processing includes capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image, performing 2D post-processing to complete the labels, and reprojecting the final 2D labels back onto the 3D model. The 2D post-processing includes extracting information about the microstructures and correcting rendering errors; view processing, comprising label completion of each generated view, in which the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and completing the labels using all views, which includes asserting wall continuity across views.
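A minimal sketch of the wall-label propagation within a single rendered view follows, assuming the wall/roof boundary in each image column is taken to be the topmost wall pixel in that column; the class ids reuse the assumptions from the earlier sketches.

```python
import numpy as np

WALL, UNKNOWN = 1, 0   # hypothetical class ids, as above

def propagate_wall_labels(view):
    """Within one rendered facade view, take each column's topmost wall
    pixel as the wall/roof boundary and relabel the unknown pixels below
    it as wall."""
    out = view.copy()
    for col in range(out.shape[1]):
        rows = np.flatnonzero(out[:, col] == WALL)
        if rows.size == 0:
            continue                       # no wall evidence in this column
        boundary = rows[0]                 # row 0 is the top of the image
        column = out[boundary:, col]       # a slice view: writes go through to `out`
        column[column == UNKNOWN] = WALL
    return out
```

The cross-view completion step would then compare the columns shared by adjacent facade views and keep wall runs consistent across them; that multi-view merge is not shown here.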
According to an aspect of the disclosure, a method of creating semantic 3D building augmentation is provided. The method may include acquiring a shapefile of an area and street view images taken in that area. The method may include converting the shapefile to a triangular mesh and computing camera parameters. The method may include extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The method may include projecting the 2D semantic segmentation labels onto 3D models. The method may include post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment of the disclosure, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment of the disclosure, the method may include predicting walls and windows that are part of buildings. The method may include predicting occluders, such as trees and cars, that obstruct parts of buildings. The method may include inpainting the parts of the building possibly blocked by the occluders. The method may include detecting houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment of the disclosure, the method may include matching each building triangular mesh to its corresponding images. The method may include projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The method may include performing pose correction, using the building bounding box, to handle pose errors.
According to an embodiment of the disclosure, the method may include capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image. The method may include 2D post-processing for completing the labels. The method may include reprojecting the final 2D labels back onto the 3D model.
According to an embodiment of the disclosure, the method may include extracting information about the microstructures and correcting rendering errors. The method may include view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The method may include completing the labels using all views, which includes asserting wall continuity across views.
According to an aspect of the disclosure, an electronic device including at least one memory configured to store instructions and at least one processor is provided. The at least one processor may be configured, when executing the instructions, to acquire a shapefile of an area and street view images taken in that area. The at least one processor may be configured to convert the shapefile to a triangular mesh and compute camera parameters. The at least one processor may be configured to extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The at least one processor may be configured to project the 2D semantic segmentation labels onto 3D models. The at least one processor may be configured to post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment, the at least one processor may be configured to predict walls and windows that are part of buildings. The at least one processor may be configured to predict occluders, such as trees and cars, that obstruct parts of buildings. The at least one processor may be configured to inpaint the parts of the building possibly blocked by the occluders. The at least one processor may be configured to detect houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment, the at least one processor may be configured to match each building triangular mesh to its corresponding images. The at least one processor may be configured to project the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The at least one processor may be configured to perform pose correction, using the building bounding box, to handle pose errors.
According to an embodiment, the at least one processor may be configured to capture mesh views, in which each side of the initially labelled 3D model is rendered to an image. The at least one processor may be configured to perform 2D post-processing for completing the labels. The at least one processor may be configured to reproject the final 2D labels back onto the 3D model.
According to an embodiment, the at least one processor may be configured to extract information about the microstructures and correct rendering errors. The at least one processor may be configured to perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The at least one processor may be configured to complete the labels using all views, which includes asserting wall continuity across views.
According to an aspect of the disclosure, a machine-readable medium containing instructions executable by at least one processor of an electronic device is provided. The machine-readable medium may cause the at least one processor to acquire a shapefile of an area and street view images taken in that area. The machine-readable medium may cause the at least one processor to convert the shapefile to a triangular mesh and compute camera parameters. The machine-readable medium may cause the at least one processor to extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation. The machine-readable medium may cause the at least one processor to project the 2D semantic segmentation labels onto 3D models. The machine-readable medium may cause the at least one processor to post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
According to an embodiment of the disclosure, the occluded regions may be extracted from the street view images by inpainting the occlusions.
According to an embodiment, the machine-readable medium may cause the at least one processor to predict walls and windows that are part of buildings. The machine-readable medium may cause the at least one processor to predict occluders, such as trees and cars, that obstruct parts of buildings. The machine-readable medium may cause the at least one processor to inpaint the parts of the building possibly blocked by the occluders. The machine-readable medium may cause the at least one processor to detect houses and buildings in order to isolate the building of interest from the label image.
According to an embodiment, the machine-readable medium may cause the at least one processor to match each building triangular mesh to its corresponding images. The machine-readable medium may cause the at least one processor to project the 2D semantic labels onto the building triangular mesh using a pinhole camera model. The machine-readable medium may cause the at least one processor to perform pose correction, using the building bounding box, to handle pose errors.
According to an embodiment, the machine-readable medium may cause the at least one processor to capture mesh views, in which each side of the initially labelled 3D model is rendered to an image. The machine-readable medium may cause the at least one processor to perform 2D post-processing for completing the labels. The machine-readable medium may cause the at least one processor to reproject the final 2D labels back onto the 3D model.
According to an embodiment, the machine-readable medium may cause the at least one processor to extract information about the microstructures and correct rendering errors. The machine-readable medium may cause the at least one processor to perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired and wall labels are propagated by labeling unlabelled pixels below the boundary as walls. The machine-readable medium may cause the at least one processor to complete the labels using all views, which includes asserting wall continuity across views.
Claims (15)
- A method of creating semantic 3D building augmentation, the method comprising:
acquiring a shapefile of an area and street view images taken in that area;
converting the shapefile to a triangular mesh and computing camera parameters;
extracting, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
projecting the 2D semantic segmentation labels onto 3D models; and
post-processing the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The method according to claim 1, wherein the occluded regions are extracted from the street view images by inpainting the occlusions.
- The method according to any one of claims 1 to 2, wherein the automated 2D semantic segmentation comprises:
predicting walls and windows that are part of buildings;
predicting occluders, such as trees and cars, that obstruct parts of buildings;
inpainting the parts of the building possibly blocked by the occluders; and
detecting houses and buildings in order to isolate the building of interest from the label image.
- The method according to any one of claims 1 to 3, wherein the projecting of the 2D semantic segmentation labels onto the 3D models comprises:
matching each building triangular mesh to its corresponding images;
projecting the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
performing pose correction, using the building bounding box, to handle pose errors.
- The method according to any one of claims 1 to 4, wherein said post-processing comprises:
capturing mesh views, in which each side of the initially labelled 3D model is rendered to an image;
2D post-processing for completing the labels; and
reprojecting the final 2D labels back onto the 3D model.
- The method according to claim 5, wherein said 2D post-processing further comprises:
extracting information about the microstructures and correcting rendering errors;
view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and
completing the labels using all views, including asserting wall continuity across views.
- An electronic device comprising:
at least one memory configured to store instructions; and
at least one processor configured, when executing the instructions, to:
acquire a shapefile of an area and street view images taken in that area;
convert the shapefile to a triangular mesh and compute camera parameters;
extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
project the 2D semantic segmentation labels onto 3D models; and
post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The electronic device of claim 7, wherein the occluded regions are extracted from the street view images by inpainting the occlusions.
- The electronic device of any one of claims 7 to 8, wherein the at least one processor is configured to:
predict walls and windows that are part of buildings;
predict occluders, such as trees and cars, that obstruct parts of buildings;
inpaint the parts of the building possibly blocked by the occluders; and
detect houses and buildings in order to isolate the building of interest from the label image.
- The electronic device of any one of claims 7 to 9, wherein the at least one processor is configured to:
match each building triangular mesh to its corresponding images;
project the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
perform pose correction, using the building bounding box, to handle pose errors.
- The electronic device of any one of claims 7 to 10, wherein the at least one processor is configured to:
capture mesh views, in which each side of the initially labelled 3D model is rendered to an image;
perform 2D post-processing for completing the labels; and
reproject the final 2D labels back onto the 3D model.
- The electronic device of claim 11, wherein the at least one processor is configured to:
extract information about the microstructures and correct rendering errors;
perform view processing comprising label completion of each generated view, wherein the horizontal boundary between the wall and the roof is acquired, and wall labels are propagated by labeling unlabelled pixels below the boundary as walls; and
complete the labels using all views, including asserting wall continuity across views.
- A machine-readable medium containing instructions that, when executed, cause at least one processor of an electronic device to:
acquire a shapefile of an area and street view images taken in that area;
convert the shapefile to a triangular mesh and compute camera parameters;
extract, from the street view images, the pixelwise location of the building and its features, including occluded regions, using automated 2D semantic segmentation;
project the 2D semantic segmentation labels onto 3D models; and
post-process the initial semantic label projection to provide a more complete set of semantic labels for the 3D models.
- The machine-readable medium of claim 13, wherein the instructions cause the at least one processor of the electronic device to:
predict walls and windows that are part of buildings;
predict occluders, such as trees and cars, that obstruct parts of buildings;
inpaint the parts of the building possibly blocked by the occluders; and
detect houses and buildings in order to isolate the building of interest from the label image.
- The machine-readable medium of any one of claims 13 to 14, wherein the instructions cause the at least one processor of the electronic device to:
match each building triangular mesh to its corresponding images;
project the 2D semantic labels onto the building triangular mesh using a pinhole camera model; and
perform pose correction, using the building bounding box, to handle pose errors.