
WO2023007198A1 - Training method for training a change detection system, training set generating method therefor, and change detection system - Google Patents


Info

Publication number
WO2023007198A1
Authority
WO
WIPO (PCT)
Prior art keywords
change
data block
training
information data
registered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/HU2022/050058
Other languages
French (fr)
Inventor
Balázs NAGY
Lóránt KOVÁCS
Csaba BENEDEK
Tamás SZIRÁNYI
Örkény H. ZOVÁTHI
László TIZEDES
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Szamitastechnikai Es Automatizalasi Kutatointezet
Original Assignee
Szamitastechnikai Es Automatizalasi Kutatointezet
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Szamitastechnikai Es Automatizalasi Kutatointezet filed Critical Szamitastechnikai Es Automatizalasi Kutatointezet
Priority to EP22760770.2A priority Critical patent/EP4377913A1/en
Publication of WO2023007198A1 publication Critical patent/WO2023007198A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • the invention relates to a training method for training a change detection system, a training set generating method therefor, and a change detection system.
  • point clouds, as opposed to traditional 2D photos and multispectral images, can be considered a proportionate 3D model of the environment, in which the relative position and size of each object can be determined on a scale identical to that of the world coordinate system.
  • their disadvantage is that the set of points is only a representation of an otherwise continuous surface observable in the world, obtained by a discrete sampling.
  • the sampling characteristics (point density, sampling curves) of different sensors (and sensor settings) might be significantly different.
  • this task can be formulated as a change detection (CD) problem.
  • in video surveillance applications (see C. Benedek, B. Galai, B. Nagy, and Z. Janko, “Lidar-based gait analysis and activity recognition in a 4d surveillance system,” IEEE Trans. Circuits Syst. Video Techn., vol. 28, no. 1, pp. 101-113, 2018, and F. Oberti, L. Marcenaro, and C. S. Regazzoni, “Real-time change detection methods for video-surveillance systems with mobile camera,” in European Signal Processing Conference, 2002, pp. 1-4.), change detection is a standard approach for scene understanding by estimating the background regions and by comparing the incoming frames to this background model.
  • Mobile and terrestrial Lidar sensors can obtain point cloud streams providing accurate 3D geometric information in the observed area.
  • Lidar is used in autonomous driving applications supporting the scene understanding process, and it can also be part of the sensor arrays in ADAS (advanced driver assistance) systems of recent high-end cars. Since the number of vehicles equipped with Lidar sensors is rapidly increasing on the roads, one can utilize the tremendous amount of collected 3D data for scene analysis and complex street-level change detection. Besides, change detection between the recorded point clouds can improve virtual city reconstruction or Simultaneous Localization and Mapping (SLAM) algorithms (see C.-C. Wang and C. Thorpe, “Simultaneous localization and mapping with detection and tracking of moving objects,” in Int. Conf. on Robotics and Automation (ICRA), vol. 3, 2002, pp. 2918-2924.)
  • Processing street-level point cloud streams is often a significantly more complex task than performing change detection in airborne images or Lidar scans. From a street-level point of view, one must expect a larger variety of object shapes and appearances, and more occlusion artifacts between the different objects due to smaller sensor-object distances.
  • the lack of accurate registration between the compared 3D terrestrial measurements may mean a crucial bottleneck for the whole process, for two different reasons: First, in a dense urban environment, GPS/GNSS-based accurate self-localization of the measurement platform is often not possible (see B. Nagy and C. Benedek, “Real-time point cloud alignment for vehicle localization in a high resolution 3d map,” in ECCV 2018 Workshops, LNCS, 2019, pp. 226-239.). Second, the differences in viewpoints and density characteristics between the data samples captured from the considered scene segments may make automated point cloud registration algorithms less accurate (see B. Nagy and C.
  • the primary object of the invention is to provide a training method for training a change detection system, a training set generating method therefor, and a change detection system which are free of the disadvantages of prior art approaches to the greatest possible extent. Furthermore, an object of the invention is to provide a solution for these methods and system applicable to a coarsely registered pair of 3D information data blocks (data blocks may e.g. be point clouds, see herebelow for some specific features). More specifically, the object of the invention (i.e. of our proposed solution) is to extract changes between two coarsely registered sparse Lidar point clouds.
  • the object of the method is to provide a machine learning based solution to compare only coarsely (approximately) registered 3D point clouds made from a given 3D environment and to determine the changed environmental regions without attempting to specify the registration (more generally, without performing registration).
  • the objects of the invention can be achieved by the training method for training a change detection system according to claim 1, the change detection system according to claim 6, and the training set generating method according to claim 7.
  • Preferred embodiments of the invention are defined in the dependent claims.
  • the task of change detection is solved by the invention e.g. for real Lidar point cloud- (generally, for 3D information data block) based change detection problems.
  • the later registration step is critical for real-world 3D perception problems, since the recorded 3D point clouds often have strongly inhomogeneous density, and the blobs of the scanned street-level objects are sparse and incomplete due to occlusions and the availability of particular scanning directions only. Under such challenging circumstances, conventional point-to-point, patch-to-patch, or point-to-patch correspondence-based registration strategies often fail (see R. Qin, J. Tian, and P. Reinartz, “3D change detection - Approaches and applications,” ISPRS J. Photogramm. Remote Sens., vol. 122, no. Cd, pp. 41-56, 2016.).
  • this description is the first approach to solve the change detection problem among sparse, coarsely registered terrestrial point clouds, without needing an explicit fine registration step.
  • Our proposed - preferably deep learning-based - method can extract and combine various low-level and high-level features throughout the convolutional layers, and it can learn semantic similarities between the point clouds, leading to its capability of detecting changes without prior registration.
  • a deep neural network-based change detection approach is proposed, which can robustly extract changes between sparse point clouds obtained in a complex street-level environment, i.e. the invention is preferably a deep (learning) network for change detection (alternatively, for detecting changes) in coarsely registered point clouds.
  • the proposed method does not require precise registration of the point cloud pairs. Based on our experiments, it can efficiently handle up to 1m translation and 10° rotation misalignment between the corresponding 3D point cloud frames.
  • the method according to the invention, preferably following human perception, provides a machine-learning-based method for detecting and marking changes in discrete, only coarsely (approximately) registered point clouds.
  • point clouds represent the environment to scale (proportionately) and have a scale factor equal to that of the world coordinate system (for example, the distance between two characteristic points is the same as in the environment, e.g. expressed in centimetres). Furthermore, the point clouds were created (collected, generated by measurement) in the same area, i.e., from almost the same reference point and with a similar orientation; however, their exact reference positions (typically corresponding to the place of data collection) and orientations relative to each other are unknown.
  • Figs. 1A-1B illustrate an exemplary input image pair
  • Figs. 1C-1D illustrate target change images for the inputs of Figs. 1A-1B
  • Fig. 2 illustrates the internal structure of the change detection generator module in an embodiment, with inputs and outputs shown
  • Fig. 3A shows an embodiment of the training method according to the invention
  • Fig. 3B shows a flowchart of the training set generating method according to the invention
  • Figs. 4A-4E are illustrations for the invention in a scene
  • Fig. 5A is a fused target change image for Figs. 1C-1D
  • Fig. 5B is a fused output change image for Figs. 1A-1B
  • Figs. 5C-5D are results for the input of Figs. 1A-1B obtained by prior art approaches
  • Figs. 6A-6H are illustrations for the invention in an exemplary scene
  • Figs. 6I-6J are shadow bars for Figs. 6A-6H.
  • Fig. 7 is a diagram showing the comparison of the results obtained by an embodiment of the change detection system according to the invention and prior art techniques.
  • Some embodiments of the invention relate to a training method for training a change detection system for detecting change for a coarsely registered pair of (or alternatively, having) a first 3D information data block and a second 3D information data block (a 3D information data block may be e.g. a range image or a point cloud, see below for details), wherein
  • the change detection system comprises a change detection generator module (see in Fig. 2 as well as in Fig. 3A its embodiments) based on machine learning (e.g. a neural network in Fig. 3A) and adapted for generating a change data block (e.g. a change image, but changes assigned to points of a point cloud are also conceivable) for a coarsely registered pair of a first 3D information data block and a second 3D information data block (the change detection generator module is naturally - like any machine learning module - adapted for generating a change data block at the very beginning of the training method, since the untrained module starts from a status where preliminary change data blocks can be generated; these change data blocks, however, get better and better as the training progresses).
  • a discriminator module based on machine learning is applied (according to the invention, this is necessary to train the change detection generator module; it has naturally a role during the training method, but no role after the training; see Fig. 3A of an embodiment of discriminator module labelled as discriminator network).
  • the training set preferably comprises range images as 3D information data block and target change images as target change data blocks.
  • the training set generating method preferably starts from a point cloud as a base 3D information data block of a registered base pair (see below), and preferably ends in a training set utilizable for the training method, i.e. range images and corresponding target change images.
  • the training method according to the invention can be interpreted based on Fig. 3A (see far below even more details about Fig. 3A): the figure illustrates the first 3D information data block and the second 3D information data block labelled by inputs.
  • the training method is for training a change detection system for detecting change for a coarsely registered pair; for training this way we naturally need a plurality of coarsely registered pairs and, according to the training strategy, ground truth images corresponding thereto, which are target change data blocks (e.g. target change images) in the framework of the invention; the plurality of coarsely registered pairs and the corresponding plurality of target change data blocks constitute the training set.
  • a target change data block or target change data block pair - see below - naturally corresponds to a specific coarsely registered pair: these are of nearly the same scene/location, thus having correlated content, see also below)
  • - change data blocks are generated by means of the change detection generator module for the plurality of coarsely registered pairs of the training set (see operational step S180 in Fig. 3A in an embodiment),
  • a discriminator loss contribution is generated by applying the discriminator module on a plurality of coarsely registered pairs of the training set, as well as corresponding target change data blocks of the training set and corresponding change data blocks (see operational step S190 in Fig. 3A in an embodiment), and
  • the change detection generator module is trained by a combined loss obtained from a summation of the generator loss contribution and the discriminator loss contribution, wherein in the summation at least one of the generator loss contribution and the discriminator loss contribution is multiplied by a respective loss multiplicator (see operational step S195 corresponding to training in Fig. 3A in an embodiment, where in a combined loss 235 the generator loss contribution 225 is multiplied by a λ loss multiplicator 227 and a discriminator loss contribution 230 is simply added; the loss multiplicator has a predetermined/predefined multiplicator value; in other words, training is performed in operational step S195, accordingly, it is a training step).
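  • As an illustration of such a combined loss, a minimal sketch follows (assumption: a PyTorch-style implementation; the function and argument names and the concrete value of the loss multiplicator are illustrative and not taken from this description).

```python
# Minimal sketch of the combined loss (assumption: PyTorch; names and the lambda value are illustrative).
import torch
import torch.nn.functional as F

def combined_loss(generator, discriminator, input1, input2, target_change, lam=100.0):
    generated_change = generator(input1, input2)

    # Generator loss contribution: pixel-wise L1 difference between the generated
    # change data block and the corresponding target change data block.
    generator_contribution = F.l1_loss(generated_change, target_change)

    # Discriminator loss contribution: the generator improves when the discriminator
    # judges the generated change data block to be "real" (label 1).
    judgement = discriminator(input1, input2, generated_change)
    discriminator_contribution = F.binary_cross_entropy_with_logits(
        judgement, torch.ones_like(judgement))

    # Summation with one contribution multiplied by the loss multiplicator (lambda).
    return lam * generator_contribution + discriminator_contribution
```
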
  • the above steps of the training cycle are performed one after the other in the above order. Furthermore, after the last step (training step), the steps are again started from the first (change data block generation), if a next iteration cycle is started (see below for the termination of the training method).
  • change data block is generated in the first step and used in the second step as a possible input of the discriminator module.
  • the generator and discriminator loss contributions are generated in the first and second step, respectively, and play the role of an input in the third step.
  • the training method is for (in particular, suitable for) training a change detection system.
  • the system is trained for detecting change, i.e. detection of any change which can be revealed between its inputs.
  • the change is detected for a coarsely registered pair of 3D information data blocks.
  • a change data block is generated.
  • a pair of change data blocks are generated as detailed in some examples, but in some applications (e.g. the content of any of change data blocks is not relevant), it is enough to generate only one change data block.
  • this single change data block is processed by the discriminator module, it may be accompanied by a single target change data block.
  • the training cycle is formulated for a plurality of coarsely registered pairs and a plurality of respective target change data blocks of a training set (the latter is many times called an epoch; in a training cycle a batch of training data - being a part of the epoch - is utilized, after which the machine learning module is trained by the calculated loss, e.g. the weights of a neural network are updated; see also below), and this approach is followed in the steps in the above description of the invention, but it is worth showing some details of the steps for a single input and output, i.e. to illustrate the method for a single processing.
  • one (or two, i.e. a pair) change data block is generated for a coarsely registered pair, and by the help of this change data block and the corresponding target change data block (being also part of the ground truth) a member of the generator loss contribution can be calculated (e.g. according to the definition of the L1 loss as given in an example below; the whole loss contribution - here and below for the discriminator - can be generated based on the plurality of coarsely registered pairs); in the generator loss contribution, many times a difference is taken into account for a corresponding combination (the word ‘combination’ is used only to show which are the corresponding entities) of the change data block and the target change data block, thus it may be formulated also on this basis;
  • generation of the discriminator loss contribution is formulated for the plurality of coarsely registered pairs of the training set; for a single processing - as illustrated in Fig. 3A - the discriminator is applied on corresponding sets of a coarsely registered pair, a target change data block and a change data block (see below for details);
  • training is also performed based on the plurality of coarsely registered pairs and the plurality of respective target change data blocks, since these are all taken into account for determining the generator and discriminator loss contributions (in other words, training is performed after a batch is processed in the above steps of a training cycle); furthermore, it is noted that for the summation, the multiplicator may be applied to any of the contributions or to both.
  • an optimizer will determine, based on the loss, at which point (after which training cycle) in the training method it is advantageous to stop (terminate) the training.
  • the λ parameter (loss multiplicator) will handle the loss contributions (it can no longer be seen afterwards which loss contributes to the combined loss).
  • the appropriate value of the loss multiplicator is preferably predetermined in a way that the training is performed for several values of the loss multiplicator, and the appropriate value is selected based on the results.
  • an important indicator is the value of the global loss (the global loss function is preferably a differentiable function). For the global loss, we may require that its value falls below a threshold, but preferably we look for a small change from the previous value or values, as oscillations may occur in the evolution of the loss value.
  • the value of the loss multiplicator is a function of many aspects.
  • at the beginning of the training, the competing networks are essentially random networks. The appropriate value can be found by tests on the training method, so as to check whether the parameter value is able to reach an appropriate balance between the generator and discriminator loss contributions.
  • the loss contributions are generated based on the outputs of the generator and discriminator modules. Separate modules may be dedicated for generating loss (loss generating modules), but this task may be considered also as a task of the generator and discriminator, themselves.
  • the word ‘generator’ may be omitted, i.e. it may be simply called a change detection module.
  • the generator and discriminator modules could also be called first and second competing modules, provided that their functions are defined (i.e. what output(s) are generated based on the inputs).
  • the aforementioned devices (tools) for surveying the 3D environment (Lidar laser scanners, infrared scanners, stereo and multi-view camera systems) and other such devices provide 3D information (i.e. a 3D information data block) of the environment, and any representation (point cloud, range image, etc.) can be extracted from it.
  • the respective module is realized by the required element(s) of a machine learning approach, preferably by neural networks as shown in many examples in the description. All of the machine learning based modules applied in the invention can be realized by the help of neural networks, but the use of alternative machine learning approaches is also conceivable.
  • - ‘coarsely registered’ relates to a data block pair (e.g. a range image pair or a point cloud pair) for which the registration information is not utilized, i.e. the translation and/or rotation which is necessary to bring the images of the image pair into alignment (i.e. their common/overlapping part, where naturally there can nevertheless be changes) is not determined or not utilized;
  • - ‘registered’ relates to a data block pair (e.g. a range image pair or a point cloud pair) which are registered, i.e. the translation and/or rotation which is necessary to bring the images of the image pair into alignment is determined and utilized during the processing of the image pair (registration is another approach based explicitly on using the registration information for processing).
  • the change detection system becomes - by means of the training method - adapted for detecting changes for a coarsely registered pair, for which the translation and/or rotation by which these could be aligned is not determined, but if this translation and/or rotation were determined, the value of these would be restricted.
  • the data blocks (images) of a coarsely registered image pair are thus unprocessed data blocks (images) from this point of view, i.e. the complicated step of registering is not performed on them (or the registration information is not used, this latter can also be meant under the meaning of non-registered).
  • An alternative term for "coarsely registered” may be "data block (e.g. point cloud or range image) pair with restricted relative translation and/or rotation".
  • ‘relative’ is clearly meant between the data blocks of the data block pair, i.e. the translation and/or rotation is interpreted between the data blocks of the pair.
  • for the restricted translation and/or rotation, see in an example that it is preferably up to ±1m translation and/or up to ±10° rotation (in the real coordinate system corresponding to them).
  • a registered point set means that it is spatially (or temporally) aligned, i.e. matched. If the exact registration is known, our method is not needed.
  • the inputs to be processed may also include an image pair that is registered, i.e. has the data available to register it (e.g. translation and/or rotation for alignment), in which case these data are not used, as the input to the change detection system is only the coarsely registered pair itself.
  • some embodiments of the invention relate to a change detection system adapted (as a consequence of being trained, see below) for detecting change for a coarsely registered pair of a first 3D information data block and a second 3D information data block, wherein the system
  • the change detection system may also be called change detection architecture. It can be equated with its main (only) module in its trained state, the change detection generator module (in the trained state the system does not comprise the discriminator module), since these - i.e. the change detection system and the change detection generator module - have the same inputs (a coarsely registered pair) and the same output (change data block(s)).
  • a 3D information data block may be any image, such as a range image or depth image, which bears 3D information/3D location information of the parts of the environment (these are 3D information images, but they could also be called simply 3D images), or any other data structure bearing 3D information, such as a point cloud.
  • the coarsely registered pair of the first 3D information data block and the second 3D information data block is constituted by a coarsely registered range image pair of a first range image and a second range image (so, the 3D information data block itself may be a range image), or
  • a coarsely registered range image pair of a first range image and a second range image is generated from the coarsely registered pair of the first 3D information data block and the second 3D information data block before the application of the at least one training cycle (accordingly, as an option the range image may also be generated from another 3D information data block, which was e.g. originally a point cloud).
  • various 3D information data blocks may be applied in the invention (it is applicable e.g. also on point clouds); applying range images as 3D information data blocks may be preferred, since the preferably applied convolutions in the machine learning based modules handle such a format more easily (but the convolution is also applicable on a point cloud). Since this is the preferred use, many details are introduced in the following, illustrated on range images. On the most general level these are termed 3D information data blocks, but all features introduced on a less general level, being compatible with the most general level, are considered to be utilizable on the most general level.
  • a change mask image having a plurality of first pixels is generated by the change detection generator module as the change data block and the target change data block is constituted by a target change mask image having a plurality of second pixels, wherein to each of the plurality of first pixels and a plurality of second pixels presence of a change or absence of a change is assigned (accordingly, we have preferably change mask image and target change mask image as a manifestation of the change data block and the target change data block; for target or ground truth change mask, see Figs. 1C and 1D; a fused change mask is illustrated in Fig. 5B; change mask image and target change mask images are preferably 2D images).
  • the change data block is constituted by a change mask image, and target change mask images are used as target change data blocks.
  • the presence or absence of a change is denoted in every pixel.
  • such a mask image may be applied in which ‘1’ denotes a change and ‘0’ denotes the other regions where there is no change (e.g. where a small number of changes are identified in an image, there is a small number of ‘1’ values on the image and there are ‘0’ values in other regions).
  • range images and mask images are utilized.
  • all of the data “circulated” during the training process is represented by 2D images: the generator module receives 2D range images from which it generates 2D mask images.
  • the generator loss contribution can be calculated based on the mask images.
  • the discriminator module also handles 2D images in this case (also the target is a mask image). This choice is further detailed below and it can be utilized advantageously in a highly effective way.
  • a pair of change images are applied (i.e. for both “directions”, see below), and, accordingly, also a pair of target change images is utilized for them in course of the training method.
  • the utilization of change data block pairs (i.e. that we have pairs) can be applied independently from the range image and mask image approach.
  • Lidar devices such as the Rotating multi-beam (RMB) sensors manufactured by Velodyne and Ouster
  • both input point clouds may contain various dynamic or static objects, which are not present in the other measurement sample.
  • the range image representation is preferably applied in the framework of the invention; see the following.
  • Our proposed solution preferably extracts changes between two coarsely registered Lidar point clouds in the range image domain.
  • creating a range image from a rotating multi-beam (RMB) Lidar sensor's point stream is straightforward (see C. Benedek, “3d people surveillance on range data sequences of a rotating lidar,” Pattern Recognition Letters, vol. 50, pp. 149-158, 2014, Depth Image Analysis), as its laser emitter and receiver sensors are vertically aligned, thus every measured point has a predefined vertical position in the image, while consecutive firings of the laser beams define their horizontal positions.
  • RMB rotating multi-beam
  • this mapping is equivalent to transforming the representation of the point cloud from the 3D Cartesian to a spherical polar coordinate system, where the polar direction and azimuth angles correspond to the horizontal and vertical pixel coordinates, and the distance is encoded in the corresponding pixel's 'intensity' value.
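  • The mapping can be sketched as follows (a simplified NumPy illustration: for an RMB sensor the vertical pixel would come directly from the laser ring index, whereas here a generic elevation binning is used, and the image size, field of view and range bound are placeholder values, not the ones of the description).

```python
# Sketch of mapping a point cloud to a range image (assumption: NumPy; parameters are placeholders).
import numpy as np

def point_cloud_to_range_image(points, width=1024, height=64, max_range=40.0):
    """points: (N, 3) array of x, y, z coordinates in the sensor frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    distance = np.sqrt(x**2 + y**2 + z**2)

    azimuth = np.arctan2(y, x)                                   # horizontal angle in [-pi, pi]
    elevation = np.arcsin(z / np.maximum(distance, 1e-6))        # vertical angle

    # Horizontal pixel from azimuth; vertical pixel from elevation (linear binning over an
    # assumed vertical field of view of [-25 deg, +3 deg]).
    col = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    fov_down, fov_up = np.radians(-25.0), np.radians(3.0)
    row = np.clip(((fov_up - elevation) / (fov_up - fov_down) * height).astype(int), 0, height - 1)

    image = np.zeros((height, width), dtype=np.float32)
    image[row, col] = distance / max_range                       # distance encoded as 'intensity'
    return image
```
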
  • range image mapping can also be implemented for other (non-RMB) Lidar technologies, such as for example Lidar sensor manufactured by Livox.
  • with a suitable image resolution, the conversion of the point clouds to 2D range images is reversible, without causing information loss.
  • using the range images makes it also possible to adopt 2D convolution operations by the used neural network architectures.
  • the proposed deep learning approach in an embodiment takes as input two coarsely registered 3D point clouds P1 and P2 represented by range images I1 and I2, respectively (shown in Figs. 1A and 1B) to identify changes.
  • Our architecture assumes that the images I1 and I2 are defined over the same pixel lattice S, and have the same spatial height (h), width (w) dimensions (i.e. the two input range images have the same resolution, the two images have the same number of horizontal and vertical pixels).
  • Figs. 1A-1D illustrate input data representation for the training method according to the invention.
  • Figs. 1A-1B show exemplary range images I1, I2 (being the realisation of first and second 3D information data blocks 100a and 100b in this embodiment) from a pair of coarsely registered point clouds P1 and P2.
  • Figs. 1C-1D show binary ground truth change masks Λ1, Λ2 (being the realisation of first and second target change data blocks 120a and 120b in this embodiment) for the range images I1 and I2, respectively (see also below).
  • a rectangle 109 marks the region also displayed in Figs. 6A-6H (see also Figs. 5A-5D).
  • Figs. 1A and 1B show range images for an exemplary scene. These range images are obtained from point clouds by projection.
  • the range images represent range (i.e. depth) information: in the images of Figs. 1A-1B the darker shades correspond to farther parts of the illustrated scene, lighter shades correspond to closer parts thereof, and those parts from which no information has been obtained are represented by black. Herebelow, some details are given about the content shown in Figs. 1A-1B for deeper understanding.
  • a bus 101a is marked in Fig. 1A which becomes the bus 101b in Fig. 1B (the bus has moved to another position and possibly turned, therefore it is seen shorter in Fig. 1B);
  • a car 108b can be seen in another position, which can be another car or the same car as the car 102a;
  • a car 103a can be seen in Fig. 1A which does not move (i.e. it is the same as 103b in Fig. 1B), but a car 104b appears on its right in Fig. 1B;
  • Figs. 1C and 1D show target change data blocks 120a and 120b (i.e. target change images) corresponding to data blocks 100a and 100b, thus the content of Figs. 1C-1D can be interpreted based on the above content information.
  • Fig. 1C is coloured (grey, not black; black is the background in Figs. 1C-1D) at those parts where a change is present;
  • the target change data block 120a shows the back of the bus 101a, i.e. a part 111a (this was contained in Fig. 1A but not in Fig. 1B); however, the target change data block 120b also marks this place as a part 111b, since - because of the movement of the bus 101a - a car becomes visible (this was occluded by the bus 101a in Fig. 1A) and the background (e.g. a wall) behind the car;
  • a part 112a corresponding to the car 102a of Fig. 1A is shown in the target change data block 120a, as well as a background-like part 112b can be seen in target change data block 120b in the place from which the car 102a has moved (a coloured/grey part can also be seen in Fig. 1D where the car 108b is visible in Fig. 1B);
  • a background-like part 114a can be seen in the target change data block 120a since the car 104b appears in Fig. 1B, as well as a car-like part 117a can be seen in the target change data block 120a but a background-like part 117b can be seen in the target change data block 120b.
  • our change detection task can be reformulated in the following way: our network extracts similar features from the range images I1 and I2, then it searches for the high correlation between the features, and finally, it maps the correlated features to two binary change mask channels Λ1 and Λ2, having the same size as the input range images.
  • a new generative adversarial neural network (in particular generative adversarial neural network-like - abbreviated, GAN-like - architecture, more specifically a discriminative method, with an additional adversarial discriminator as a regularizer), called ChangeGAN, whose architecture (structure) is shown in Fig. 2 in an embodiment.
  • Fig. 2 thus shows a proposed ChangeGAN architecture, wherein the notations of components: SB1, SB2 - Siamese branches, DS - downsampling, STN - spatial transformation/transformer network, Conv2D - 2D convolution; Conv2DT - transposed 2D convolution.
  • by GAN-like we also mean the general characteristic of a GAN that its two competing networks (generator, discriminator) learn simultaneously during training, but at the same time, the generator can be considered as a result of the training procedure, since the training aims at creating a generator with appropriate characteristics, which is capable of generating an output with the desired characteristics.
  • first and second 3D information data blocks 150a and 150b are for example images with 128x1024x1 dimensions and forming a coarsely registered pair 150
  • for the Siamese style, see J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Sackinger, and R. Shah, “Signature verification using a ”Siamese” time delay neural network,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, p. 25, 08 1993.
  • the Siamese architecture is designed to share the weight parameters across multiple branches allowing us to extract similar features from the inputs and to decrease the memory usage and training time.
  • each branch 162a and 162b of the Siamese network comprises (in an example consists of) fully convolutional down-sampling (DS) blocks (i.e. DS1-DSn downsampling subunits 164a and 164b; the same machine learning units - i.e. branches 162a and 162b - are applied on the two inputs).
  • the two branches 162a and 162b constitute a downsampling unit together.
  • the first layer of the DS block is preferably a 2D convolutional layer with a stride of 2 which has a 2-factor down-sampling effect along the spatial dimensions.
  • the processing illustrated in Fig. 2 is a U-net-like processing, where the two “vertical lines” of the letter ‘U’ are the downsampling (it is a “double” vertical line because of the two branches) and the upsampling (upsampling unit 170, see also below). Furthermore, the “horizontal line” of the letter ‘U’ - connecting the two “vertical lines” thereof as the letter is drawn - is illustrated by the conv2D unit 168. It is optional to have anything (any modules) in the “horizontal line”, but there may also be more convolution units (i.e. if any, one or more convolution units are arranged after merging but before upsampling).
  • the second part (see upsampling unit 170) of the proposed model contains a series of transposed convolutional layers (see Conv2DT1-Conv2DTn upsampling subunits 172) to up-sample the signal from the lower-dimensional feature space to the original size of the 2D input images.
  • Connections 167 interconnect the respective levels of downsampling and upsampling having the same resolution. Accordingly, in this U-net-like construction the Conv2DT1-Conv2DTn upsampling subunits 172 for upsampling on the one hand receive input from a lower level (first from the conv2D unit 168) and - as another input - the output of the respective downsampling level. Thus, the output of the Conv2DT1-Conv2DTn upsampling subunits 172 is obtained using these two inputs.
  • first and second change data block 175a and 175b are for example images having 128x1024x1 dimensions.
  • the change maps are in general considered to be the output of the upsampling unit 170, in which the upsampling is performed.
  • a change detection generator module 160 is denoted in Fig. 2.
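  • A minimal sketch of such a generator is given below (assumption: PyTorch; the number of levels, channel widths and activation choices are illustrative and not the exact architecture of Fig. 2; the optional STN modules discussed below are omitted).

```python
# Sketch of a Siamese, U-Net-like change detection generator (assumption: PyTorch; illustrative only).
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Fully convolutional down-sampling block; the stride-2 convolution halves the spatial size."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True))

    def forward(self, x):
        return self.block(x)

class UpBlock(nn.Module):
    """Transposed-convolution up-sampling block that also receives a skip connection."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.merge = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        return self.merge(torch.cat([self.up(x), skip], dim=1))

class ChangeGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.ds1, self.ds2, self.ds3 = DownBlock(1, 16), DownBlock(16, 32), DownBlock(32, 64)
        # Convolution after merging the two branches (the "horizontal line" of the U).
        self.bottleneck = nn.Sequential(nn.Conv2d(128, 64, kernel_size=3, padding=1),
                                        nn.ReLU(inplace=True))
        self.up1 = UpBlock(64, skip_ch=64, out_ch=32)
        self.up2 = UpBlock(32, skip_ch=32, out_ch=16)
        self.up3 = UpBlock(16, skip_ch=2, out_ch=16)
        self.head = nn.Conv2d(16, 2, kernel_size=1)     # two change mask channels

    def branch(self, x):
        # The same (weight-shared) down-sampling blocks are applied to both inputs.
        f1 = self.ds1(x)
        f2 = self.ds2(f1)
        f3 = self.ds3(f2)
        return f1, f2, f3

    def forward(self, img1, img2):
        a1, a2, a3 = self.branch(img1)
        b1, b2, b3 = self.branch(img2)
        x = self.bottleneck(torch.cat([a3, b3], dim=1))
        x = self.up1(x, torch.cat([a2, b2], dim=1))      # skip connections from both branches
        x = self.up2(x, torch.cat([a1, b1], dim=1))
        x = self.up3(x, torch.cat([img1, img2], dim=1))
        return torch.sigmoid(self.head(x))               # pixel-wise change probabilities
```
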
  • it may be preferred to apply an STN module. Accordingly, in an embodiment of the training method a spatial transformer module (see spatial transformer modules 165a, 165b in the embodiment of Fig. 2, arranged in both branches of the change detection generator module illustrated there) is applied.
  • the STN module (configured according to the description of the STN article cited above) is built in between the DS modules (at the same level in the two branches, of course), because it can work well on the downsampled images, helping to handle possible transformations in the inputs. So, in upscaling there is no STN module arranged, but it learns end-to-end to give good change data blocks (change images) for the inputs relative to the corresponding targets.
  • the STN module is thus preferably a part of the change detection generator module, i.e. it is trained in the framework of the end-to-end training. Accordingly, the upsampling unit and the interconnections between the upsampling and downsampling units help to integrate the STN module.
  • the STN module as disclosed above, is therefore designed to help handle the translations and rotations that may be present in the coarsely registered pair. It is also important to emphasize that it does this while learning, together with the other modules, to produce the right change data blocks (showing such possible translation/rotation) for the right inputs. By this we mean helping to process the translation and/or rotation (and meanwhile the whole generator module does not eliminate them, but this registration error is also present in the change data blocks).
  • the STN module is of course also part (inseparable part) of the trained (completed) generator.
  • the STN module that is included in the generator as above only helps the operation of this generator, i.e. it is preferably more efficiently learned if included, but arranging of the STN module is not necessarily required.
  • a downsampling unit 162 (see Fig. 2; in the embodiment of Fig. 2 the downsampling unit has (directly) the inputs, i.e. the coarsely registered pair) having a first row of downsampling subunits (see downsampling subunits 164a, 164b) and interconnected with an upsampling unit 170 (in the embodiment of Fig. 2 a merge unit 166 - because of the two branches 162a, 162b - and a conv2D unit 168 are inserted into the interconnection of these; furthermore, in the embodiment of Fig. 2 the change data blocks are obtained directly as outputs of the upsampling unit) having a second row of upsampling subunits (see upsampling subunits 172) and corresponding to the downsampling unit, is applied, wherein the downsampling unit and the upsampling unit are comprised in the change detection generator module and the spatial transformer module is arranged in the downsampling unit within the first row of downsampling subunits.
  • STN module is optionally used for helping the change detection generator module in its operation, but it can operate (perform its task) also without arranging STN module.
  • STN preferably works with 2D images. So does preferably the invention, since 3D point clouds can be preferably represented as 2D range images (however, the application of both for 3D inputs can be straightforwardly solved).
  • the position of the STN module within the feature extraction branch is preferably "in the middle" (among downsampling subunits) because it is already preferably looking for transformations among more abstract features, not on raw high-resolution data.
  • a clear difference between the proposed change detection method (i.e. the training method and the corresponding change detection system) and the state-of-the-art is the adversarial training strategy which has a regularization effect, especially on limited data.
  • the other main difference is the preferably built-in spatial transformer network (in the training method) yielding the proposed model to be able to learn and handle coarse registration errors.
  • the building blocks of the training method and thus the corresponding change detection system are combined so that a highly advantageous synergetic effect is achieved, e.g. by the adversarial training strategy itself, and even more with the application of STN module.
  • the model can automatically handle errors of coarse registration (i.e. any translation and/or rotation in the coarsely registered pair).
  • the generator network is responsible for learning and predicting the changes between the range image pairs.
  • the generator model is trained on a batch of data.
  • the number of epochs is a hyperparameter that defines the number times that the training (learning) algorithm will work through the entire training dataset.
  • the actual state of the generator is used to predict validation data which is fed to the discriminator model.
  • the discriminator network is preferably a fully convolutional network that classifies the output of the generator network.
  • the discriminator model preferably divides the image into patches and decides for each patch whether the predicted change region is real or fake. During training, the discriminator network forces the generator model to create better and better change predictions, until the discriminator cannot decide about the genuineness of the prediction.
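  • A patch-based, fully convolutional discriminator of this kind could be sketched as follows (assumption: PyTorch; the input is the concatenation of the two range images and one change mask, and the layer configuration is illustrative, not the exact network of the description).

```python
# Sketch of a patch-based, fully convolutional discriminator (assumption: PyTorch; illustrative only).
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=3):   # range image 1 + range image 2 + change mask
        super().__init__()
        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True))
        self.model = nn.Sequential(
            block(in_channels, 32),
            block(32, 64),
            block(64, 128),
            nn.Conv2d(128, 1, kernel_size=4, padding=1))   # one logit per image patch

    def forward(self, img1, img2, change_mask):
        # Each spatial position of the output judges one patch of the input as
        # "real" (ground-truth change) or "fake" (generated change).
        return self.model(torch.cat([img1, img2, change_mask], dim=1))
```
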
  • Fig. 3A demonstrates the proposed adversarial training strategy in an embodiment, i.e. of the ChangeGAN architecture.
  • L1 Loss: more generally, generator loss contribution 225 in Fig. 3A
  • LGAN (GAN Loss): it may also be called adversarial loss; more generally, discriminator loss contribution 230 in Fig. 3A
  • Fig. 3A furthermore illustrates in the schematic flowchart the following as a part of an embodiment of the training method.
  • a first 3D information data block 200a and a second 3D information data block 200b is denoted by respective blocks labelled by Inputi and Input2.
  • the first 3D information data block 200a and the second 3D information data block 200b are the inputs of a change detection generator module 210 (labelled by “Generator network”, since it is realized by a neural network module in this embodiment). Accordingly, at the output of the generator it is checked how well the change was generated compared to the target.
  • the change detection generator module 210 has a change data block 215 (labelled by “Generated img (image)”; this is preferably an illustration of a change data block pair) as an output.
  • the change data block 215 as well as the target change data block 205 is processed to obtain the L1 loss 225.
  • the change data block 215 and the target change data block 205 are given also to the discriminator module 220 (labelled by “Discriminator network”, since it is realized by a neural network module in this embodiment), just like the first 3D information data block 200a and the second 3D information data block 200b. All of these inputs are processed by the discriminator module 220 (see also below) so as to obtain the combined loss 235. As it is also illustrated in Fig. 3A the combined loss 235 is fed back to the change detection generator module 210 preferably as a gradient update.
  • the discriminator loss contribution is generated based on coarsely registered pairs, as well as corresponding target change data blocks and change data blocks.
  • the change data block just generated by the generator module has a special role, since the other inputs of the discriminator module are parts of the ground truth. According to its general role, the discriminator makes a “real or fake” decision on the change/target change data blocks (images).
  • the generator and the discriminator “compete” with each other.
  • the goal of the generator is to generate better and better results for which the discriminator can be “persuaded” that it is not a generated result (“fake”) but a “real” one.
  • the discriminator learns to recognize the generated images better and better during the learning process.
  • generator generates high- quality change data blocks (images).
  • the training (learning) process of the discriminator is performed in the present case as follows. From the point of view of the discriminator a target change data block is “real” and a change data block (generated by the generator) is “fake”.
  • the coarsely registered pair (e.g. the input range images; this can also be made operable if the inputs are point clouds: the modules can also be constructed so as to handle this type of input)
  • the discriminator preferably has separated inputs for a target change image and a coarsely registered pair, and for a (generated) change image and the coarsely registered pair (the same as given along with the target image; the target and generated change images correspond also to each other; these corresponding inputs constitute an input set for the discriminator).
  • the discriminator will judge about both inputs. Accordingly, the discriminator decides about the target or generated change image having knowledge about the content of the 3D information image pair.
  • For these inputs, the discriminator generates outputs, preferably on a pixel-to-pixel basis, about whether, according to the judgement of the discriminator, an image part is real or fake. For the whole image, the discriminator will thus give a map illustrating the distribution of its decision.
  • a discriminator loss is generated based on the outputs (corresponding to the two inputs mentioned above) of the discriminator, which is used to train the discriminator itself as well as - as a part of the combined loss mentioned above - also the generator.
  • both the generator and the discriminator are trained continuously, but after training only the generator will be used (as mentioned above, for generating change images for an input pair) and the trained discriminator is not utilized.
  • the discriminator loss is preferably calculated as follows. For obtaining the contribution of an output pair for the discriminator loss, it is known what type of input has been given to the discriminator.
  • the good result is when the image is decided to be “fake”; this is e.g. denoted by a ‘0’ (and a “real” pixel is denoted by a ‘1’; the discriminator preferably makes a binary decision for each pixel, this is a kind of classification task) in the pixels of the respective output of the discriminator.
  • the discriminator is preferably able to judge on a pixel-to-pixel basis, i.e. it can give a result for each pixel.
  • the discriminator preferably gives a number from the range [0,1], in which case it can also give a probability for the judgement for each pixel (the output in the [0,1] range can be cut at 0.5, and thus binary results can be reached, sorting every result under 0.5 to 0 and every result equal to or larger than 0.5 to 1).
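  • For instance (a trivial NumPy sketch, assuming the discriminator output is given as an array of values in [0,1]):

```python
# Turning the per-pixel [0, 1] outputs into binary decisions (assumption: NumPy array).
import numpy as np

def binarize(discriminator_output, threshold=0.5):
    # Values below 0.5 are sorted to 0 ("fake"), values equal to or above 0.5 to 1 ("real").
    return (discriminator_output >= threshold).astype(np.uint8)
```
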
  • the case is typically not ideal during the learning process, therefore, the result coming from the discriminator will be diversified.
  • for this, a correlation-like function (e.g. sigmoid cross entropy) is preferably applied.
  • the discriminator loss thus guides the generator to generate images closer to the ground truth, because it will only accept them as 'real' if it gets the ground truth itself, so if an incoming generated change image resembles the ground truth in as many details as possible, we are doing well.
  • Based on the spatial structure presented by the inputs bearing 3D information, the discriminator examines the imperfect result of the generator and can judge - not only at the mask level but also considering the 3D information of the input pair - what level of learning the generator is at, and can give an answer at the pixel level (relevance plays a big role here), where it is good and where it is not (it can accept roughly good results on irrelevant parts, but expects more on a relevant part to judge that part positively).
  • the discriminator loss is built up from such loss contributions (generated using the target change data block as well as the (generated) change data block, i.e. the discriminator loss receives two contributions according to the two outputs of the discriminator for each input set thereof) and after an epoch, the discriminator will be trained by this discriminator loss, as well as it is taken to the combined loss to train also the generator.
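  • A minimal sketch of such a discriminator-side loss follows (assumption: PyTorch, using sigmoid cross entropy; the names are illustrative and refer to the hypothetical modules sketched earlier).

```python
# Sketch of the discriminator-side loss (assumption: PyTorch; sigmoid cross entropy is used).
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, img1, img2, target_change, generated_change):
    # Output for the ground-truth target change data block: should be judged "real" (1).
    real_out = discriminator(img1, img2, target_change)
    # Output for the generated change data block: should be judged "fake" (0).
    fake_out = discriminator(img1, img2, generated_change.detach())
    real_loss = F.binary_cross_entropy_with_logits(real_out, torch.ones_like(real_out))
    fake_loss = F.binary_cross_entropy_with_logits(fake_out, torch.zeros_like(fake_out))
    # Two contributions, one for each output of the discriminator for an input set.
    return real_loss + fake_loss
```
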
  • a discriminator loss contribution is preferably generated by applying the discriminator module on a plurality of corresponding sets of a coarsely registered pair, a target change data block and a change data block.
  • both the generator and the discriminator part of the GAN architecture were optimized by the Adam optimizer (Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization, arXiv:1412.6980v9) and the learning rate was set to 10⁻⁵ (the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function; Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press, p. 247).
  • in this test we have trained the model for 300 epochs, which takes almost two days. At each training epoch, we have updated the weights of both the generator and the discriminator.
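  • As an illustration of this setup (assumption: PyTorch Adam, with the generator and discriminator classes from the earlier sketches in scope; only the learning rate follows the value given here):

```python
# Illustrative optimizer configuration for both competing networks (assumption: PyTorch).
import torch

generator = ChangeGenerator()            # hypothetical instances from the earlier sketches
discriminator = PatchDiscriminator()

g_optimizer = torch.optim.Adam(generator.parameters(), lr=1e-5)
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-5)
```
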
  • the change detection dataset is described (training set generated by the training set generating method according to the invention). This is in connection with the training set generating method to which some of the embodiments of the invention relate.
  • although the training set generating method is on the same generality level as the training method (it also relates to coarsely registered pairs, and the training set generating method is for generating a plurality of coarsely registered pairs of a first 3D information data block and a second 3D information data block and a plurality of respective target change data blocks), it is mainly described in the following illustrated on the example of point clouds as 3D information data blocks (it is summarized also on the most general level).
  • the annotation should accurately mark the point cloud regions of objects or scene segments that appear only in the first frame, only in the second frame, or which ones are unchanged thus observable in both frames (see Figs. 4A-4E and 6A-6H).
  • a high-resolution 3D voxel map was built on a given pair of point clouds.
  • the voxel size defines the resolution of the change annotation.
  • the length of the change annotation cube (voxel) was set to 0.1 m in all three dimensions. All voxels were marked as changed if 90% of the 3D points in the given voxel belonged to only one of the point clouds. Thereafter minor observable errors were manually eliminated by a user-friendly point cloud annotation tool.
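  • A simplified sketch of this voxel-based labelling (assumption: NumPy; the clouds are given as (N, 3) arrays in the same, registered coordinate system; the manual clean-up step is not shown):

```python
# Sketch of the semi-automatic voxel-based change annotation (assumption: NumPy; illustrative only).
import numpy as np

def voxel_change_labels(cloud1, cloud2, voxel_size=0.1, ratio=0.9):
    """Mark a voxel as changed if at least `ratio` of its points belong to only one cloud."""
    points = np.vstack([cloud1, cloud2])
    source = np.concatenate([np.zeros(len(cloud1)), np.ones(len(cloud2))])

    keys = np.floor(points / voxel_size).astype(np.int64)        # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)    # group points by voxel

    changed_voxels = np.zeros(inverse.max() + 1, dtype=bool)
    for v in range(inverse.max() + 1):
        src = source[inverse == v]
        share = max((src == 0).mean(), (src == 1).mean())
        changed_voxels[v] = share >= ratio                       # dominated by a single cloud

    return changed_voxels[inverse]   # per-point change label (to be refined manually)
```
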
  • Range image creation and change map projection: the transformed 3D point clouds were projected to 2D range images I1 and I2 as described in connection with the range image representation above (see Figs. 1A-1B).
  • the Lidar's horizontal 360° field of view was mapped to 1024 pixels and the 5m vertical height of the cropped point cloud was mapped to 128 pixels, yielding that the size of the produced range image is 1024 x 128.
  • our measurements were recorded at 20 Hz, where the angular resolution is around 0.3456°, which means we get 1042 points per channel per revolution.
  • the dimension of the training data is a power of two. So we removed 18 points with equal step size from each channel. Since the removed points are in fixed positions, we know the exact mapping between the 2D and 3D domain.
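  • For example, the fixed, (approximately) equal-step selection of the dropped columns could look like this (NumPy sketch; the exact positions used in the experiments are not specified here, so the indices below are only illustrative):

```python
# Sketch of dropping 18 of the 1042 points per channel and revolution to reach a width of 1024.
import numpy as np

points_per_revolution = 1042
target_width = 1024

# 18 fixed, (approximately) evenly spaced column indices to remove.
drop = np.linspace(0, points_per_revolution - 1,
                   points_per_revolution - target_width, dtype=int)
keep = np.setdiff1d(np.arange(points_per_revolution), drop)
assert keep.size == target_width     # the kept columns form the 1024-wide image
```
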
  • the Lidar sensor used in this experiment has 64 emitters yielding that the height of the original range images should be 64.
  • the 2D convolutional layers with a stride of 2 have a 2-factor down-sampling effect.
  • the horizons of the range images are at similar positions in the two inputs due to the cropped height of the input point clouds.
  • the ground truth labels GT(p) of the points p were also projected to the Λ1^GT and Λ2^GT change masks, used for reference during training and evaluation of the proposed network.
  • the change labelling is performed for registered point cloud pairs (generally, registered 3D information data block pairs, which are called registered base pairs) captured from the same sensor position and orientation at different times (e.g. captured from a car standing at the same place, that is why the point clouds will be registered). Since production of the GT starts from registered point cloud pairs (in general, 3D information data block pairs), it is naturally known where the changes are in the images.
  • the change annotation is performed on registered point clouds a.
  • the frame pairs are taken in the same global coordinate system, so they can be considered as registered.
  • their ground truth (GT) change annotation can be efficiently created in a semi-automatic way: i.
  • a high-resolution 3D voxel map is built on a given pair of point clouds (e.g. voxels with 10 cm edges for a general scene, and with these preferably cubic, matching voxels, we cover the scene so that all points of the point cloud are contained within a voxel and a plurality of points may be contained in a voxel).
  • the reference positions and orientations of e.g. the second frames are randomly transformed (any of the frames may be transformed or both of them) yielding a large set of accurately labelled coarsely registered point cloud pairs (the pairs were registered so far, the transformation is performed so as to achieve coarsely registered pairs).
  • such a random transformation has been applied for the second frame (P2) of each point cloud pair both in the training and test datasets.
  • the GT labels remained attached to the p ∈ P2 points and were transformed together with them (i.e. the annotation advantageously remains valid also after applying the transformation).
  • cloud crop and normalization steps are also performed: a. In the next step, all 3D points whose horizontal distances from the sensor were larger than 40 m, or whose elevation values were greater than 5 m above the ground level, were removed from the point clouds. b. This step yielded the capability of normalizing the point distances from the sensor between 0 and 1 (these values are preferred for e.g. neural networks).
  • range image creation and change map projection is performed: a. The transformed 3D point clouds were projected to 2D range images I1 and I2.
  • the generated GT set has been preferably divided into disjunct training and test sets which could be used to train and quantitatively evaluate the proposed method.
  • the remaining parts of the collected data, including originally unregistered point cloud pairs, have been used for qualitative analysis through visual validation. These are based on real measurements taken e.g. in a city. We paired them afterwards based on nearby GPS position measurement data, so, unlike those recorded as a pair, they cannot be considered as registered pairs. We could only check this visually, and the trained system performed very well on these.
  • some embodiments of the invention relate to a training set (mentioned also as dataset and it may also be called training database) generating method for generating a plurality of coarsely registered pairs of a first 3D information data block and a second 3D information data block and a plurality of respective target change data blocks (in this case preferably point clouds) of a training set for applying in any embodiments of the training method according to the invention.
  • Fig. 3B shows a flowchart of the main steps of an embodiment of the training set generating method according to the invention.
  • a plurality of registered base pairs (see registered base pairs 300 in Fig. 3B) having (alternatively, of) a first base 3D information data block and a second base 3D information data block is generated or provided (it may be - readily - available e.g. from some database, but these can be generated also as given in the example detailed above; as a result of these options, we can start this method with a plurality of registered base pairs),
  • - change annotation is performed in a change annotation step (see operational step S310 in Fig. 3B) on the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs,
  • in a transformation step (see operational step S320 in Fig. 3B), by transforming at least one of the first base 3D information data block and the second base 3D information data block (as a first option, only one of them is transformed, but even both of them may be transformed, and it is irrelevant which one) of each of the plurality of registered base pairs, a first resultant 3D information data block and a second resultant 3D information data block are generated for each of the plurality of registered base pairs (resultant 3D information data blocks may have another name, e.g. intermediate 3D information data blocks), and
  • in a training data generation step (see operational step S330 in Fig. 3B) the plurality of coarsely registered pairs and the plurality of respective target change data blocks of the training set are generated based on the respective first resultant 3D information data block and second resultant 3D information data block.
  • one or more artificial changes are applied before the change annotation step, by addition or deletion, to any of the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs.
  • Artificial changes are applied on the base 3D information data blocks (e.g. point clouds), so these modified data blocks are forwarded to the change annotation step afterwards.
  • the plurality of registered base pairs of a first base 3D information data block and a second base 3D information data block are constituted by a (respective) plurality of registered base point cloud pairs of a first base point cloud having a plurality of first points and (a respective plurality of) a second base point cloud having a plurality of second points (i.e. in this embodiment point clouds are utilized), and in the change annotation step
  • a 3D voxel grid having a plurality of voxels is applied on the first base point cloud and the second base point cloud (i.e. the 3D voxel grid is applied to the union of the two point clouds, more specifically to the space part in which the first and second point clouds are arranged (situated)),
  • a first target change data block and a second target change data block are generated based on assigned change for first points and second points, respectively (since a change label is assigned to all points where it is judged that there is a change, a target change data block can be easily produced based on the change assignment information).
  • the predetermined first ratio limit and the predetermined second ratio limit are both 0.9 (both of them are set to this value); a minimal code sketch of this voxel-based labelling is given after this list.
  • a common voxel grid (voxel map) is built (assigned, generated) for the two point clouds.
  • in the step of change assignment it is specified how it is judged that all points of a voxel get a change label (it is not specified when they receive a non-change label, since the points of a voxel get the non-change label if the change label has not been assigned to them).
  • the points of those voxels get the change label in which there are too many points which have no correspondent in the respective voxel of the other point cloud (since this is analysed before the transformation is done, the corresponding points in the point cloud pair can be easily counted, and no correspondence is found at those positions into which a change has been induced by dynamic changes or by hand, see points 1-3 of the list above).
  • voxel size: e.g. 0.1 m for a general scene, as specified above.
  • a scene will be partitioned correctly, i.e. as one moves from a non-change area to a change area, a change is not indicated early, because the points contributed to the voxel by the two point clouds will largely match.
  • when it arrives at a change area, it derives the change areas with a resolution comparable to the voxel size; not point by point, but much more efficiently it determines (with a precision that is well within the accuracy of the result, given that for a street scene a voxel size of e.g. 0.1 m is preferably chosen) which volume parts contain change and which do not.
  • the threshold is preferably set to 90% (this corresponds to 0.9 for the ratio limit). This way, the voxels will have the right resolution at the transition to the change areas, and the volume parts affected by the real change can be well identified.
  • the ratio limit is therefore preferably 0.9, i.e. 90%.
  • if the ratio reaches the limit, all points in the voxel are classified as belonging to change, while if it is below it, all points in the voxel are classified as not belonging to change.
  • an up to ±1 m translation, preferably in a plane perpendicular to the z-axis.
  • an up to ±10° rotation transform, preferably around the z-axis; see also below in connection with the z-axis.
  • Figs. 4A-4E show the changes predicted by the proposed ChangeGAN model, i.e. results obtained by (an embodiment of) the trained change detection system according to the invention (there are also such results among Figs. 5A-5D and Figs. 6A-6H, which were obtained by this).
  • Figs. 5A-5D show (predicted) change masks by the different methods on input data shown in Figs. 1A-1B. More specifically, Fig. 5A shows ground truth fused change map A GT (this one is not a predicted mask), Fig. 5B shows ChangeGAN output’s fused change map L, Fig. 5C shows ChangeNet output, and Fig. 5D shows MRF output. Rectangles 109 correspond to the region shown in Figs. 6A-6H.
  • fused change maps i.e. fused change images
  • Figs. 1A and 1B show relative masks, i.e. the change in view of another image. It is disclosed in connection with them how their content is determined.
  • the fused change maps show every change, i.e. not only the relative changes: if there is a change in any of the images of the image pair, there is a change shown in the fused image.
  • changes can be represented by masks.
  • in a change image of a pair, a change can be denoted by ‘1’ in a pixel; if there is no change, there is ‘0’ in the pixel.
  • in the other change image of the pair, a change is denoted by ‘2’ in the respective pixel (no change remains ‘0’).
  • in the fused change image, change is denoted in all of those pixels which contain ‘1’ in the first change image or ‘2’ in the second change image.
  • MRF method: the MRF-based reference approach, see B. Galai and C. Benedek, “Change detection in urban streets by a real time Lidar scanner and MLS reference data,” in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.
  • ChangeNet method: see A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, “Changenet: A deep learning architecture for visual change detection,” in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145.
  • Table 1 is a performance comparison of these methods.
  • the ChangeGAN method outperforms both reference methods in terms of these performance factors, including the F1-score and IoU values.
  • the MRF method (B. Galai and C. Benedek, “Change detection in urban streets by a real time Lidar scanner and MLS reference data,” in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.) is largely confused if the registration errors between the compared point clouds are significantly greater than the used voxel size. Such situations result in large numbers of falsely detected change-pixels, which on average yields a very low precision result (0.44), although, due to several accidental matches, the recall rate might be relatively high (0.88) (see the definitions of precision and recall above).
  • the measured low computational cost is a second strength of the proposed ChangeGAN approach, especially versus the MRF model, whose execution time is longer by one order of magnitude.
  • although ChangeNet is even faster than ChangeGAN, its performance is significantly weaker compared to the other two methods.
  • the adversarial training strategy has a regularization effect (see P. Luc, C. Couprie, S. Chintala, and J. Verbeek, “Semantic segmentation using adversarial networks,” in NIPS 2016 Workshop on Adversarial Training, Dec 2016, Barcelona, Spain), and the STN layer can handle coarse registration errors.
  • the proposed ChangeGAN model can achieve better generalization ability and it outperforms the reference models on the independent test set. Note that in each case of Table 1 the running speed was measured in seconds on a PC with an i7-8700K CPU @3.7GHz x12, 32GB RAM, and a GeForce GTX 1080Ti.
  • in Figs. 4A-4E, changes detected by ChangeGAN for a coarsely registered point cloud pair are illustrated.
  • Figs. 4A and 4B show the two input point clouds (generally, first and second 3D information data blocks 240a and 240b), while Fig. 4C displays the coarsely registered input point clouds in a common coordinate system in a 3D information data block 240.
  • Fig. 4C simply illustrates the two inputs in a common figure.
  • Fig. 4A is the darker image and Fig. 4B is the lighter image.
  • Figs. 4A-4B are unified in Fig. 4C, where the darker and lighter points of Figs. 4A-4B are observable.
  • Figs. 4D and 4E present the change detection results (i.e. a common change data block 245): originally blue and green coloured points (in greyscale: darker and lighter) represent the objects marked as changes in the first and second point clouds, respectively.
  • the above mentioned cars on the left and the bus on the right of Fig. 4A are shown with a darker colour in the change data block 245, as well as the tram and the car of Fig. 4B are shown with a lighter colour therein.
  • Fig. 4E shows the change data block 245 from above, wherein ellipse 246 draws attention to the global alignment difference between the two coarsely registered point clouds.
  • Figs. 4A-4E contain a busy road scenario, where different moving vehicles appear in the two point clouds.
  • moving objects both from the first (originally blue colour - i.e. darker colour in the vehicles - in the change data block 245) and second (originally green, i.e. lighter colour in the vehicles) frames (i.e. Figs. 4A-4B, respectively), are accurately detected despite the large global registration errors between the point clouds (highlighted by the ellipse 246 in Fig. 4E).
  • a change caused by a moving object in a given frame also implies a changed area in the other frame in its shadow region, which does not contain reflections due to occlusion (cf. the shadow regions of Figs.
  • in Fig. 4C the two inputs (Figs. 4A-4B) are simply superimposed; it is not identified what is the same and what is a change in them. It is noted that the most important function of the change data block (map) is to mark the changes.
  • Figs. 4D-4E contain the registration error, which can be observed more clearly in the top view of Fig. 4E. Although we illustrate these in one figure, we otherwise prefer to treat the two changes separately.
  • Figs. 6A-6H show comparative results of the ground truth and the predicted changes by ChangeGAN and the reference techniques for the region marked by rectangle 109 in Figs. 1A-1D and Figs. 5A-5D (Figs. 4A-4E illustrate a different scene).
  • Figs. 6A-6B show originally coloured and greyscale versions of the same content (this statement holds true also for the pairs of Figs. 6C-6D, 6E-6F, and 6G-6H), respectively, i.e. the ground truth change mask.
  • Figs. 6C-6D show the ChangeGAN predicted change
  • Figs. 6E-6F show the ChangeNet predicted change
  • Figs. 6G-6H show the MRF predicted change.
  • Figs. 6I and 6J (arranged for Figs. 6A-6D and Figs. 6E-6H, respectively) show shadow bars for showing correspondence between the colours of originally coloured and greyscale versions, and, by the help of showing correspondences, the colours of both versions can be interpreted when all of the figures are shown in greyscale or black and white.
  • on the left side of Figs. 6I and 6J, from the top to the bottom, black, grey, blue and green colours were originally shown.
  • the corresponding shades of grey can be seen.
  • Figs. 6A, 6C, 6E and 6G: originally green and blue points (see the shadow bars of Figs. 6I and 6J for interpreting the correspondence with the same content Figs.
  • in Figs. 6A, 6C, 6E and 6G, black shows the points of the first point cloud and grey shows the points of the second point cloud.
  • a first ellipse 300 and a second ellipse 302 throughout Figs. 6A-6H mark the detected front and back part of a bus travelling in the upper lane, meanwhile occluded by other cars (cf. Figs. 1A-1B where the movement of the bus can be observed and thus the change visualized in Figs. 6A-6H can be interpreted).
  • a first square 304 shows a building facade segment, which was occluded in P1 (cf. Fig. 1A).
  • the boxes 306 highlight false positive changes of the reference methods in Figs. 6E-6G confused by inaccurate registration (ChangeGAN in Figs. 6C-6D behaves very well in this region).
  • Figs. 6A-6H display another traffic situation (different from Figs. 4A-4E), where the output of the proposed ChangeGAN technique can be compared to the manually verified Ground Truth (Figs. 6A-6B) and to the two reference methods (Figs. 6E-6H) in the 3D point cloud domain.
  • both reference methods detected false changes in the bottom left corner of the image (in box 306: there are almost no changes in the ground truth and in the results of ChangeGAN in Figs. 6A-6D, but many changes - illustrated by originally blue and green points in Figs. 6E and 6G/many shades of grey in this region in Figs. 6F and 6H - are shown in this region; in greyscale it is advantageous to show the content of the box 306 with two types of colouring, since in the box 306 Figs. 6F and 6H show much more variability than Figs. 6E and 6G, see this in view of Fig. 6B below), which were caused by the inaccurate registration (please find more details above: these are false positive results in Figs.
  • Fig. 7 displays with hollow marks the average F1-scores as a function of various ti values.
  • the training method is specialized so that the changes in the coarsely registered 3D information data block pair are effectively detected, i.e. so as to achieve a change detection system of high efficiency.
  • our generative adversarial network (GAN) architecture preferably combines Siamese-style feature extraction, U-net-like use of multiscale features, and STN blocks for optimal transformation estimation.
  • the input point clouds - as typical inputs - are preferably represented by range images, which advantageously enables the use of 2D convolutional neural networks.
  • the result is preferably a pair of binary masks showing the change regions on each input range image, which can be backprojected to the input point clouds (i.e. to the change detection generator module) without loss of information.
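To make the semi-automatic, voxel-based ground truth labelling described in this list more concrete, the following is a minimal sketch in Python/numpy. It assumes that the two registered point clouds are given as N×3 arrays of metric coordinates; the 0.1 m voxel edge and the 0.9 ratio limit follow the values quoted above, while everything else (function name, array layout, return format) is an illustrative assumption rather than the exact procedure used here.

```python
import numpy as np
from collections import Counter

def annotate_changes(pc1, pc2, voxel=0.1, ratio_limit=0.9):
    """Label every point of two registered point clouds as changed (1) or
    unchanged (0) using a common voxel grid: all points of a voxel are marked
    as changed when at least `ratio_limit` (e.g. 90%) of the points falling
    into that voxel belong to only one of the two clouds."""
    origin = np.vstack([pc1, pc2]).min(axis=0)
    keys1 = [tuple(v) for v in np.floor((pc1 - origin) / voxel).astype(np.int64)]
    keys2 = [tuple(v) for v in np.floor((pc2 - origin) / voxel).astype(np.int64)]
    count1, count2 = Counter(keys1), Counter(keys2)

    def labels(own_keys, own_count, other_count):
        lab = np.zeros(len(own_keys), dtype=np.uint8)
        for i, k in enumerate(own_keys):
            n_own, n_other = own_count[k], other_count.get(k, 0)
            if n_own / (n_own + n_other) >= ratio_limit:
                lab[i] = 1   # voxel dominated by a single cloud -> change
        return lab

    return labels(keys1, count1, count2), labels(keys2, count2, count1)
```

Applying a random rigid transform to the second cloud after this labelling, as described above, then yields a coarsely registered pair whose labels remain valid.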

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention is a training method for training a change detection system for detecting change for a coarsely registered pair (200) of first and second 3D information data blocks (200a, 200b), in a training cycle - generating (S180) by change detection generator module (210) change data blocks (215) for the coarsely registered pairs (200) and generating (S185) generator loss (225) based on change data blocks (215) and target change data blocks (205), - generating (S190) discriminator loss (230) by discriminator module (220) on coarsely registered pairs (200), target change data blocks (205) and change data blocks (215), and - training (S195) the change detection generator module (210) by a combined loss (235) of summing generator and discriminator losses (225, 230) multiplying any of these by loss multiplicator (λ). The invention is furthermore a change detection system and a training set generating method for the training method.

Description

TRAINING METHOD FOR TRAINING A CHANGE DETECTION SYSTEM, TRAINING SET GENERATING METHOD THEREFOR, AND CHANGE DETECTION SYSTEM
TECHNICAL FIELD
The invention relates to a training method for training a change detection system, a training set generating method therefor, and a change detection system.
DESCRIPTION OF PRIOR ART
As a background: There is an opportunity to map and represent the 3D environment in the form of point clouds with the help of several different types of sensor technologies (examples: Lidar laser scanners, infrared scanners, stereo and multi view camera systems).
These point clouds, as opposed to traditional 2D photos and multispectral images, can be considered a proportionate 3D model of the environment, in which the relative position and size of each object can be determined on a scale identical to that of the world coordinate system. However, their disadvantage is that the set of points is only a representation of an otherwise continuous surface observable in the world, obtained by a discrete sampling. Furthermore, the sampling characteristics (point density, sampling curves) of different sensors (and sensor settings) might be significantly different.
It is also a difficulty that the accurate registration of two point clouds generated at the same location but at different times, using a moving/movable platform (i.e. the designation of a common reference point of choice and of the exact orientation, see below for more details of the meaning of registration), is not always possible/its accuracy is uncertain. The reason for this is, on the one hand, the inaccuracy of (IMU - Inertial Measurement Unit/GNSS - Global Navigation Satellite Systems) navigation devices or the lack of an accurate signal necessary for the navigation (e.g. GPS - Global Positioning System).
On the other hand, although there are algorithms that can be used to match (register) point clouds to each other (e.g. Iterative Closest Point, ICP algorithms), their accuracy and reliability may be weaker, especially if the point clouds to be compared are sparse and have irregular, inhomogeneous density characteristics.
Differences between point cloud sites (for example the appearance or disappearance of an article (object) or a moving object) become significantly more difficult to detect if accurate registration cannot be assumed between samples obtained as a function of time (time samples).
Due to the increasing population density, the rapid development of smart city applications and autonomous vehicle technologies, growing demand is emerging for automatic public infrastructure monitoring and surveillance applications. Detecting possibly dangerous situations caused by e.g. missing traffic signs, faded road signs and damaged street furniture is crucial. Expensive and time-consuming efforts are required therefore by city management authorities to continuously analyse and compare multi-temporal recordings from large areas to find relevant environmental changes.
From the perspective of machine perception, this task can be formulated as a change detection (CD) problem. In video surveillance applications (see C. Benedek, B. Galai, B. Nagy, and Z. Janko, “Lidar-based gait analysis and activity recognition in a 4d surveillance system,” IEEE Trans. Circuits Syst. Video Techn., vol. 28, no.1, pp. 101-113, 2018. and F. Oberti, L. Marcenaro, and C. S. Regazzoni, “Real-time change detection methods for video-surveillance systems with mobile camera,” in European Signal Processing Conference, 2002, pp. 1-4.) change detection is a standard approach for scene understanding by estimating the background regions and by comparing the incoming frames to this background model.
Change detection is also a common task in many remote sensing (RS) applications, which require the extraction of the differences between aerial images, point clouds, or other measurement modalities (see C. Benedek, X. Descombes, and J. Zerubia, “Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 34, no. 1, pp. 33-50, 2012. and S. Ji, Y. Shen, M. Lu, and Y. Zhang, “Building instance change detection from large-scale aerial images using convolutional neural networks and simulated samples,” Remote Sensing, vol. 11, no. 11, 2019.). However, the vast majority of existing approaches assume that the compared image or point cloud frames are precisely registered since either the sensors are motionless or the accurate position and orientation parameters of the sensors are known at the time of each measurement.
Mobile and terrestrial Lidar sensors can obtain point cloud streams providing accurate 3D geometric information in the observed area. Lidar is used in autonomous driving applications supporting the scene understanding process, and it can also be part of the sensor arrays in ADAS (advance driver assistance) systems of recent high-end cars. Since the number of vehicles equipped with Lidar sensors is rapidly increasing on the roads, one can utilize the tremendous amount of collected 3D data for scene analysis and complex street-level change detection. Besides, change detection between the recorded point clouds can improve virtual city reconstruction or Simultaneous Localization and Mapping (SLAM) algorithms (see C.-C. Wang and C. Thorpe, “Simultaneous localization and mapping with detection and tracking of moving objects,” in Int. Conf. on Robotics and Automation (ICRA), vol. 3, 2002, pp. 2918-2924.)
Processing street-level point cloud streams is often a significantly more complex task than performing change detection in airborne images or Lidar scans. From a street-level point of view, one must expect a larger variety of object shapes and appearances, and more occlusion artifacts between the different objects due to smaller sensor-object distances.
Also, the lack of accurate registration between the compared 3D terrestrial measurements may mean a crucial bottleneck for the whole process, for two different reasons: First, in a dense urban environment, GPS/GNSS-based accurate self-localization of the measurement platform is often not possible (see B. Nagy and C. Benedek, “Real-time point cloud alignment for vehicle localization in a high resolution 3d map,” in ECCV 2018 Workshops, LNCS, 2019, pp. 226-239.). Second, the differences in viewpoints and density characteristics between the data samples captured from the considered scene segments may make automated point cloud registration algorithms less accurate (see B. Nagy and C. Benedek, “Real time point cloud alignment for vehicle localization in a high resolution 3d map,” in ECCV 2018 Workshops, LNCS, 2019, pp. 226-239.). As one of the most fundamental problems in multitemporal sensor data analysis, change detection has had a vast bibliography in the last decade. Besides methods working on remote sensing images, several change detection techniques deal with terrestrial measurements, where the sensor is facing towards the horizon and is located on or near the ground. In these tasks optical cameras (see A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, “Changenet: A deep learning architecture for visual change detection,” in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145.) and rotating multi-beam Lidars (see Y. Wang, Q. Chen, Q. Zhu, L. Liu, C. Li, and D. Zheng, “A survey of mobile laser scanning applications and key techniques over urban areas,” Remote Sensing, vol. 11, no. 13, pp. 1-20, 2019.) are frequently used, solving problems related to surveillance, map construction, or SLAM algorithms (see W. Xiao, B. Vallet, K. Schindler, and N. Paparoditis, “Street- side vehicle detection, classification and change detection using mobile laser scanning data,” ISPRS J. Photogramm. Remote Sens., vol. 114, pp. 166-178, 2016.).
We can categorize the related works based on the applied methodology they use for change detection. Many approaches are based on handcrafted features, such as a set of pixel- and object-level descriptors (see P. Xiao, X. Zhang, D. Wang, M. Yuan, X. Feng, and M. Kelly, “Change detection of built-up land: A framework of combining pixel-based detection and object-based recognition,” ISPRS J. Photogramm. Remote Sens., vol. 119, pp. 402-414, 2016.), occupancy grids (see W. Xiao, B. Vallet, M. Bredif, and N. Paparoditis, “Street environment change detection from mobile laser scanning point clouds,” ISPRS J. Photogramm. Remote Sens., vol. 107, pp. 38-49, 9 2015.), volumetric features, and point distribution histograms (see W. Xiao, B. Vallet, K. Schindler, and N. Paparoditis, “Street-side vehicle detection, classification and change detection using mobile laser scanning data,” ISPRS J. Photogramm. Remote Sens., vol. 114, pp. 166-178, 2016.), but they all need preliminarily registered inputs. Only a few feature-based techniques deal with compensating small misregistration effects, such as R. Qin and A. Gruen, “3D change detection at street level using mobile laser scanning point clouds and terrestrial images,” ISPRS J. Photogramm. Remote Sens., vol. 90, pp. 23-35, 2014., where terrestrial images and point clouds are fused to perform change detection. Neural network-based change detection techniques can handle in general more robustly the variances originated from viewpoint differences, most frequently using Siamese network architectures. However, prior approaches solely focus here on visual change detection problems in aerial (see Y. Zhan, K. Fu, M. Yan, X. Sun, H. Wang, and X. Qiu, “Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1845-1849, 2017.) or street-view (see A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, “Changenet: A deep learning architecture for visual change detection,” in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145. and E. Guo, X. Fu, J. Zhu, M. Deng, Y. Liu, Q. Zhu, and H. Li, “Learning to measure change: Fully convolutional Siamese metric networks for scene change detection,” 2018, arXiv: 1810.09111) optical image pairs.
Accordingly, change detection is sometimes applied on simple - mostly registered - aerial photographs (remote sensing applications, see above).
Furthermore, in US 2015/0254499 A1 change detection method for point clouds is disclosed. In this prior art method aligning of the two different point clouds is done in order to make change detection.
DISCLOSURE OF THE INVENTION
The primary object of the invention is to provide a training method for training a change detection system, a training set generating method therefor, and a change detection system which are free of the disadvantages of prior art approaches to the greatest possible extent. Furthermore, an object of the invention is to provide solution for these methods and system applicable for a coarsely registered pair of 3D information data blocks (data blocks may e.g. be point clouds, see herebelow for some specific features). More specifically, the object of the invention (i.e. of our proposed solution) is to extract changes between two coarsely registered sparse Lidar point clouds.
The object of the method is to provide a machine learning based solution to compare only coarsely (approximately) registered 3D point clouds made from a given 3D environment and to determine the changed environmental regions without attempting to specify the registration (more generally, without performing registration).
The objects of the invention can be achieved by the training method for training a change detection system according to claim 1, the change detection system according to claim 6, and the training set generating method according to claim 7. Preferred embodiments of the invention are defined in the dependent claims.
The task of change detection is solved by the invention e.g. for real Lidar point cloud- (generally, for 3D information data block) based change detection problems.
About registration issues: Most of the aforementioned prior art methods require that the compared measurements are either recorded from a static platform, or they can be accurately registered into a joint coordinate system by using external navigation sensors, and/or robust image/point cloud matching algorithms.
The later registration step is critical for real-world 3D perception problems, since the recorded 3D point clouds often have strongly inhomogeneous density, and the blobs of the scanned street-level objects are sparse and incomplete due to occlusions and the availability of particular scanning directions only. Under such challenging circumstances, conventional point-to-point, patch-to-patch, or point-to-patch correspondence-based registration strategies often fail (see R. Qin, J. Tian, and P. Reinartz, “3D change detection - Approaches and applications,” ISPRS J. Photogramm. Remote Sens., vol. 122, no. Cd, pp. 41-56, 2016.).
To our best knowledge, this description is the first approach to solve the change detection problem among sparse, coarsely registered terrestrial point clouds, without needing an explicit fine registration step. Our proposed - preferably deep learning-based - method can extract and combine various low-level and high-level features throughout the convolutional layers, and it can learn semantic similarities between the point clouds, leading to its capability of detecting changes without prior registration.
Preferably, according to the invention, a deep neural network-based change detection approach is proposed, which can robustly extract changes between sparse point clouds obtained in a complex street-level environment, i.e. the invention is preferably a deep (learning) network for change detection (alternatively, for detecting changes) in coarsely registered point clouds. As a key feature, the proposed method does not require precise registration of the point cloud pairs. Based on our experiments, it can efficiently handle up to 1m translation and 10° rotation misalignment between the corresponding 3D point cloud frames.
The method according to the invention, preferably following human perception, provides a machine-learning-based method for detecting and marking changes in discrete, only coarsely (approximately) registered point clouds.
According to the invention it is preferably assumed that point clouds represent the environment to scale (proportionately) and have a scale factor equal to that of the world coordinate system (for example, the distance between two characteristic points is the same as in the environment e.g. expressed in centimetres). Furthermore, the point clouds were created (collected, generated by measurement) in the same area, i.e. , from almost the same reference point and with a similar orientation, however, their exact reference positions (typically corresponds to the place of data collection) and orientations relative to each other are unknown.
BRIEF DESCRIPTION OF THE DRAWINGS
Preferred embodiments of the invention are described below by way of example with reference to the following drawings, where
Figs. 1A-1B illustrate an exemplary input image pair,
Figs. 1C-1D illustrate target change images for the inputs of Figs. 1A-1B,
Fig. 2 illustrates the internal structure of the change detection generator module in an embodiment with showing inputs and outputs,
Fig. 3A shows an embodiment of the training method according to the invention,
Fig. 3B shows a flowchart of the training set generating method according to the invention,
Figs. 4A-4E are illustrations for the invention in a scene,
Fig. 5A is a fused target change image for Figs. 1C-1D,
Fig. 5B is a fused output change image for Figs. 1A-1B,
Figs. 5C-5D are results for input of Figs. 1A-1B obtained by prior art approaches,
Figs. 6A-6H are illustrations for the invention in an exemplary scene,
Figs. 6I-6J are shadow bars for Figs. 6A-6H, and
Fig. 7 is a diagram showing the comparison of the results obtained by an embodiment of the change detection system according to the invention and prior art techniques.
MODES FOR CARRYING OUT THE INVENTION
In the following, after introducing general aspects of the training method according to the invention, some embodiments and other details of the proposed method will be given (more particularly details of the training method according to the invention, as well as the training set generating method). This detailing of the proposed method contains subparts relating to range image representation, ChangeGAN architecture (see below for the meaning of ChangeGAN), training (of) ChangeGAN and change detection dataset (training set generated by the training set generating method according to the invention, which is a further method according to the invention). This is followed by a part of the description where the experiments (related to the invention) are described.
Some embodiments of the invention relate to a training method for training a change detection system for detecting change for a coarsely registered pair of (or alternatively, having) a first 3D information data block and a second 3D information data block (a 3D information data block may be e.g. a range image or a point cloud, see below for details), wherein
- the change detection system comprises a change detection generator module (see its embodiments in Fig. 2 as well as in Fig. 3A) based on machine learning (e.g. neural network in Fig. 3A) and adapted for generating a change data block (e.g. a change image, but changes assigned to points of a point cloud are also conceivable) for a coarsely registered pair of a first 3D information data block and a second 3D information data block (the change detection generator module is naturally - like any machine learning module - adapted for generating a change data block at the very beginning of the training method, since the untrained module starts from a status where preliminary change data blocks can be generated; however, these change data blocks get better and better - i.e. become more in line with the training goals - during the at least one training cycle, see below, of the training method), and
- in course of the training method a discriminator module based on machine learning is applied (according to the invention, this is necessary to train the change detection generator module; it has naturally a role during the training method, but no role after the training; see Fig. 3A for an embodiment of the discriminator module labelled as discriminator network).
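For illustration only, the two competing modules can be sketched as follows in PyTorch. This deliberately tiny sketch is not the ChangeGAN architecture described elsewhere in this document (no Siamese branches, no U-net-style skip connections, no STN blocks); it only shows the assumed input/output contract of the change detection generator module (range-image pair in, change-mask pair out) and of the discriminator module (range images plus masks in, a realness score out). All class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TinyChangeGenerator(nn.Module):
    """Illustrative stand-in for a change detection generator module: it maps
    a coarsely registered range-image pair to a pair of change masks."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 2, 1), nn.Sigmoid(),      # one mask channel per input image
        )

    def forward(self, img1, img2):
        masks = self.net(torch.cat([img1, img2], dim=1))
        return masks[:, :1], masks[:, 1:]           # change masks for inputs 1 and 2

class TinyDiscriminator(nn.Module):
    """Judges whether a (range-image pair, change-mask pair) combination looks
    like ground truth or like generator output; returns a single logit."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 1, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, img1, img2, mask1, mask2):
        return self.net(torch.cat([img1, img2, mask1, mask2], dim=1))
```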
About the preferred manifestations of the 3D information data blocks, see the followings. In the training method the training set preferably comprises range images as 3D information data block and target change images as target change data blocks. However, the training set generating method preferably starts from a point cloud as a base 3D information data block of a registered base pair (see below), and preferably ends in a training set utilizable for the training method, i.e. range images and corresponding target change images.
The training method according to the invention can be interpreted based on Fig. 3A (see far below even more details about Fig. 3A): the figure illustrates the first 3D information data block and the second 3D information data block labelled by inputs.
According to the invention, in at least one training cycle of the training method for a plurality of coarsely registered pairs of first 3D information data block and second 3D information data block (every coarsely registered pair has a respective first 3D information data block and a respective second 3D information data block) and a plurality of respective target change data blocks of a training set (the training method is for training a change detection system for detecting change for a coarsely registered pair; for training this way we naturally need a plurality of coarsely registered pairs and, according to the training strategy, ground truth images corresponding thereto, which are target change data blocks (e.g. target change images) in the framework of the invention; the plurality of coarsely registered pairs and the corresponding plurality of target change data blocks constitute the training set, cf. the uppermost row of flowchart blocks in Fig. 3A; a target change data block or target change data block pair - see below - naturally corresponds to a specific coarsely registered pair: these are of near the same scene/location, having thus correlated content, see also below)
- change data blocks are generated by means of the change detection generator module for the plurality of coarsely registered pairs of the training set (see operational step S180 in Fig. 3A in an embodiment) and a generator loss contribution is generated (S185) based on corresponding (auxiliary) combinations of a change data block and a target change data block (see branches: first operational substep S185a to take the generated change data block and second operational substep S185b to take the target change data block into account in the generator loss; substeps S185a and S185b are parts of operational step S185 in the embodiment of Fig. 3A),
- a discriminator loss contribution is generated by applying the discriminator module on a plurality of coarsely registered pairs of the training set, as well as corresponding target change data blocks of the training set and corresponding change data blocks (see operational step S190 in Fig. 3A in an embodiment), and
- the change detection generator module is trained by a combined loss obtained from a summation of the generator loss contribution and the discriminator loss contribution, wherein in the summation at least one of the generator loss contribution and the discriminator loss contribution is multiplied by a respective loss multiplicator (see operational step S195 corresponding to training in Fig. 3A in an embodiment, where in a combined loss 235 the generator loss contribution 225 is multiplied by a λ loss multiplicator 227 and a discriminator loss contribution 230 is simply added; the loss multiplicator has a predetermined/predefined multiplicator value; in other words, training is performed in operational step S195, accordingly, it is a training step).
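In the embodiment of Fig. 3A, where only the generator loss contribution is multiplied by the loss multiplicator λ, the combined loss used in the training step can be written (with the calligraphic loss symbols introduced here only for illustration) as

\mathcal{L}_{\mathrm{combined}} = \lambda \cdot \mathcal{L}_{\mathrm{generator}} + \mathcal{L}_{\mathrm{discriminator}},

while in the general formulation either contribution, or both, may carry its own multiplicator.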
The above steps of the training cycle are performed one after the other in the above order. Furthermore, after the last step (training step), the steps are again started from the first (change data block generation), if a next iteration cycle is started (see below for the termination of the training method). This is clear also from their inputs and outputs: change data block is generated in the first step and used in the second step as a possible input of the discriminator module. The generator and discriminator loss contributions are generated in the first and second step, respectively, and play the role of an input in the third step. The above introduction of the training method according to the invention is interpreted in some aspects in the following, where also some optional features are given or just touched upon for the sake of helping the interpretation of the above introduction. First of all, it is noted that the training method is for (in particular, suitable for) training a change detection system. The system is trained for detecting change, i.e. detection of any change which can be revealed between its inputs. By the help of the system, the change is detected for a coarsely registered pair of 3D information data blocks.
As an output of the change detection generator module a (i.e. at least one) change data block is generated. Preferably, a pair of change data blocks (change images) are generated as detailed in some examples, but in some applications (e.g. the content of any of change data blocks is not relevant), it is enough to generate only one change data block. When this single change data block is processed by the discriminator module, it may be accompanied by a single target change data block.
During the training, i.e. in a training cycle, three main steps are performed:
- change data block and generator loss contribution generation;
- discriminator loss contribution generation; and
- training of the change detection generator module (naturally, in this last step also the discriminator module is trained; it is trained by the discriminator loss contribution).
The training cycle is formulated for a plurality of coarsely registered pairs and a plurality of respective target change data blocks of a training set (the latter is many times called an epoch; in a training cycle a batch of training data - being a part of the epoch - is utilized, after which the machine learning module is trained by the calculated loss, e.g. weights of a neural network are updated; see also below), and this approach is followed in the steps in the above description of the invention, but it is worth showing some details of the steps for a single input and output, i.e. to illustrate the method for a single processing.
- accordingly, one (or two, i.e. a pair) change data block is generated for a coarsely registered pair, and by the help of this change data block and the corresponding target change data block (being also the part of the ground truth) a member of the generator loss contribution can be calculated (e.g. according to the definition of the L1 loss as given in an example below; the whole loss contribution - here and below for the discriminator - can be generated based on the plurality of coarsely registered pairs); in the generator loss contribution, many times a difference is taken into account for a corresponding combination (the word ‘combination’ is brought only for showing which are the corresponding entities) of the change data block and the target change data block, thus it may be formulated also on this basis;
- as in the above step, generation of the discriminator loss contribution is formalized for the plurality of coarsely registered pairs of the training set, for a single processing - as illustrated in Fig. 3A - the discriminator is applied on corresponding sets of a coarsely registered pair, a target change data block and a change data block (see below for details);
- in the training step training is also performed based on the plurality of coarsely registered pairs and the plurality of respective target change data blocks, since these all have taken into account for determining the generator and discriminator loss contributions (in other words, training is performed after a batch is processed in the above steps of a training cycle); furthermore, it is noted that for the summation, the multiplicator may be applied to any of the contributions or for both.
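As an illustration of the generator loss contribution mentioned in the first item above, an L1-type term over a corresponding change data block / target change data block combination can be written as

\mathcal{L}_{\mathrm{generator}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \Big( \big| M_1(p) - M_1^{GT}(p) \big| + \big| M_2(p) - M_2^{GT}(p) \big| \Big),

where M_1 and M_2 denote the generated change masks, M_1^{GT} and M_2^{GT} the corresponding target change masks, and Ω the pixel domain; this notation is introduced here only for illustration, and the description does not prescribe this exact form.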
Note the following about the limitations (i.e. managing the termination) of the training method. As it is often the case in training methods, an optimizer will determine, based on the loss, at which point (after which training cycle) in the training method it is advantageous to stop (terminate) the training.
The λ parameter (loss multiplicator) balances the loss contributions (afterwards it can no longer be seen which loss contributes how much to the combined loss). The appropriate value of the loss multiplicator is preferably predetermined in a way that the training is performed for several values of the loss multiplicator, and the appropriate value is selected based on the results. During training, we typically only follow (investigate) the value of the global loss (the global loss function is preferably a differentiable function). For the global loss, we may require that its value falls below a threshold, but preferably we look for a small change from the previous value or values, as oscillations may occur in the evolution of the loss value. The value of the loss multiplicator is a function of many aspects. Since we are aiming for a minimum global loss, it is chosen (see below) to provide a balance between the two loss contributions. At the beginning of the training the competing networks are essentially random networks. It can be checked by tests on the training method whether the parameter value is able to reach an appropriate balance between the generator and discriminator loss contributions.
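The three steps of a training cycle, together with the role of the λ loss multiplicator, can be sketched schematically as follows. This is an illustrative PyTorch-style sketch under assumed choices (an L1 generator term and a binary cross-entropy adversarial term); the description above does not prescribe these exact loss functions, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def training_cycle(generator, discriminator, batches, lam, g_opt, d_opt):
    """One training cycle following the three steps described above;
    `lam` is the loss multiplicator (lambda)."""
    for img1, img2, gt1, gt2 in batches:      # coarsely registered pairs + targets
        # step 1: generate change masks and the generator loss contribution
        m1, m2 = generator(img1, img2)
        gen_loss = F.l1_loss(m1, gt1) + F.l1_loss(m2, gt2)

        # step 2: discriminator loss contribution on target vs. generated masks
        real = discriminator(img1, img2, gt1, gt2)
        fake = discriminator(img1, img2, m1.detach(), m2.detach())
        d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
                  + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # step 3: train the generator with the combined loss
        # (generator contribution scaled by lambda, adversarial term added)
        adv = discriminator(img1, img2, m1, m2)
        combined = (lam * gen_loss
                    + F.binary_cross_entropy_with_logits(adv, torch.ones_like(adv)))
        g_opt.zero_grad()
        combined.backward()
        g_opt.step()
```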
The loss contributions are generated based on the outputs of the generator and discriminator modules. Separate modules may be dedicated for generating loss (loss generating modules), but this task may be considered also as a task of the generator and discriminator, themselves.
From the name of the change detection generator module, the generator word may be skipped, i.e. it may be simply called change detection module. Moreover, the generator and discriminator modules could be called first and second competing modules beside that their functions are defined (what output(s) are generated based on the inputs).
The aforementioned devices (tools) for surveying the 3D environment (Lidar laser scanners, infrared scanners, stereo and multi-view camera systems) and other such devices provide 3D information (i.e. a 3D information data block) of the environment and any representation (point cloud, range image, etc.) can be extracted from it.
Under the term “based on machine learning” is meant that the respective module is realized by the required element(s) of a machine learning approach, preferably by neural networks as shown in many examples in the description. All of the machine learning based modules applied in the invention can be realized by the help of neural networks, but the use of alternative machine learning approaches is also conceivable.
For the interpretation of the concept of two 3D information data blocks being coarsely registered to each other, the following is noted. The meaning of being ‘coarsely registered’ is clearly differentiated from being ‘registered’, beyond the common part that there is not too large a difference in the place and orientation between the two data blocks (e.g. range images or point clouds) which are under investigation; see below for details:
- ‘coarsely registered’ relates to a data block pair (e.g. a range image pair or a point cloud pair) which is not registered, meaning more specifically that the registration is not performed or the registration information is not utilized, i.e. the translation and/or rotation which is necessary to bring the images of the image pair into alignment (i.e. their common/overlapping part, where however there can naturally be changes) is not determined or not utilized;
- ‘registered’ relates to a data block pair (e.g. a range image pair or a point cloud pair) which are registered, i.e. the translation and/or rotation which is necessary to bring the images of the image pair into alignment is determined and utilized during the processing of the image pair (registration is another approach based explicitly on using the registration information for processing).
Accordingly, the change detection system according to the invention becomes - by means of the training method - adapted for detecting changes for a coarsely registered pair, for which the translation and/or rotation by which these could be aligned is not determined, but if this translation and/or rotation would be determined the value of these would be restricted. The data blocks (images) of a coarsely registered image pair are thus unprocessed data blocks (images) from this point of view, i.e. the complicated step of registering is not performed on them (or the registration information is not used, this latter can also be meant under the meaning of non-registered).
This is highly advantageous, since the input data blocks (i.e. the coarsely registered pair) can be subjected to the change detection system according to the invention without the application of this complicated preparation and thus the change detection system is ready for application on input data blocks (images) which are unprocessed (raw) in this respect (i.e. becomes suitable for applying on this kind of pair by the help of the training method according to the invention).
An alternative term for "coarsely registered" may be "data block (e.g. point cloud or range image) pair with restricted relative translation and/or rotation". The word ‘relative’ is clearly meant on the data blocks of the data block pair, i.e. the translation and/or rotation is interpreted between the data blocks of the pair. For the restricted translation and/or rotation, see in an example that it is preferably up to ±1m translation and/or an up to ±10° rotation (in the real coordinate system corresponding to them). These limits can thus preferably be part of the 'coarsely registered' requirement for the inputs (however, as a consequence of the definition itself, this property of the input data pair is not checked in the absence of registration), since these data also specify how much they are translated/rotated with respect to each other, i.e. these parameters specify how different the two inputs are from each other, how much overlap there is between them. This is what the change detection system becomes adapted for, i.e. if it receives inputs with a very large translation and/or rotation, it may perform worse.
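To illustrate the preferred limits quoted above (up to ±1 m translation and up to ±10° rotation), the following numpy sketch applies such a random rigid transform to one point cloud of a registered pair, producing a coarsely registered pair. The restriction of the translation to the horizontal plane and of the rotation to the z-axis follows the preferences mentioned in this description; the function itself is only an illustrative assumption.

```python
import numpy as np

def coarse_misalign(points, max_shift=1.0, max_rot_deg=10.0, rng=None):
    """Apply a random z-axis rotation (up to +/- max_rot_deg degrees) and a
    random horizontal translation (up to +/- max_shift metres) to an (N, 3)
    point cloud, turning a registered pair into a coarsely registered one."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    rot = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    shift = np.append(rng.uniform(-max_shift, max_shift, size=2), 0.0)
    return points @ rot.T + shift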
A registered point set means that it is spatially (or temporally) aligned, i.e. matched. If the exact registration is known, our method is not needed.
We use the term "coarsely registered" because indeed by preferably assuming that the offset difference is at most 1m and the rotation difference is at most 10°, we are expressing that they are coarsely registered in this sense, i.e. being spatially close, since comparison between two different data from completely unknown locations makes no sense.
In other words, by their spatial proximity in this sense we mean that they are "coarsely registered", the matching required for a registered set of points is not done on them or not used (i.e. it is as if it does not exist).
So, we can handle a certain degree of unregisteredness. The term „coarsely registered” may also be changed for „barely/slightly registered”.
Moreover, as it will be discussed in more detail below in connection with the STN module, the larger the value of the translation and/or the rotation, the task become harder for the preferably applied STN module to work, since the applicability of the STN module is restricted for lower values of translation and/or rotation (see for the details below). Of course, the content of the two input images (generally, data blocks) typically overlaps to a large extent, and they are essentially taken from the same scene. This typically relates to the basic similarity of the background and there are usually objects whose displacement relative to the two images is precisely what we are looking at with change detection, but of course there can also be changes in the background that are subject to change detection.
It is, furthermore noted, that the inputs to be processed may also include an image pair that is registered, i.e. has the data available to register it (e.g. translation and/or rotation for alignment), in which case these data are not used, as the input to the change detection system is only the coarsely registered pair itself.
In accordance with the above introduction of the training method, some embodiments of the invention relate to a change detection system adapted (as a consequence of being trained, see below) for detecting change for a coarsely registered pair of a first 3D information data block and a second 3D information data block, wherein the system
- is trained by any of the embodiments of the training method according the invention, and
- comprises the change detection generator module (which is thus trained as a part of the change detection system as detailed in connection with the training method, i.e. involving the discriminator module).
According to the description, the change detection system may also be called change detection architecture. It can be equated with its main (only) module in its trained state, the change detection generator module (in the trained state the system does not comprise the discriminator module), since these - i.e. the change detection system and the change detection generator module - have the same inputs (a coarsely registered pair) and the same output (change data block(s)).
As touched upon above, the input of the change detection system, as well as the respective part of the training set for the training method, is called first and second 3D information data blocks, and their pair is called a coarsely registered pair (it may also be called a coarsely registered (input) data block pair). A 3D information data block may be any image, such as a range image or depth image, bearing 3D information/3D location information of the parts of the environment (these are 3D information images, but they could also be called simply 3D images), or any other data structure bearing 3D information, such as a point cloud.
It is considered to be in the framework of the invention to use (to apply) a change detection system trained by the training method according to the invention on an (input) data block pair (in particular a coarsely registered data block pair, but it can be applied to whatever input is given to it) of a first 3D information data block and a second 3D information data block for change detection between them. In this case the change detection generator module of the trained change detection system is applied to the inputs and gives change data block(s) as an output.
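As a usage illustration of a trained change detection system, the following hypothetical snippet feeds one coarsely registered range-image pair to the (already trained) change detection generator module and binarizes the resulting change masks. The dummy generator, the tensor shapes and the 0.5 threshold are assumptions made only so that the snippet is self-contained.

```python
import torch
import torch.nn as nn

class DummyGenerator(nn.Module):
    """Stand-in for a trained change detection generator module: it maps a
    coarsely registered range-image pair to a pair of change masks."""
    def forward(self, img1, img2):
        return torch.rand_like(img1), torch.rand_like(img2)

generator = DummyGenerator()
img1 = torch.rand(1, 1, 128, 1024)   # normalized range image, first frame
img2 = torch.rand(1, 1, 128, 1024)   # normalized range image, second frame

generator.eval()
with torch.no_grad():
    mask1, mask2 = generator(img1, img2)          # predicted change masks
binary1, binary2 = mask1 > 0.5, mask2 > 0.5       # illustrative binarization
```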
Furthermore, in an embodiment
- the coarsely registered pair of the first 3D information data block and the second 3D information data block is constituted by a coarsely registered range image pair of a first range image and a second range image (so, the 3D information data block itself may be a range image), or
- a coarsely registered range image pair of a first range image and a second range image is generated from the coarsely registered pair of the first 3D information data block and the second 3D information data block before the application of the at least one training cycle (accordingly, as an option the range image may also be generated from another 3D information data block, which was e.g. originally a point cloud).
This naturally applies to the plurality of the coarsely registered pairs.
Also in other parts of the description, but mainly from this point, because the coarsely registered range image pair has been introduced as the input, we mainly use the expression "coarsely registered image pair" for "coarsely registered pair (of 3D information data blocks)", "change/target change images" instead of "change/target change data blocks", and consequently we use "image" instead of the more general "data block", and "point cloud" instead of "data block" in the case of training set generation.
All possible representations of the 3D information data blocks may be applied in the invention (it is applicable e.g. also to point clouds); applying range images as 3D information data blocks may be preferred since the preferably applied convolutions in the machine learning based modules handle such a format more easily (but convolution is also applicable to a point cloud). Since their use is preferred, many details are introduced in the following by illustrating them on range images. On the most general level these are termed 3D information data blocks, but all features introduced on a less general level and compatible with the most general level are considered to be utilizable on the most general level.
In a further embodiment a change mask image having a plurality of first pixels is generated by the change detection generator module as the change data block, and the target change data block is constituted by a target change mask image having a plurality of second pixels, wherein to each of the plurality of first pixels and each of the plurality of second pixels the presence of a change or the absence of a change is assigned (accordingly, we preferably have a change mask image and a target change mask image as a manifestation of the change data block and the target change data block; for the target or ground truth change mask, see Figs. 1C and 1D; a fused change mask is illustrated in Fig. 5B; change mask images and target change mask images are preferably 2D images).
Accordingly, in an embodiment the change data block is constituted by a change mask image, and target change mask images are used as target change data blocks. In this embodiment, according to the above definition, the presence or absence of a change is denoted in every pixel. E.g. such a mask image may be applied in which '1' denotes a change and '0' denotes the other regions where there is no change (e.g. where a small number of changes are identified in an image, there is a small number of '1' pixels in the image and '0' in the other regions). This illustrates what is meant by "presence of a change or absence of a change is assigned (denoted) to each pixel": changes have to be differentiated from every other region where no change has been identified.
In an embodiment the above two embodiments can be combined. Accordingly, in an embodiment range images and mask images are utilized. In this case all of the data “circulated” during the training process is represented by 2D images: the generator module receives 2D range images from which it generates 2D mask images. The generator loss contribution can be calculated based on the mask images. The discriminator module also handles 2D images in this case (also the target is a mask image). This choice is further detailed below and it can be utilized advantageously in a highly effective way.
Furthermore, in the above combined embodiment a pair of change images is applied (i.e. for both "directions", see below), and, accordingly, also a pair of target change images is utilized for them in the course of the training method. The utilization of change data block pairs (i.e. that we have pairs) can be applied independently of the range image and mask image approach.
In the following, some (optional) aspects of the invention are shown starting from point clouds.
Several Lidar devices, such as the Rotating multi-beam (RMB) sensors manufactured by Velodyne and Ouster, can provide high frame-rate point cloud streams containing accurate, but relatively sparse 3D geometric information from the environment. These point clouds can be used for infrastructure monitoring, urban planning (see B. Galai and C. Benedek, “Change detection in urban streets by a real time Lidar scanner and MLS reference data,” in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.) and SLAM (see C.-C. Wang and C. Thorpe, “Simultaneous localization and mapping with detection and tracking of moving objects,” in Int. Conf. on Robotics and Automation (ICRA), vol. 3, 2002, pp. 2918-2924.).
To formally define the change detection task according to the invention, several considerations should be taken. First, both input point clouds (P1 and P2) may contain various dynamic or static objects, which are not present in the other measurement sample. Second, due to the lack of registration, we cannot use a single common voxel grid for marking the locations of changes between the two point clouds (cf. voxels used in training set generation; there we can use voxels because we know the transformation, but not on real data because in this case we do not know the transformation).
Instead, using a μ(.) point labelling process, we separately mark each point p ∈ P1 ∪ P2 as changed (μ(p) = ch) or unchanged background (μ(p) = bg) point, respectively. We label a point p1 ∈ P1 as changed if the surface patch represented by point p1 in P1 is not present (changed or occluded) in point cloud P2 (the label of a point p2 ∈ P2 is similarly defined). Accordingly, the labelling is done for P1 and P2 separately. Results of the proposed classification approach (i.e. the trained change detection system with its module, the trained change detection generator module, which could also be called classification module instead of generator module) for a sample 3D point cloud pair are demonstrated in Figs. 4A-4E, see also Figs. 1A-1D. It is furthermore noted that generation of a change mask image can be considered as a classification task: it has to be classified whether there is a change or not at a point (e.g. pixel).
About the range image representation preferably applied in the framework of the invention, see the following. Our proposed solution preferably extracts changes between two coarsely registered Lidar point clouds in the range image domain. For example, creating a range image from a rotating multi-beam (RMB) Lidar sensor's point stream is straightforward (see C. Benedek, "3d people surveillance on range data sequences of a rotating lidar," Pattern Recognition Letters, vol. 50, pp. 149-158, 2014, depth image analysis.) as its laser emitter and receiver sensors are vertically aligned, thus every measured point has a predefined vertical position in the image, while consecutive firings of the laser beams define their horizontal positions. Geometrically, this mapping is equivalent to transforming the representation of the point cloud from the 3D Cartesian to a spherical polar coordinate system, where the azimuth and polar angles correspond to the horizontal and vertical pixel coordinates, and the distance is encoded in the corresponding pixel's 'intensity' value.
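As an illustration of the above projection (not part of the claimed method), a minimal Python sketch is given below. It assumes an (N, 3) numpy array of sensor-frame coordinates; the row index is approximated here from the elevation angle, whereas for an RMB sensor the laser ring index directly gives the row; the image size and range limit are example values taken from the description, and all names are hypothetical.

```python
import numpy as np

def point_cloud_to_range_image(points, height=128, width=1024, max_range=40.0):
    """points: (N, 3) array of x, y, z coordinates in the sensor frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                       # distance encoded as pixel 'intensity'
    azimuth = np.arctan2(y, x)                            # horizontal angle -> image column
    elevation = np.arcsin(z / np.maximum(r, 1e-6))        # vertical angle -> image row (approximation)
    cols = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    elev_min, elev_max = elevation.min(), elevation.max()
    rows = ((elevation - elev_min) / (elev_max - elev_min + 1e-6) * (height - 1)).astype(int)
    image = np.zeros((height, width), dtype=np.float32)   # 0 where no measurement exists
    image[rows, cols] = np.clip(r / max_range, 0.0, 1.0)  # normalized range values
    return image
```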
Note that range image mapping can also be implemented for other (non-RMB) Lidar technologies, such as for example Lidar sensor manufactured by Livox. Using appropriate image resolution, the conversion of the point clouds to 2D range images is reversible, without causing information loss. Besides providing a compact data representation, using the range images makes it also possible to adopt 2D convolution operations by the used neural network architectures.
The proposed deep learning approach in an embodiment takes as input two coarsely registered 3D point clouds P1 and P2 represented by range images I1 and I2, respectively (shown in Figs. 1A and 1B) to identify changes. Our architecture assumes that the images I1 and I2 are defined over the same pixel lattice S, and have the same spatial height (h) and width (w) dimensions (i.e. the two input range images have the same resolution, the two images have the same number of horizontal and vertical pixels).
Usually, change detection algorithms working on multitemporal image pairs (see A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, “Changenet: A deep learning architecture for visual change detection,” in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145.) explicitly define a test and a reference sample, and changes are interpreted from the perspective of the reference data: the resulting change mask marks the image regions which are changed in the test image compared to the reference one.
However, this approach cannot be adopted in our case. It is not relevant to assign a single binary change/background label to the pixels of the joint lattice S of the range images, as they may represent different scene locations in the two input point clouds (because of the coarsely registered nature). For this reason, we preferably represent the change map by a two-channel mask image over S, so that to each pixel s ∈ S we assign two binary labels Λ1(s) and Λ2(s) (change masks separately for P1 and P2 as mentioned above, i.e. an own change mask is generated for each of P1 and P2). Following our change definition used earlier in 3D, for i ∈ {1,2}, Λi(s) = ch encodes that the 3D point pi ∈ Pi projected to pixel s should be marked as change in the original 3D point cloud domain of Pi, i.e. μ(pi) = ch (see Figs. 1C and 1D).
It is hereby noted that some of the figures were originally coloured (not greyscale); such figures are Figs. 1A-1D, Figs. 4A-4E, Figs. 6A, 6C, 6E, 6G. The different colours can - necessarily - be differentiated also in greyscale. Accordingly, in some places the colours are referenced, but only to an extent that can be interpreted based on the figures transformed to greyscale.
Figs. 1A-1D illustrate the input data representation for the training method according to the invention. Figs. 1A-1B show exemplary range images I1, I2 (being the realisation of first and second 3D information data blocks 100a and 100b in this embodiment) from a pair of coarsely registered point clouds P1 and P2. Figs. 1C-1D show binary ground truth change masks Λ1, Λ2 (being the realisation of first and second target change data blocks 120a and 120b in this embodiment) for the range images I1 and I2, respectively (see also below). A rectangle 109 marks the region also displayed in Figs. 6A-6H (see also Figs. 5A-5D).
Figs. 1A and 1B show range images for an exemplary scene. These range images are obtained from point clouds by projection. The range images represent range (i.e. depth) information: in the images of Figs. 1A-1B the darker shades correspond to farther parts of the illustrated scene, lighter shades correspond to closer parts thereof, and those parts from which no information has been obtained are represented by black. Herebelow, some details are given about the content shown in Figs. 1A-1B for deeper understanding.
In the 3D information data blocks 100a, 100b of Figs. 1A-1B some parts corresponding to identifiable objects are marked by reference numbers:
- a bus 101a is marked in Fig. 1A which becomes the bus 101b in Fig. 1B (the bus has moved to another position and possibly turned, therefore it is seen shorter in Fig. 1B);
- there is a car 102a beside the bus 101a in Fig. 1A; in Fig. 1B a car 108b can be seen in another position, which can be another car or the same car as the car 102a;
- cars 105a, 106a - possibly parking - are at the same place in Fig. 1A as cars 105b, 106b in Fig. 1B;
- a car 103a can be seen in Fig. 1A which does not move (i.e. it is the same as the car 103b in Fig. 1B), but a car 104b appears on its right in Fig. 1B;
- a car 107a of Fig. 1A cannot be seen in Fig. 1B.
Figs. 1C and 1D show target change data blocks 120a and 120b (i.e. target change images) corresponding to data blocks 100a and 100b, thus the content of Figs. 1C-1D can be interpreted based on the above content information. Fig. 1C is coloured (grey, not black; black is the background in Figs. 1C-1D) at those parts
- which are contained in Fig. 1A but not in Fig. 1B, or
- which are at a different distance. Accordingly, the following can be observed in the target change images corresponding to target change data blocks 120a and 120b of Figs. 1C and 1D:
- the bus 101a moved, thus the target change data block 120a shows the back of the bus 101a, i.e. a part 111a (this was contained in Fig. 1A but not in Fig. 1B); however, the target change data block 120b also marks this place as a part 111b, since - because of the movement of the bus 101a - a car becomes visible (it was occluded by the bus 101a in Fig. 1A), together with the background (e.g. a wall) behind the car;
- it can be interpreted by a similar approach that a part 112a corresponding to the car 102a of Fig. 1A is shown in the target change data block 120a, and a background-like part 112b can be seen in the target change data block 120b in the place from which the car 102a has moved (a coloured/grey part can also be seen in Fig. 1D where the car 108b is visible in Fig. 1B);
- similarly, a background-like part 114a can be seen in the target change data block 120a since the car 104b appears in Fig. 1B, and a car-like part 117a can be seen in the target change data block 120a while a background-like part 117b can be seen in the target change data block 120b.
Next, our change detection task can be reformulated in the following way: our network extracts similar features from the range images I1 and I2, then it searches for high correlation between the features, and finally, it maps the correlated features to two binary change mask channels Λ1 and Λ2, having the same size as the input range images.
About the so-called ChangeGAN architecture, see the following.
For our purpose, we propose a new generative adversarial neural network (in particular a generative adversarial neural network-like - abbreviated, GAN-like - architecture, more specifically a discriminative method, with an additional adversarial discriminator as a regularizer), called ChangeGAN, whose architecture (structure) is shown in Fig. 2 in an embodiment. Fig. 2 thus shows a proposed ChangeGAN architecture, wherein the notations of the components are: SB1, SB2 - Siamese branches; DS - downsampling; STN - spatial transformation/transformer network; Conv2D - 2D convolution; Conv2DT - transposed 2D convolution. By referring to GAN or GAN-like, we also mean the general characteristic of a GAN that its two competing networks (generator, discriminator) learn simultaneously during training; at the same time, the generator can be considered the result of the training procedure, since the training aims at creating a generator with appropriate characteristics, which is capable of generating an output with the desired characteristics.
Since the main goal is to find meaningful correspondences between the input range images I1 and I2 (in the embodiment of Fig. 2, first and second 3D information data blocks 150a and 150b are for example images with 128x1024x1 dimensions, forming a coarsely registered pair 150), in an embodiment we have adopted a Siamese style (see J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. Lecun, C. Moore, E. Sackinger, and R. Shah, "Signature verification using a "Siamese" time delay neural network," International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, 1993.) architecture to extract relevant features from the input range image pairs. The Siamese architecture is designed to share the weight parameters across multiple branches, allowing us to extract similar features from the inputs and to decrease the memory usage and training time.
In the embodiment shown in Fig. 2 each branch 162a and 162b of the Siamese network comprises (in an example consists of) fully convolutional down-sampling (DS) blocks (i.e. DS1-DSn downsampling subunits 164a and 164b; the same machine learning units - i.e. branches 162a and 162b - are applied to the two inputs). The two branches 162a and 162b constitute a downsampling unit together. The first layer of the DS block is preferably a 2D convolutional layer with a stride of 2, which has a 2-factor down-sampling effect along the spatial dimensions. This step is followed by a batch normalization layer, and finally, we activate the output of the DS block using a leaky ReLU function (these are not shown). Next, we concatenate the outputs of the Siamese branches for all feature channels (see merge unit 166; this concatenation brings in a new channel index to obtain two outputs (change data blocks 175a, 175b) for the two inputs (3D information data blocks 150a, 150b)), and we apply a preferably 1 x 1 convolutional layer to aggregate the merged features (see conv2D unit 168).
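For illustration only, a minimal PyTorch sketch of one DS block and a weight-shared Siamese branch following the above description (stride-2 2D convolution, batch normalization, leaky ReLU, concatenation of the deepest branch outputs) is given below; the channel counts, the number of DS blocks and all names are assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

class DSBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)  # 2-factor down-sampling
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SiameseEncoder(nn.Module):
    """The same block instances are applied to both inputs, so the weights are shared."""
    def __init__(self, channels=(1, 32, 64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [DSBlock(c_in, c_out) for c_in, c_out in zip(channels[:-1], channels[1:])])

    def forward(self, x):
        features = []                        # per-level outputs, reused later as skip connections
        for block in self.blocks:
            x = block(x)
            features.append(x)
        return features

encoder = SiameseEncoder()
f1 = encoder(torch.zeros(1, 1, 128, 1024))   # branch applied to range image I1
f2 = encoder(torch.zeros(1, 1, 128, 1024))   # the same weights applied to range image I2
merged = torch.cat([f1[-1], f2[-1]], dim=1)  # concatenation of the deepest features of the two branches
```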
The following is noted in connection with the latter (i.e. the conv2D unit 168). The processing illustrated in Fig. 2 is a U-net-like processing, where the two "vertical lines" of the letter 'U' are the downsampling (it is a "double" vertical line because of the two branches) and the upsampling (upsampling unit 170, see also below). Furthermore, the "horizontal line" of the letter 'U' - connecting its two "vertical lines" as the letter is drawn - is illustrated by the conv2D unit 168. It is optional to have anything (any modules) in the "horizontal line", but there may also be more convolution units (i.e. if any, one or more convolution units are arranged after merging but before upsampling).
The second part (see upsampling unit 170) of the proposed model contains a series of transposed convolutional layers (see Conv2DT1-Conv2DTn upsampling subunits 172) to up-sample the signal from the lower-dimensional feature space to the original size of the 2D input images. Connections 167 interconnect the respective levels of downsampling and upsampling having the same resolution. Accordingly, in this U-net-like construction the Conv2DT1-Conv2DTn upsampling subunits 172 on the one hand receive input from a lower level (first from the conv2D unit 168) and - as another input - the output of the respective downsampling level. Thus, the output of the Conv2DT1-Conv2DTn upsampling subunits 172 is obtained using these two inputs.
Finally, preferably a 1 x 1 convolutional layer, activated with e.g. a sigmoid function, generates the two binary change maps Λ1 and Λ2 (in this embodiment the first and second change data blocks 175a and 175b are for example images having 128x1024x1 dimensions). The change maps are in general considered to be the output of the upsampling unit 170, in which the upsampling is performed. From the branches 162a, 162b to the upsampling unit 170, a change detection generator module 160 is denoted in Fig. 2.
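A minimal sketch of such an up-sampling path is given below (assumption-based, not the claimed implementation): transposed convolutions restore the original resolution, the encoder features of both branches are concatenated at each level (cf. connections 167), and a final 1 x 1 convolution with a sigmoid yields the two change map channels; the channel counts follow the encoder sketch above and are illustrative only.

```python
import torch
import torch.nn as nn

class ChangeDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.bottleneck = nn.Conv2d(512, 256, kernel_size=1)                                  # 1x1 aggregation of merged features
        self.up1 = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)           # 1/16 -> 1/8
        self.up2 = nn.ConvTranspose2d(128 + 2 * 128, 64, kernel_size=4, stride=2, padding=1)  # 1/8 -> 1/4
        self.up3 = nn.ConvTranspose2d(64 + 2 * 64, 32, kernel_size=4, stride=2, padding=1)    # 1/4 -> 1/2
        self.up4 = nn.ConvTranspose2d(32 + 2 * 32, 16, kernel_size=4, stride=2, padding=1)    # 1/2 -> full size
        self.head = nn.Conv2d(16, 2, kernel_size=1)                                           # two change map channels
        self.act = nn.LeakyReLU(0.2)

    def forward(self, f1, f2):
        x = self.bottleneck(torch.cat([f1[-1], f2[-1]], dim=1))
        x = self.act(self.up1(x))
        x = self.act(self.up2(torch.cat([x, f1[-2], f2[-2]], dim=1)))  # skip connections from both branches
        x = self.act(self.up3(torch.cat([x, f1[-3], f2[-3]], dim=1)))
        x = self.act(self.up4(torch.cat([x, f1[-4], f2[-4]], dim=1)))
        return torch.sigmoid(self.head(x))                             # per-pixel values in [0, 1]

# dummy feature lists with the shapes produced by the encoder sketch above
f1 = [torch.zeros(1, c, 128 // s, 1024 // s) for c, s in zip((32, 64, 128, 256), (2, 4, 8, 16))]
f2 = [t.clone() for t in f1]
masks = ChangeDecoder()(f1, f2)   # shape (1, 2, 128, 1024): the two change maps
```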
Preferably, to regularize the network and prevent over-fitting we use the Dropout technique after the first two transposed convolutional layers (not shown). As touched upon above, to improve the change detection result we have adapted an idea from U-net (see O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in Int. Conf. Medical Image Computing and Comp.-Ass. Intervention, 2015, pp. 234-241.) by adding higher resolution features from the DS blocks to the corresponding transposed convolutional layers. The branches of the Siamese network can extract similar features from the inputs. In our case, as the point clouds are coarsely registered, the same regions of the input range images might not be correlated with each other. Preferably, to achieve more accurate feature matching (i.e. it is preferably added for effective processing) we have added Spatial Transformer Network (STN) blocks (see M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," Advances in Neural Information Processing Systems (NIPS), 2015.; in the following: STN article) to both Siamese branches (see Fig. 2 for STN modules 165a, 165b). STN can learn an optimal affine transformation between the input feature maps to reduce the spatial registration error between the input range images. Furthermore, STN dynamically transforms the inputs, also yielding an advantageous augmentation effect.
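To illustrate how such an STN block may look inside a branch, a minimal PyTorch sketch is given below; the localization network regresses the six parameters of a 2x3 affine matrix (initialized to the identity transform) and resamples the feature map accordingly. Layer sizes and names are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STNBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d((4, 8)),
            nn.Flatten(),
            nn.Linear(16 * 4 * 8, 6),        # parameters of the 2x3 affine matrix
        )
        # start from the identity transform, i.e. "no correction" at the beginning of training
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # spatially transformed feature map

out = STNBlock(64)(torch.zeros(1, 64, 32, 256))  # output has the same shape as the input feature map
```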
Accordingly, depending on how large a translation and/or rotation is present between the images (generally, data blocks) of the coarsely registered image pair (generally, data block pair), it may be preferred to apply an STN module. Accordingly, in an embodiment of the training method a spatial transformer module (see spatial transformer modules 165a, 165b in the embodiment of Fig. 2, arranged in both branches of the change detection generator module illustrated there)
- based on machine learning and
- adapted for helping in processing any translation and/or rotation between the first 3D information data block and the second 3D information data block corresponding to a coarsely registered pair is applied, comprised in the change detection generator module.
So the STN module (configured according to the description of the STN article cited above) is built in between the DS modules (at the same level in the two branches, of course), because it can work well on the downsampled images, helping to handle possible transformations in the inputs. So, no STN module is arranged in the upscaling, but the architecture learns end-to-end to give good change data blocks (change images) for the inputs relative to the corresponding targets. The STN module is thus preferably a part of the change detection generator module, i.e. it is trained in the framework of the end-to-end training. Accordingly, the upsampling unit and the interconnections between the upsampling and downsampling units help to integrate the STN module.
The STN module, as disclosed above, is therefore designed to help handle the translations and rotations that may be present in the coarsely registered pair. It is also important to emphasize that it does this while learning, together with the other modules, to produce the right change data blocks (showing such possible translation/rotation) for the right inputs. By this we mean helping to process the translation and/or rotation (meanwhile the whole generator module does not eliminate them, so this registration error is also present in the change data blocks).
The STN module is of course also part (inseparable part) of the trained (completed) generator.
The STN module that is included in the generator as above only helps the operation of this generator, i.e. the generator preferably learns more efficiently if the STN module is included, but arranging the STN module is not necessarily required.
In an embodiment (as illustrated in Fig. 2)
- a downsampling unit (see downsampling unit 162 in Fig. 2; in the embodiment of Fig. 2 the downsampling unit has (directly) the inputs, i.e. the coarsely registered pair) having a first row of downsampling subunits (see downsampling subunits 164a, 164b) and interconnected with
- an upsampling unit (see upsampling unit 170; in the embodiment of Fig. 2 a merge unit 166 - because of the two branches 162a, 162b - and a conv2D unit 168 are inserted into the interconnection of these; furthermore, in the embodiment of Fig. 2 the change data blocks are obtained directly as outputs of the upsampling unit) having a second row of upsampling subunits (see upsampling subunits 172) and corresponding to the downsampling unit, is applied, wherein the downsampling unit and the upsampling unit are comprised in the change detection generator module and the spatial transformer module is arranged in the downsampling unit within the first row of downsampling subunits (i.e. inserted between two downsampling subunits; units may also be called modules). It is emphasized that the above combination of the U-net-like approach and the arrangement of the STN module is highly advantageous. The optional character of the arrangement of the STN module is also emphasized: the STN module is optionally used for helping the change detection generator module in its operation, but the generator module can operate (perform its task) also without an STN module.
STN preferably works with 2D images. So does preferably the invention, since 3D point clouds can preferably be represented as 2D range images (however, the application of both to 3D inputs can be straightforwardly solved). The position of the STN module within the feature extraction branch is preferably "in the middle" (among the downsampling subunits), because there it is already looking for transformations among more abstract features, not on raw high-resolution data.
A clear difference between the proposed change detection method (i.e. the training method and the corresponding change detection system) and the state-of-the-art is the adversarial training strategy, which has a regularization effect, especially on limited data. The other main difference is the preferably built-in spatial transformer network (in the training method), enabling the proposed model to learn and handle coarse registration errors. As detailed throughout the description, the building blocks of the training method and thus the corresponding change detection system are combined so that a highly advantageous synergetic effect is achieved, e.g. by the adversarial training strategy itself, and even more with the application of the STN module. Utilizing the STN layer (in other words an STN module comprising the STN layer), the model can automatically handle errors of coarse registration (i.e. any translation and/or rotation in the coarsely registered pair).
In the following the training of ChangeGAN is described (i.e. an embodiment of the training method according to the invention is illustrated). A competitive generator - discriminator-based training was implemented for the ChangeGAN network.
The generator network is responsible for learning and predicting the changes between the range image pairs. In each training epoch, the generator model is trained on a batch of data. The number of epochs is a hyperparameter that defines the number of times that the training (learning) algorithm will work through the entire training dataset. The actual state of the generator is used to predict validation data, which is fed to the discriminator model.
The discriminator network is preferably a fully convolutional network that classifies the output of the generator network. The discriminator model preferably divides the image into patches and decides for each patch whether the predicted change region is real or fake. During training, the discriminator network forces the generator model to create better and better change predictions, until the discriminator cannot decide about the genuineness of the prediction.
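A minimal sketch of such a fully convolutional, patch-wise discriminator is given below (illustrative only; channel counts, the number of layers and all names are assumptions): it receives the range image pair together with a change mask pair and outputs one "real or fake" logit per image patch.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=4):       # I1, I2 and the two change mask channels
        super().__init__()
        layers, ch = [], in_channels
        for out_ch in (64, 128, 256):
            layers += [nn.Conv2d(ch, out_ch, kernel_size=4, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            ch = out_ch
        layers += [nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1)]   # one logit per patch
        self.net = nn.Sequential(*layers)

    def forward(self, range_pair, change_masks):
        # the decision is made with knowledge of the content of the 3D information image pair
        return self.net(torch.cat([range_pair, change_masks], dim=1))

disc = PatchDiscriminator()
scores = disc(torch.zeros(1, 2, 128, 1024), torch.zeros(1, 2, 128, 1024))  # patch-wise logit map
```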
Fig. 3A demonstrates the proposed adversarial training strategy in an embodiment, i.e. of the ChangeGAN architecture. We preferably calculate the L1 Loss (abbreviated LL1; more generally generator loss contribution 225 in Fig. 3A) as the mean absolute error between the generated change image and the target change image (generally, data blocks), and we define the GAN Loss (abbreviated LGAN; it may also be called adversarial loss, more generally discriminator loss contribution 230 in Fig. 3A), which is preferably a sigmoid cross-entropy loss of the feature map generated by the discriminator and an array of ones. The final loss function of the generator (abbreviated Lgen; in other name combined loss 235) is the weighted combination of the GAN Loss and the L1 Loss: Lgen = LGAN + λ·LL1. Based on our experiments we set λ = 300 in an example.
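A minimal sketch of this combined generator loss (Lgen = LGAN + λ·LL1) is given below, assuming that the discriminator outputs raw logits; λ = 300 is the example value given above, and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def generator_loss(disc_logits_on_fake, generated_masks, target_masks, lam=300.0):
    # adversarial part: the generator wants the discriminator to answer "real" (an array of ones)
    gan_loss = F.binary_cross_entropy_with_logits(
        disc_logits_on_fake, torch.ones_like(disc_logits_on_fake))
    # L1 part: mean absolute error between the generated and the target change images
    l1_loss = torch.mean(torch.abs(generated_masks - target_masks))
    return gan_loss + lam * l1_loss
```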
Fig. 3A furthermore illustrates in the schematic flowchart the following as a part of an embodiment of the training method. A first 3D information data block 200a and a second 3D information data block 200b are denoted by respective blocks labelled Input1 and Input2. There is also a target change data block 205 (labelled "Target"; as mentioned below, this is preferably an illustration of a target change data block pair). The first 3D information data block 200a and the second 3D information data block 200b are the inputs of a change detection generator module 210 (labelled "Generator network", since it is realized by a neural network module in this embodiment). Accordingly, at the output of the generator it is checked how good the generated change is compared to the target.
The change detection generator module 210 has a change data block 215 (labelled "Generated img (image)"; this is preferably an illustration of a change data block pair) as an output. The change data block 215 as well as the target change data block 205 are processed to obtain the L1 loss 225.
Furthermore, the change data block 215 and the target change data block 205 are given also to the discriminator module 220 (labelled by “Discriminator network”, since it is realized by a neural network module in this embodiment), just like the first 3D information data block 200a and the second 3D information data block 200b. All of these inputs are processed by the discriminator module 220 (see also below) so as to obtain the combined loss 235. As it is also illustrated in Fig. 3A the combined loss 235 is fed back to the change detection generator module 210 preferably as a gradient update.
As also mentioned above the discriminator loss contribution is generated based on coarsely registered pairs, as well as corresponding target change data blocks and change data blocks. Naturally, in a training process, the change data block just generated by the generator module has a special role, since the other inputs of the discriminator module are parts of the ground truth. According to its general role, the discriminator makes a “real or fake” decision on the change/target change data blocks (images).
According to the adversarial training strategy, the generator and the discriminator "compete" with each other. The goal of the generator is to generate better and better results for which the discriminator can be "persuaded" that it is not a generated result ("fake") but a "real" one. On the contrary, the discriminator learns to recognize the generated images better and better during the learning process. Thus, by the help of their common learning process it can be achieved that the generator generates high-quality change data blocks (images). After the training (learning) process, only the generator will play a role (it will generate change data block(s) for an arbitrary input pair); the discriminator has a role only during the learning process.
Fitting to the above goals, the training (learning) process of the discriminator is performed in the present case as follows. From the point of view of the discriminator a target change data block is "real" and a change data block (generated by the generator) is "fake". During the learning process of the discriminator, the coarsely registered pair (e.g. the input range images, but this can also be made operable if the inputs are point clouds: the modules can also be constructed so as to handle this type of input) is accompanied by a target change image and a generated change image on the input of the discriminator module. The discriminator preferably has separate inputs for a target change image and a coarsely registered pair, and for a (generated) change image and the coarsely registered pair (the same as given along with the target image; the target and generated change images also correspond to each other; these corresponding inputs constitute an input set for the discriminator). The discriminator will judge both inputs. Accordingly, the discriminator decides about the target or generated change image having knowledge about the content of the 3D information image pair.
For these inputs, the discriminator generates outputs, preferably on a pixel-to-pixel basis, indicating whether, according to the judgement of the discriminator, an image part is real or fake. For the whole image, the discriminator will thus give a map illustrating the distribution of its decisions.
As applied in the invention, a discriminator loss is generated based on the outputs (corresponding to the two inputs mentioned above) of the discriminator, which is used to train the discriminator itself as well as - as a part of the combined loss mentioned above - the generator. During the training method (i.e. the learning process), both the generator and the discriminator are trained continuously, but after training only the generator will be used (as mentioned above, for generating change images for an input pair) and the trained discriminator is not utilized.
Based on the outputs of the discriminator, the discriminator loss is preferably calculated as follows. For obtaining the contribution of an output pair to the discriminator loss, it is known what type of input has been given to the discriminator.
In case of a generated change image, the good result is when the image is decided to be "fake"; this is e.g. denoted by a '0' (and a "real" pixel is denoted by a '1'; the discriminator preferably makes a binary decision for each pixel, which is a kind of classification task) in the pixels of the respective output of the discriminator. This would be the ideal result of the discriminator. However, the discriminator is preferably able to judge on a pixel-to-pixel basis, i.e. it can give a result for each pixel. Moreover, the discriminator preferably gives a number from the range [0, 1], in which case it can also give a probability for the judgement for each pixel (the output in the [0, 1] range can be cut at 0.5, and thus binary results can be reached, sorting every result under 0.5 to 0 and every result equal to or larger than 0.5 to 1). The case is typically not ideal during the learning process; therefore, the result coming from the discriminator will be diversified.
In order to determine the discriminator loss contribution of that output (which was assumed to correspond to a generated change image), a correlation-like function (e.g. sigmoid cross-entropy) is applied to the output of the discriminator and an image with the same dimensions being full of '0' (this is similar also in the case below).
In case of a target change image, it should be judged as 'real', with an ideal result full of '1'. Accordingly, for this type of input of the discriminator, the correlation-like function is applied to the output and a same-dimensioned image full of '1'.
The discriminator loss thus guides the generator to generate images closer to the ground truth, because it will only accept them as 'real' if it gets the ground truth itself, so if an incoming generated change image resembles the ground truth in as many details as possible, we are doing well.
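A minimal sketch of the discriminator loss built from the two outputs of one input set is given below (illustrative only, assuming raw logits): the output obtained for the target change image is compared against an all-'1' map and the output obtained for the generated change image against an all-'0' map, using a sigmoid cross-entropy.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc_logits_on_real, disc_logits_on_fake):
    real_loss = F.binary_cross_entropy_with_logits(
        disc_logits_on_real, torch.ones_like(disc_logits_on_real))    # target change image -> '1'
    fake_loss = F.binary_cross_entropy_with_logits(
        disc_logits_on_fake, torch.zeros_like(disc_logits_on_fake))   # generated change image -> '0'
    return real_loss + fake_loss
```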
Based on the spatial structure presented by the inputs bearing 3D information, the discriminator examines the imperfect result of the generator and can judge - not only at the mask level but also considering the 3D information of the input pair - what level of learning the generator is at, and can give an answer at the pixel level (relevance plays a big role here) as to where the result is good and where it is not (it can accept a roughly good result on irrelevant parts, but expects more on a relevant part to judge that part positively).
Accordingly, very good quality is assured on the output of the generator if it passes the test of the discriminator relative to the target, which is also analysed against the inputs by the discriminator.
The discriminator loss is built up from such loss contributions (generated using the target change data block as well as the (generated) change data block, i.e. the discriminator loss receives two contributions corresponding to the two outputs of the discriminator for each input set thereof), and after an epoch the discriminator will be trained by this discriminator loss, and it is also taken into the combined loss to train the generator.
Accordingly, a discriminator loss contribution is preferably generated by applying the discriminator module on a plurality of a corresponding
- first type combination of a change data block and a coarsely registered pair of the training set, and
- second type combination of a target change data block of the training set and the coarsely registered pair (being the same as in the first type combination) of the training set.
According to the above these latter are the content of a single input set for the discriminator and a plurality of such single input sets are applied during the training.
In an exemplary test both the generator and the discriminator part of the GAN architecture were optimized by the Adam optimizer (Diederik P. Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization, arXiv:1412.6980) and the learning rate was set to 10⁻⁵ (the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function; Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. Cambridge: MIT Press, p. 247). In this test we have trained the model for 300 epochs, which takes almost two days. At each training epoch, we have updated the weights of both the generator and the discriminator.
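For illustration, a hypothetical training-loop skeleton matching the above description is given below (Adam for both networks, learning rate 10⁻⁵, both sets of weights updated in every epoch); the generator, the discriminator, the two loss functions and the data loader are assumed to exist (cf. the sketches above), and everything else is an assumption.

```python
import torch

def train(generator, discriminator, loader, generator_loss, discriminator_loss, epochs=300):
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-5)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-5)
    for epoch in range(epochs):
        for range_pair, target_masks in loader:          # coarsely registered pair + ground truth masks
            fake_masks = generator(range_pair)
            # discriminator update: target masks should be judged "real", generated masks "fake"
            d_loss = discriminator_loss(discriminator(range_pair, target_masks),
                                        discriminator(range_pair, fake_masks.detach()))
            d_opt.zero_grad()
            d_loss.backward()
            d_opt.step()
            # generator update: combined adversarial + L1 loss fed back as a gradient update
            g_loss = generator_loss(discriminator(range_pair, fake_masks),
                                    fake_masks, target_masks)
            g_opt.zero_grad()
            g_loss.backward()
            g_opt.step()
```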
We note here, that the ChangeGAN method (i.e. here the training method according to the invention) can be trained without the adversarial loss (GAN loss), relying only on L1 loss. In our preliminary experiments, we followed this simpler approach, which was able to predict some change regions, but the results were notably ambiguous. To increase the generalization ability, we applied the adversarial training strategy in the proposed final model.
In the following, the change detection dataset is described (the training set generated by the training set generating method according to the invention). This is in connection with the training set generating method to which some of the embodiments of the invention relate. Although the training set generating method is on the same generality level as the training method (it also relates to coarsely registered pairs, and the training set generating method is for generating a plurality of coarsely registered pairs of a first 3D information data block and a second 3D information data block and a plurality of respective target change data blocks), it is mainly described in the following illustrated on the example of point clouds as 3D information data blocks (it is summarized also on the most general level).
Considering that the main purpose of the presented ChangeGAN method is to extract changes from coarsely registered point clouds (generally, coarsely registered pairs of 3D information blocks), for model training and evaluation we need a large, annotated set of point cloud pairs collected in the same area with various spatial offsets and rotation differences (which data is thus coarsely registered).
Following our change definition in the above detailing of the proposed method, the annotation should accurately mark the point cloud regions of objects or scene segments that appear only in the first frame, only in the second frame, or which ones are unchanged and thus observable in both frames (see Figs. 4A-4E and 6A-6H).
Since the available point cloud benchmark sets cannot be used for this purpose (no database of coarsely registered point clouds is available), we have created a new Lidar-based urban dataset called Change3D (see Dataset link: http://mplab.sztaki.hu/geocomp/Change3D.html; see below the details of an embodiment). Our measurements were recorded in the downtown of Budapest, Hungary, on two different days by driving a car with a Velodyne HDL-64 rotating multi-beam Lidar attached to its roof. To our best knowledge, this Change3D dataset is the largest point cloud dataset for change detection, which contains both registered and coarsely registered point cloud pairs.
1) Ground truth creation approach: Since manual annotation of changes between 3D point clouds is very challenging and time-consuming, we propose a semi-automatic method using simulated registration errors to create ground truth (GT) for our change detection approach. Performing the training method according to the invention on the change detection system, our tests show that it becomes effective in identifying changes that occur in real life. To ensure the accuracy of the GT, we performed the change labelling for registered point cloud pairs captured from the same sensor position and orientation, then we randomly transformed the reference positions and orientations of the second frames (any/both frames can be transformed, see below for the details), yielding a large set of accurately labelled coarsely registered point cloud pairs. Thereafter, this set has been divided into disjunct training and test sets which could be used to train and quantitatively evaluate the proposed method.
The remaining parts of the collected data including originally unregistered point cloud pairs have been used for qualitative analysis through visual validation (see for example Figs. 4A-4E) of the model performance (see below where data pairing based on GPS position measurement data is disclosed).
2) Core data creation for GT annotation: In our applied procedure (i.e. the applied numbers are concrete numbers of an exemplary procedure), we selected 50 different locations during the test drive when the measurement platform was motionless for a period: it was stopped by traffic lights, crossroads, zebra crossings, parking situations, etc. These locations were taken both from narrow streets of the downtown and from wide, large junctions as well. At each location, we took 10 recorded point clouds, and then we randomly selected 400 point cloud pairs among them, obtaining for the 50 locations a total number of 20,000 (twenty thousand) point cloud pairs on which the training set was based. The test set is based on 2,000 (two thousand) point cloud pairs, which were selected similarly, but in terms of locations and recording time stamps, the test samples were completely separated from the training data (i.e. these two were separated in the data).
In these recordings, the differences among the point clouds were only caused by the moving dynamic objects such as vehicles and pedestrians. Alongside the exploitation of real object motion and occlusion effects, some further artificial changes have been preferably synthesized by manually adding and deleting various street furniture elements to selected point cloud scenes. Also, we segmented the point clouds roughly to planes (see A. Börcs, B. Nagy, and C. Benedek, "Fast 3-D urban object detection on streaming point clouds," in ECCV 2015 Workshops, LNCS, 2015.), and randomly deleted some selected 2D rectangular segments.
3) Semi-automatic change extraction: Since the above-discussed frame pairs are taken in the same global coordinate system, they can be considered as registered. Their ground truth (GT) change annotation can be efficiently created in a semi-automatic way: A high-resolution 3D voxel map was built on a given pair of point clouds. The voxel size defines the resolution of the change annotation. The length of the change annotation cube (voxel) was set to 0.1 m in all three dimensions. All voxels were marked as changed if 90% of the 3D points in the given voxel belonged to only one of the point clouds. Thereafter minor observable errors were manually eliminated by a user-friendly point cloud annotation tool. Finally, in both point clouds, all points belonging to changed voxels received a μGT(p) = ch GT label, while the remaining points were assigned μGT(p) = bg labels.
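For illustration, a minimal numpy sketch of this voxel-based labelling on a registered point cloud pair is given below; the 90% criterion is expressed here as a simple dominance ratio per voxel, the manual correction step is not reproduced, and all names are hypothetical.

```python
from collections import Counter
import numpy as np

def voxel_change_labels(p1, p2, voxel_size=0.1, ratio_limit=0.9):
    """p1, p2: (N, 3) and (M, 3) registered point clouds; returns boolean change masks for both."""
    pts = np.vstack([p1, p2])
    origin = pts.min(axis=0)
    keys = [tuple(k) for k in np.floor((pts - origin) / voxel_size).astype(np.int64)]
    keys1, keys2 = keys[:len(p1)], keys[len(p1):]
    c1, c2 = Counter(keys1), Counter(keys2)

    def voxel_changed(voxel):
        n1, n2 = c1.get(voxel, 0), c2.get(voxel, 0)
        return max(n1, n2) / (n1 + n2) >= ratio_limit    # voxel dominated by only one of the clouds

    mask1 = np.array([voxel_changed(k) for k in keys1])  # change labels for the points of P1
    mask2 = np.array([voxel_changed(k) for k in keys2])  # change labels for the points of P2
    return mask1, mask2
```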
4) Registration offset: To simulate the coarsely registered point cloud pairs required by our ChangeGAN approach, we have randomly applied an up to ±1 m translation and an up to ±10° rotation transform around the z-axis to the second frame (P2; here, the second frame was chosen) of each point cloud pair, in this example both in the training and test datasets. The μGT(p) GT labels remained attached to the p ∈ P2 points and were transformed together with them (this would also hold true when both of the frames were transformed, but preferably the resultant or sum of the transformations does not fall outside the maximum range given above).
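A minimal sketch of this simulated registration offset is given below (illustrative only): a random rotation of up to ±10° around the z-axis and a random translation of up to ±1 m are applied to the second frame, the per-point GT labels staying attached to the points; the translation is assumed here to lie in the horizontal plane.

```python
import numpy as np

def apply_registration_offset(points, max_shift=1.0, max_rot_deg=10.0, rng=np.random):
    """points: (N, 3) point cloud of the second frame; returns the transformed copy."""
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(angle), np.sin(angle)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])      # rotation around the z-axis
    shift = rng.uniform(-max_shift, max_shift, size=3)
    shift[2] = 0.0                           # offset assumed to be horizontal in this sketch
    return points @ rot_z.T + shift          # point order is preserved, so the GT labels stay attached
```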
5) Cloud crop and normalization: In the next step, all 3D points were preferably removed from the point clouds whose horizontal distances from the sensor were larger than 40 m, or whose elevation values were greater than 5 m above the ground level. This step yielded the capability of normalizing the point distances from the sensor between 0 and 1.
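A minimal sketch of this crop and normalization step is given below (illustrative only); the normalization by the maximum remaining distance is one possible choice and an assumption of this sketch.

```python
import numpy as np

def crop_and_normalize(points, max_horizontal=40.0, max_height=5.0, ground_z=0.0):
    """points: (N, 3) array in the sensor frame; returns the cropped points and normalized ranges."""
    horizontal = np.linalg.norm(points[:, :2], axis=1)
    keep = (horizontal <= max_horizontal) & (points[:, 2] - ground_z <= max_height)
    cropped = points[keep]
    ranges = np.linalg.norm(cropped, axis=1)
    return cropped, ranges / ranges.max()    # point distances normalized between 0 and 1
```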
6) Range image creation and change map projection: The transformed 3D point clouds were projected to 2D range images I1 and I2 as described in connection with the range image representation above (see Figs. 1A-1B). The Lidar's horizontal 360° field of view was mapped to 1024 pixels and the 5 m vertical height of the cropped point cloud was mapped to 128 pixels, yielding that the size of the produced range image is 1024 x 128. We note here that our measurements were recorded at 20 Hz, where the angular resolution is around 0.3456°, which means we get 1042 points per channel per revolution. In general, for more efficient training we considered that the dimension of the training data should be a power of two. So we removed 18 points with an equal step size from each channel. Since the removed points are in fixed positions, we know the exact mapping between the 2D and 3D domain.
We also note here that the Lidar sensor used in this experiment has 64 emitters, yielding that the height of the original range images should be 64. However, to increase the learning capacity of the network we have doubled and interpolated the data along the height dimension, since the 2D convolutional layers with a stride of 2 have a 2-factor down-sampling effect. Let us observe that the horizons of the range images are at similar positions in the two inputs due to the cropped height of the input point clouds. Besides the range values, the μGT(p) ground truth labels of the points were also projected to the Λ1GT and Λ2GT change masks, used for reference during training and evaluation of the proposed network.
Thus, the above can be summarized as follows. For model training and evaluation, we need a large, annotated set of point cloud pairs collected in the same area with various spatial offsets and rotation differences (see the details below on how the annotation of these can be reached, as they cannot be directly annotated: we work with point clouds fixed in the same place and preferably induce the differences manually). The annotation should accurately mark the point cloud regions of objects or scene segments that appear only in the first frame, only in the second frame, or which ones are unchanged and thus observable in both frames.
Since manual annotation of changes between 3D point clouds is very challenging and time-consuming, a semi-automatic method using simulated registration errors to create ground truth (GT) is proposed for (i.e. applicable for) the change detection method according to the invention. See the following aspects in this respect.
As a relevant and highly advantageous aspect, to ensure the accuracy of the GT, the change labelling is performed for registered point cloud pairs (generally, registered 3D information data block pairs, which are called registered base pairs) captured from the same sensor position and orientation at different times (e.g. captured from a car standing at the same place, which is why the point clouds will be registered). Since production of the GT starts from registered point cloud pairs (in general, 3D information data block pairs), it is naturally known where the changes are in the images.
1. In these recordings, the differences among the point clouds were only caused by the moving dynamic objects such as vehicles and pedestrians (i.e. by objects changing or changing their places between the time instances at which the respective point clouds were recorded).
2. Alongside the exploitation of real object motion and occlusion effects, some further artificial changes are preferably synthesized by manually adding and deleting e.g. various street furniture elements to selected point cloud scenes (see also below).
3. Also, we segmented the point clouds roughly to planes (see A. Börcs, B. Nagy, and C. Benedek, "Fast 3-D urban object detection on streaming point clouds," in ECCV 2015 Workshops, LNCS, 2015.), and randomly deleted some selected 2D rectangular segments. So here, we have preferably deleted parts of planes (e.g. a phone box or a facade disappears). This way, the change detection system will not only be able to handle changes such as a vehicle passing by, but also, for example, cases where a building has been demolished or street furniture has been removed: deleting parts of the detected planes simulates the removal of a building facade, etc.
4. As mentioned above, the change annotation is performed on registered point clouds:
a. The frame pairs are taken in the same global coordinate system, so they can be considered as registered.
b. Preferably, their ground truth (GT) change annotation can be efficiently created in a semi-automatic way:
i. A high-resolution 3D voxel map is built on a given pair of point clouds (e.g. voxels with 10 cm edges for a general scene; with these preferably cubic, matching voxels, we cover the scene so that all points of the point cloud are contained within a voxel and a plurality of points may be contained in a voxel).
ii. All voxels were marked as changed if 90% of the 3D points in the given voxel belonged to only one of the point clouds (i.e., preferably, the voxels themselves are marked).
5. As another relevant aspect, the reference positions and orientations of e.g. the second frames are then randomly transformed (any of the frames may be transformed, or both of them), yielding a large set of accurately labelled coarsely registered point cloud pairs (the pairs were registered so far; the transformation is performed so as to achieve coarsely registered pairs). In an example:
a. Randomly an up to ±1 m translation and an up to ±10° rotation transform around the z-axis has been applied to the second frame (P2) of each point cloud pair, both in the training and test datasets.
b. As a general aspect, it is noted that the GT labels remained attached to the p ∈ P2 points and were transformed together with them (i.e. the annotation advantageously remains valid also after applying the transformation).
6. Preferably, cloud crop and normalization steps are also performed:
a. In the next step, all 3D points were removed from the point clouds whose horizontal distances from the sensor were larger than 40 m, or whose elevation values were greater than 5 m above the ground level.
b. This step yielded the capability of normalizing the point distances from the sensor between 0 and 1 (these values are preferred for e.g. neural networks).
7. After that, range image creation and change map projection is performed:
a. The transformed 3D point clouds were projected to 2D range images I1 and I2. By the help of a transformation, it is determined for a 3D point of the point cloud where it is projected. The change values (whether there is a change at a point or not) belong to each point, so a change data block (preferably a change mask image) can also be created with the projection. In the projection, the voxel map is not considered (the projection is not based on it).
b. In an example, the Lidar's horizontal 360° field of view was mapped to 1024 pixels and the 5 m vertical height of the cropped point cloud was mapped to 128 pixels, yielding that the size of the produced range image is 1024 x 128.
c. To increase the learning capacity of the network we have doubled and interpolated the data along the height dimension, since the 2D convolutional layers with a stride of 2 have a 2-factor down-sampling effect (we have 64 lasers, so our image size would be 1024 x 64, which is too small for a series of convolutions; so we scaled it up to double).
8. Thereafter, the generated GT set has been preferably divided into disjunct training and test sets which could be used to train and quantitatively evaluate the proposed method.
9. The remaining parts of the collected data, including originally unregistered point cloud pairs, have been used for qualitative analysis through visual validation. These are based on real measurements taken e.g. in a city. We paired them afterwards based on nearby GPS position measurement data, so they are not recorded as pairs that could be considered registered pairs. We could only check these visually, and the trained system performed very well on them.
As a result, we got a dataset (training set) containing corresponding coarsely registered point cloud pairs with annotated changes on them.
In accordance with the above, some embodiments of the invention relate to a training set (mentioned also as dataset and it may also be called training database) generating method for generating a plurality of coarsely registered pairs of a first 3D information data block and a second 3D information data block and a plurality of respective target change data blocks (in this case preferably point clouds) of a training set for applying in any embodiments of the training method according to the invention. In the following, reference is made to Fig. 3B which shows a flowchart of the main steps of an embodiment of the training set generating method according to the invention.
In the course of the training set generating method
- a plurality of registered base pairs (see registered base pairs 300 in Fig. 3B) having (alternatively, of) a first base 3D information data block and a second base 3D information data block is generated or provided (it may be - readily - available e.g. from some database, but these can be generated also as given in the example detailed above; as a result of these options, we can start this method with a plurality of registered base pairs),
- change annotation is performed in a change annotation step (see operational step S310 in Fig. 3B) on the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs,
- by transforming in a transformation step (see operational step S320 in Fig. 3B) at least one of the first base 3D information data block and the second base 3D information data block (as a first option, only one of them is transformed, but even both of them may be transformed, but it is irrelevant which is transformed) of each of the plurality of registered base pairs, a first resultant 3D information data block and a second resultant 3D information data block are generated for each of the plurality of registered base pairs (resultant 3D information data blocks may have another name, like e.g. intermediate 3D information data blocks), and
- in a training data generation step (see operational step S330 in Fig. 3B) the plurality of coarsely registered pairs and the plurality of respective target change data blocks of the training set are generated based on a respective first resultant 3D information data block and second resultant 3D information data block.
The order of the steps is fixed also in this case, since inputs and outputs are used in the consecutive steps: change annotation is made on registered base pairs, after that transforming is applied onto at least one of them to generate the resultant 3D information data blocks, based on which the content of the training set is generated.
In an embodiment - in line with options in points 2 and 3 of the list above - one or more artificial change is applied before the change annotation step by addition or deletion to any of the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs. Artificial changes are applied on the base 3D information data blocks (e.g. point clouds), so these modified data blocks are forwarded to the change annotation step afterwards.
In an embodiment (see point 4 b.ii. of the list above, especially for the further specialized variation of this embodiment), the plurality of registered base pairs of a first base 3D information data block and a second base 3D information data block are constituted by a (respective) plurality of registered base point cloud pairs of a first base point cloud having a plurality of first points and (a respective plurality of) a second base point cloud having a plurality of second points (i.e. in this embodiment point clouds are utilized), and in the change annotation step
- a 3D voxel grid (map) having a plurality of voxels is applied on the first basis point cloud and the second basis point cloud (i.e. the 3D voxel grid is applied to the union of the two point clouds, more specifically to the space part in which the first and second point clouds are arranged (situated)),
- for each voxel (i.e. considering each voxel one by one) assigning change to every first point comprised in a voxel and to every second point comprised in that voxel (i.e. to the same voxel which is just under investigation) in which o a first voxel point ratio of those first points of the first base point cloud not having a corresponding second point in the second base point cloud (correspondence can be easily checked, since at this point the point clouds are registered) and a first total number of first points is above a predetermined first ratio limit, and o a second voxel point ratio of those second points of the second base point cloud not having a corresponding first point in the first base point cloud and a second total number of second points is above a predetermined second ratio limit, respectively, and
- a first target change data block and a second target change data block are generated based on assigned change for first points and second points, respectively (since a change label is assigned to all points where it is judged that there is a change, a target change data block can be easily produced based on the change assignment information).
Furthermore, preferably, the predetermined first ratio limit and the predetermined second ratio limit are both 0.9 (both of them are set to this value).
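As an illustration of the change annotation step described above, a minimal sketch is given below (Python with NumPy and SciPy is used purely for illustration). The nearest-neighbour distance test used for deciding point correspondence, the parameter names, and the reading that the first and second ratio limits control the change labels of the first and second points, respectively, are assumptions of the sketch and not limitations of the method.

```python
import numpy as np
from scipy.spatial import cKDTree

def annotate_changes(p1, p2, voxel_size=0.1, ratio_limit=0.9, corr_dist=0.05):
    """Voxel-based change annotation for a *registered* point cloud pair.

    p1, p2: (N, 3) and (M, 3) float arrays. A point is considered to have a
    correspondent in the other cloud if its nearest neighbour there is closer
    than corr_dist (this distance test is an illustrative assumption).
    Returns boolean change labels for the points of p1 and p2.
    """
    # correspondence test against the other (registered) cloud
    has_corr_1 = cKDTree(p2).query(p1)[0] < corr_dist
    has_corr_2 = cKDTree(p1).query(p2)[0] < corr_dist

    # common voxel grid over the space part occupied by the two clouds
    origin = np.minimum(p1.min(axis=0), p2.min(axis=0))
    v1 = np.floor((p1 - origin) / voxel_size).astype(np.int64)
    v2 = np.floor((p2 - origin) / voxel_size).astype(np.int64)

    def group_by_voxel(voxel_ids):
        groups = {}
        for i, key in enumerate(map(tuple, voxel_ids)):
            groups.setdefault(key, []).append(i)
        return groups

    change_1 = np.zeros(len(p1), dtype=bool)
    change_2 = np.zeros(len(p2), dtype=bool)

    # a voxel's points are labelled as change if the ratio of points without
    # a correspondent to the total number of points in that voxel exceeds
    # the ratio limit (preferably 0.9, i.e. 90%)
    for idx in group_by_voxel(v1).values():
        idx = np.asarray(idx)
        if np.mean(~has_corr_1[idx]) > ratio_limit:
            change_1[idx] = True
    for idx in group_by_voxel(v2).values():
        idx = np.asarray(idx)
        if np.mean(~has_corr_2[idx]) > ratio_limit:
            change_2[idx] = True
    return change_1, change_2
```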
According to the above, a common voxel grid (voxel map) is built (assigned, generated) for the two point clouds. We may drop '3D' from the name of the voxel grid, but this attribute shows that the voxel grid is built not in a plane but in space. It is noted that voxels need not be allocated for those space parts in which there are no points of the point clouds (so it is not necessary to have a voxel grid covering a full rectangle-based cuboid).
Accordingly, in the step of change assignment, it is given how it is judged that all points of a voxel get a change label (i.e. it is not given when these receive a non-change label, since the points of a voxel get the non-change label if the change label has not been assigned to them). In other words: the points of those voxels get the change label in which there are too many points which have no correspondent in the respective voxel of the other point cloud (since this is analysed before the transformation is done, the correspondent points in the point cloud pair can be easily counted, and no-correspondence is found at those positions into which a change has been induced by dynamic changes or by hand, see points 1-3 of the list above).
The judgement is always based on the ratio:
- when there is a large number of points in a voxel, it is easy to judge whether the ratio is reached or not;
- when there is a small number of points in a voxel, the ratio can still be reached, but it is easy to fall below it, because even a few correspondences between the points are enough to stay under the ratio limit;
- if there are no points in a voxel, then there will be no change (there are no points to be labelled by a change label).
With an appropriately sized voxel (e.g. 0.1 m for a general scene, as specified above), a scene will be partitioned correctly, i.e. when moving from a non-change area to a change area, a change will not be indicated too early, because the points contributed to the voxel by the two point clouds will largely match. When a change area is reached, the change areas are derived with a resolution comparable to the voxel size, i.e. not point by point, but much more efficiently (with a precision that is well within the required accuracy of the result, given that a voxel size of e.g. 0.1 m is preferably chosen for a street scene), determining which volume parts contain change and which do not.
Where there will be change, it should be strongly different anyway, since the threshold is preferably set to 90% (this corresponds to 0.9 for the ratio limit). This way, the voxels will have the right resolution at the transition to the change areas, and the volume parts affected by the real change can be well identified. Depending on the ratio limit (which is therefore preferably 0.9, i.e. 90%), if the result for a given voxel is above the ratio limit, all points in the voxel are classified as belonging to change, while if it is below it, all points in the voxel are classified as not belonging to change.
It is clear that if the voxel size were too small, or if we treated the points individually, the result would be too complex and confusing, and we would expect to see many isolated changes, which would be lost in a sea of non-change points.
If the voxel size were too large, the mapping would probably be very "step-like" and "low resolution-like"; on the other hand, there would certainly be only a few changes, because large voxels are more likely to contain similarities, so it is easy to miss - i.e. to remain under - the threshold (e.g. 90%).
In an embodiment of the method for generating a training set, in the transformation step randomly an up to ±1m translation (preferably in a plane perpendicular to the z-axis) and/or an up to ±10° rotation transform (preferably around a z-axis, see also below in connection with the z-axis) is applied. These are the limits which can be preferably applied in those preferred cases when the STN module is applied. Accordingly, these values give a preferred restriction to the concept of coarsely registered pair.
It is noted that the above-mentioned transformations are preferably connected to the z-axis, which can be handled most effectively by the preferably applied STN module. Other types of transformations might be more difficult to handle.
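A possible sketch of such a random transformation step is the following (the uniform sampling of the translation and rotation parameters and the function name are illustrative assumptions):

```python
import numpy as np

def apply_coarse_misregistration(points, max_shift=1.0, max_rot_deg=10.0, rng=None):
    """Apply a random translation of up to ±1 m in the x-y plane and a random
    rotation of up to ±10° around the z-axis to one cloud of a registered
    pair, yielding a coarsely registered pair (transformation step S320)."""
    rng = np.random.default_rng() if rng is None else rng
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    alpha = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    c, s = np.cos(alpha), np.sin(alpha)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return points @ rot_z.T + np.array([tx, ty, 0.0])
```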
In the following, the performed experiments are described. We have trained and evaluated the proposed method using the new Change3D dataset (see the above details of the change detection dataset), which contains point cloud pairs recorded by a car-mounted RMB Lidar sensor at different times in dense city environments. For a selected coarsely registered point cloud pair, Figs. 4A-4E show the changes predicted by the proposed ChangeGAN model, i.e. results obtained by (an embodiment of) the trained change detection system according to the invention (there are also such results among Figs. 5A-5D and Figs. 6A-6H which were obtained by this system).
The reference methods are the following. To the best of our knowledge, no reference method can be found in the literature focusing on change detection in coarsely registered terrestrial point clouds. However, since we reformulated in an embodiment the 3D change detection problem in the 2D range image domain, image-based methods tolerant of registration errors can also be taken into consideration for comparison.
As the first baseline, we have chosen the ChangeNet method (see A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, "Changenet: A deep learning architecture for visual change detection," in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145.), which is a recent approach for visual change detection, being able to detect and localize changes even if the scene has been captured at different lighting, view angle, and seasonal conditions. ChangeNet uses a ResNet (K. He et al.: "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778) backbone, working with fixed-size input images (224 x 224). Our exemplary range images could not be given directly to this network, since their resolution (1024 x 128) and aspect ratio parameters are different. This issue was solved by splitting our range images into eight 128 x 128 parts, which were upscaled to the image size required by ChangeNet. We used the original, published implementation of the ChangeNet architecture, which was trained using our training data set described above in connection with the change detection dataset.
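A sketch of this tiling step is given below; the use of SciPy for the upscaling and the bilinear interpolation order are assumptions of the illustration, not requirements of the experiment.

```python
import numpy as np
from scipy.ndimage import zoom

def tile_for_changenet(range_image, tile=128, target=224):
    """Split a 1024 x 128 range image (stored as a (128, 1024) array) into
    eight 128 x 128 tiles and upscale each tile to the 224 x 224 input size
    expected by the ChangeNet backbone."""
    w = range_image.shape[1]
    tiles = []
    for x in range(0, w, tile):
        patch = range_image[:, x:x + tile]
        tiles.append(zoom(patch, target / tile, order=1))  # bilinear upscaling
    return tiles  # list of eight 224 x 224 images
```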
Our second reference method follows a voxel occupancy-based approach (see B. Galai and C. Benedek, "Change detection in urban streets by a real time Lidar scanner and MLS reference data," in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.), where the detection accuracy and the ability to compensate for minor registration errors depend on the chosen voxel resolution. As a core step of the algorithm, the above approach of B. Galai and C. Benedek applies a registration method between the point cloud pairs. For noise filtering and registration error elimination, a Markov Random Field (MRF) model is adopted which is defined in the range image domain in this approach. Comparative results of the proposed method and the reference techniques for the point cloud pair of (corresponding to) Figs. 1A-1B are shown in Figs. 5A-5D in range image representations. Accordingly, Figs. 5A-5D show (predicted) change masks by the different methods on input data shown in Figs. 1A-1B. More specifically, Fig. 5A shows the ground truth fused change map A_GT (this one is not a predicted mask), Fig. 5B shows the fused change map A of the ChangeGAN output, Fig. 5C shows the ChangeNet output, and Fig. 5D shows the MRF output. Rectangles 109 correspond to the region shown in Figs. 6A-6H.
The above mentioned fused change maps (i.e. fused change images) can be interpreted in the following way. Figs. 1A and 1B show relative masks, i.e. the change in view of another image. It is disclosed in connection with them how their content is determined. Compared to these, the fused change maps show every change, i.e. not only the relative changes: if there is a change in any of the images of the image pair, there is a change shown in the fused image. To interpret further, changes can be represented by masks. In a first change image of a pair a change can be denoted by '1' in a pixel; if there is no change, there is '0' in the pixel. Furthermore, in a second change image of the pair a change is denoted by '2' in the respective pixel (no change remains '0'). In a fused change image, change is denoted in all of those pixels which contain '1' in the first change image or '2' in the second change image.
It is noted here, that in the context of the invention, it can be chosen to define only one of the two changes, i.e. it is sufficient to determine a change for a single "direction". According to the above, in the fused change image, however, changes for both "directions" are defined, but their information content is mapped to a single image.
Investigating the ground truth in Fig. 5A (generally, a target change data block 260) and the result of ChangeGAN in Fig. 5B (a fused change image 280), it can be clearly seen that ChangeGAN gives a very good result.
It is very clear that ChangeNet does not yield good results (see reference image 290 in Fig. 5C). The rectangle 109 is highlighted, but the results are of inferior quality on the whole image 290. Reference image 295 in Fig. 5D illustrates that MRF yields many false positive detections, i.e. it "says" for too many parts that there is a change, which is not advantageous. Numerical results take these false positive detections into account (see also Table 1 below).
Since neither the ChangeNet nor the MRF methods can distinguish changes by objects of the first and second images, for a direct comparison, we also binarized the output of ChangeGAN to get a fused change map A where ∀s ∈ S: A(s) = max(A1(s), A2(s)). The fused GT mask A_GT was similarly derived.
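This binarization and fusion can be sketched as follows (the 0.5 threshold applies if the generator outputs soft masks; for already binary masks it has no effect and is only an assumption of the sketch):

```python
import numpy as np

def fuse_change_maps(a1, a2, threshold=0.5):
    """Fuse the two predicted change masks into a single binary map,
    A(s) = max(A1(s), A2(s)) for every pixel s."""
    b1 = (a1 > threshold).astype(np.uint8)
    b2 = (a2 > threshold).astype(np.uint8)
    return np.maximum(b1, b2)
```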
Herebelow, quantitative results are disclosed.
We evaluated the proposed ChangeGAN method (i.e. the training method according to the invention) and the two baseline techniques on our new Change3D benchmark set. The quantitative performance analysis was performed in the 2D range image domain, using the fused A_GT mask as a GT reference. To measure the similarity between the binary GT change mask and the binary change masks predicted by the different methods, the mean F1-score and Intersection over Union (IoU) were calculated alongside pixel-level precision, recall, and accuracy. The definitions of the used metrics follow standard binary classification metrics (see C. E. Metz, "Basic principles of roc analysis," Seminars in Nuclear Medicine, vol. 8, no. 4, pp. 283-298, 1978., https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient and David M. W. Powers: "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation", arXiv:2010.16061v1).
The numerical evaluation results obtained by the MRF method (i.e., the MRF-based reference approach, see B. Galai and C. Benedek, "Change detection in urban streets by a real time Lidar scanner and MLS reference data," in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.), the ChangeNet method (A. Varghese, J. Gubbi, A. Ramaswamy, and P. Balamuralidhar, "Changenet: A deep learning architecture for visual change detection," in ECCV 2018 Workshops, LNCS, 2019, pp. 129-145.), and the proposed ChangeGAN method over the 2,000 range image pairs of the test dataset are shown in Table 1, which is a performance comparison of these methods.
[Table 1 - performance comparison of the MRF, ChangeNet, and proposed ChangeGAN methods; the table is reproduced as an image (imgf000050_0001) in the original publication.]
About the parameters in Table 1 (TP - true positive, FP - false positive, TN - true negative, FN - false negative):
- accuracy is the proportion of correct predictions (both true positives and true negatives) to the total number of cases examined;
- precision shows how many of the marked results (i.e. of the marked changes) were correct; expression: precision=TP/(TP+FP);
- recall shows how many of the good results were found (i.e. how many of all the changes were marked); expression: recall=TP/(TP+FN);
- F1-score=2*precision*recall/(precision+recall), i.e. the harmonic average of precision and recall;
- IoU (intersection over union, i.e. IoU=intersection/union): the intersection (those changes which have been found) is divided by the union (i.e. the union of the marked changes and all of the changes; accordingly, in the union all marked changes - the marked solution - and all target changes - the ideal solution - are taken into account, which naturally overlap in those changes which have been found). A short computation sketch of these metrics is given after this list.
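The sketch below computes these metrics from a predicted and a ground truth binary change mask (the function name and the zero-division handling are illustrative assumptions):

```python
import numpy as np

def change_metrics(pred, gt):
    """Pixel-level metrics between predicted and ground truth binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # marked and real change
    fp = np.sum(pred & ~gt)     # marked but not a real change
    fn = np.sum(~pred & gt)     # real change that was not marked
    tn = np.sum(~pred & ~gt)    # correctly unmarked
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return dict(accuracy=accuracy, precision=precision,
                recall=recall, f1=f1, iou=iou)
```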
As demonstrated, the ChangeGAN method outperforms both reference methods in terms of these performance factors, including the F1-score and IoU values. The MRF method (B. Galai and C. Benedek, "Change detection in urban streets by a real time Lidar scanner and MLS reference data," in Int. Conf. Image Analysis and Recognition, LNCS, 2017, pp. 210-220.) is largely confused if the registration errors between the compared point clouds are significantly greater than the used voxel size. Such situations result in large numbers of falsely detected change-pixels, which yields a very low average precision (0.44), although due to several accidental matches, the recall rate might be relatively high (0.88) (see the definition of precision and recall above).
Investigating the numbers in Table 1, it is clear that the ChangeGAN method (i.e. the method according to the invention) has good results in many aspects (it has good accuracy and precision).
The measured low computational cost is a second strength of the proposed ChangeGAN approach, especially versus the MRF model, whose execution time is longer by one order of magnitude. Although ChangeNet is even faster than ChangeGAN, its performance is significantly weaker compared to the other two methods. Since the adversarial training strategy has a regularization effect (see P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," in NIPS 2016 Workshop on Adversarial Training, Dec 2016, Barcelona, Spain), and the STN layer can handle coarse registration errors, the proposed ChangeGAN model can achieve better generalization ability and it outperforms the reference models on the independent test set. Note that in each case of Table 1, running speed was measured in seconds on a PC with an i7-8700K CPU @3.7GHz x12, 32GB RAM, and a GeForce GTX 1080Ti.
As also touched upon above, before the GAN-based training idea, we experimented with a "generator only" training. In this experiment, the generator model was trained using only the L1 (mean absolute pixel difference) loss between the generated and the expected ground truth images. The model was able to predict some change regions; however, the results were ambiguous and blurry. Based on our experiments, the model had low generalization ability. To achieve better regularization (cf. P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic Segmentation using Adversarial Networks", NIPS Workshop on Adversarial Training, Dec 2016, Barcelona, Spain) on the relatively small dataset, the proposed ChangeGAN method was trained in an adversarial manner so that we backpropagated the adversarial loss too. While the L1 loss controls that the generated images will be similar to the target images in the L1 sense, the adversarial loss ensures that the predicted images will be within the target domain. In other words, the generated images should be so realistic that the discriminator should be fooled.
For ChangeGAN, by backpropagating the adversarial loss alongside the cross-entropy loss between the predicted and the ground truth masks, we can achieve a regularization effect, so the model can achieve better performance even on a smaller dataset.
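A minimal sketch of such a combined generator loss is given below (PyTorch notation; the weight of the adversarial term, the function names, and the assumption that the masks and the discriminator score are probabilities in [0, 1] are illustrative and not part of the claimed method):

```python
import torch
import torch.nn.functional as F

def generator_training_loss(pred_masks, gt_masks, disc_score_on_fake, adv_weight=0.01):
    """Per-pixel cross-entropy between predicted and ground truth change masks
    plus a weighted adversarial term that rewards fooling the discriminator."""
    pixel_loss = F.binary_cross_entropy(pred_masks, gt_masks)
    adv_loss = F.binary_cross_entropy(disc_score_on_fake,
                                      torch.ones_like(disc_score_on_fake))
    return pixel_loss + adv_weight * adv_loss
```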
Herebelow, qualitative results are disclosed. For qualitative analysis, we backprojected the 2D binary change masks to the corresponding 3D point clouds and visually inspected the quality of the proposed change detection approach. During the investigations, we have observed, for the remaining, originally unregistered point cloud pairs of the Change3D dataset, a performance similarly efficient to that measured on the point cloud set with simulated registration errors, which participated in the quantitative tests described above under quantitative results.
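The backprojection mentioned above can be sketched as follows, under the assumption that the range-image pixel position of every 3D point was stored when the range image was generated (an illustrative assumption; the projection itself is defined elsewhere in this description):

```python
import numpy as np

def backproject_change_mask(points, pixel_rows, pixel_cols, change_mask):
    """Transfer a 2D binary change mask back to the 3D points of one cloud."""
    labels = change_mask[pixel_rows, pixel_cols]     # one label per 3D point
    return points[labels == 1], points[labels == 0]  # changed, unchanged
```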
For reasons of scope, we can only present here short discussions for two sample scenes displayed in Figs. 4A-4E and 6A-6H.
In Figs. 4A-4E, changes detected by ChangeGAN for a coarsely registered point cloud pair are illustrated. Figs. 4A and 4B show the two input point clouds (generally, first and second 3D information data blocks 240a and 240b), while Fig. 4C displays the coarsely registered input point clouds in a common coordinate system in a 3D information data block 240. Fig. 4C simply illustrates the two inputs in a common figure.
In Fig. 4A (darker image) a bus can be observed on the right and two cars on the left. In Fig. 4B (lighter image), however, a tram and a car can be seen in the centre and on the right, but no cars are visible on the left. Figs. 4A-4B are unified in Fig. 4C, where the darker and lighter points of Figs. 4A-4B are observable.
Figs. 4D and 4E present the change detection results (i.e. a common change data block 245): originally blue and green coloured points (in greyscale: darker and lighter) represent the objects marked as changes in the first and second point clouds, respectively. Thus, the above mentioned cars on the left and the bus on the right of Fig. 4A are shown with a darker colour in the change data block 245, as well as the tram and the car of Fig. 4B are shown with a lighter colour therein.
Fig. 4E shows the change data block 245 from above, wherein ellipse 246 draws attention to the global alignment difference between the two coarsely registered point clouds.
Accordingly, Figs. 4A-4E contain a busy road scenario, where different moving vehicles appear in the two point clouds. As shown, moving objects both from the first (originally blue colour - i.e. darker colour in the vehicles - in the change data block 245) and second (originally green, i.e. lighter colour in the vehicles) frames (i.e. Figs. 4A-4B, respectively), are accurately detected despite the large global registration errors between the point clouds (highlighted by the ellipse 246 in Fig. 4E). Let us also observe that a change caused by a moving object in a given frame also implies a changed area in the other frame in its shadow region, which does not contain reflections due to occlusion (cf. the shadow regions of Figs. 4A-4B with Fig. 4D where e.g. the shadow region of the tram is well observable). This phenomenon is a consequence of our change definitions; however, the shadow changes can be filtered out by geometric constraints, if they are not needed for a given application.
Looking at Figs. 4C and 4D-4E: in Fig. 4C the two inputs (Figs. 4A-4B) are simply superimposed; it is not identified what is the same and what is a change in them. It is noted that the most important function of the change data block (map) is to mark the changes.
However, the changes cannot be obtained simply, because the two input data blocks are not registered (these cannot be subtracted from each other). Therefore, with the artificial intelligence on which the machine learning module is based, it is necessary to determine what is common and what is not in the inputs, avoiding the consequences of the above-mentioned fact that they are not registered.
Figs. 4D-4E contain the registration error, which can be observed more clearly in the top view of Fig. 4E. Although we illustrate these in one figure, we otherwise prefer to treat the two changes separately.
By comparing Figs. 4B, 4C, for example at the left edge of the circular part in the middle, it can be seen that there is a slight difference as they are not overlapped. This can also be observed in the corresponding part of Fig. 4E (the black and grey points - i.e. circle-like point series - are not overlapped).
So in Fig. 4D the transformation (rotation) is not really visible to the eye and not particularly disturbing. However, this fact is an excellent illustration that it is not trivial to deal with the lack of registration. The invention provides a solution to this problem, but prior art approaches cannot deal with the lack of registration adequately (see Figs. 5A-5D and 6A-6H).
The lack of registration (the input data blocks are not transformed onto each other) also makes it appropriate to use two separate change masks, because this way we get the change information separately for the two inputs that are not transformed into each other.
Figs. 6A-6H show comparative results of the ground truth and the predicted changes by ChangeGAN and the reference techniques for the region marked by rectangle 109 in Figs. 1A-1D and Figs. 5A-5D (Figs. 4A-4E illustrate a different scene). Figs. 6A-6B show originally coloured and greyscale versions of the same content (this statement holds true also for the pairs of Figs. 6C-6D, 6E-6F, and 6G-6H), respectively, i.e. the ground truth change mask. Furthermore, Figs. 6C-6D show the ChangeGAN predicted change, Figs. 6E-6F show the ChangeNet predicted change, and Figs. 6G-6H show the MRF predicted change.
Figs. 6I and 6J (arranged for Figs. 6A-6D and Figs. 6E-6H, respectively) show shade bars for showing the correspondence between the colours of the originally coloured and the greyscale versions, and, by the help of showing these correspondences, the colours of both versions can be interpreted when all of the figures are shown in greyscale or black and white. On the left side of Figs. 6I and 6J, from the top to the bottom, black, grey, blue and green colours were originally shown. On the right side of Figs. 6I and 6J the corresponding shades of grey can be seen. In Figs. 6A, 6C, 6E and 6G originally green and blue points (see the shade bars of Figs. 6I and 6J for interpreting the correspondence with the same content Figs. 6B, 6D, 6F and 6H, respectively) mark changed regions in P1 and P2, respectively. In Figs. 6A, 6C, 6E and 6G black shows the points of the first point cloud and grey shows the points of the second point cloud. A first ellipse 300 and a second ellipse 302 throughout Figs. 6A-6H mark the detected front and back parts of a bus travelling in the upper lane, which is meanwhile occluded by other cars (cf. Figs. 1A-1B where the movement of the bus can be observed and thus the change visualized in Figs. 6A-6H can be interpreted). A first square 304 shows a building facade segment, which was occluded in P1 (cf. Fig. 1A). The boxes 306 highlight false positive changes of the reference methods in Figs. 6E-6G confused by inaccurate registration (ChangeGAN in Figs. 6C-6D behaves very well in this region).
Thus, Figs. 6A-6H display another traffic situation (different from Figs. 4A-4E), where the output of the proposed ChangeGAN technique can be compared to the manually verified Ground Truth (Figs. 6A-6B) and to the two reference methods (Figs. 6E-6H) in the 3D point cloud domain.
As shown, our results (i.e. those obtained by a change detection generator module trained by the training method according to the invention, for which the training set generating method, also according to the invention, provided the training data) accurately reflect our change concept defined in this description, while the reference techniques cause multiple missing or false positive change regions.
Since a bus travelling in the upper lane was partially occluded by other cars, only its frontal and rear parts could be detected as changes (see ellipses 300 and 302). However, the ChangeNet model in Figs. 6E-6F missed detecting its frontal region in ellipse 300, most of the rear part in ellipse 302, and a partially occluded facade segment in square 304 (these are light grey coloured instead of the original green/dark grey in Figs. 6E-6F, i.e. - erroneously - (almost) no change has been detected in these regions).
In addition, both reference methods detected false changes in the bottom left corner of the image (in box 306: there are almost no changes in the ground truth and in the results of ChangeGAN in Figs. 6A-6D, but many changes - illustrated by originally blue and green points in Figs. 6E and 6G, and by many shades of grey in this region in Figs. 6F and 6H - are shown in this region; in greyscale it is advantageous to show the content of the box 306 with two types of colouring, since in the box 306 Figs. 6F and 6H show much more variability than Figs. 6E and 6G, see this in view of Fig. 6B below), which were caused by the inaccurate registration (please find more details above: these are false positive results in Figs. 6E-6H, since these approaches detect change in this region, although the ground truth shows in Figs. 6A-6B that no real change to be detected can be found there). In box 306 there are the same objects, i.e. no change should be detected (there is only a registration difference, but this cannot be tolerated by ChangeNet and MRF). Moreover, MRF gives many false positive changes in the upper right corner (cf. Figs. 1A-1B; this region is not denoted by an ellipse or a square).
Finally, we note with emphasis that our method has also performed successfully for frame pairs from the KITTI dataset (see A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, Sept. 2013.), which were completely independent of our training process.
In the following, robustness analysis is described. To evaluate the performance dependency of the discussed methods on the translation and orientation differences between the compared point clouds, we generated two specific sample subsets within the new Change3D dataset. This experiment was based on 500 (originally registered as interpreted above) point cloud pairs, selected from the 2,000 test sample pairs of the dataset.
For translation-dependency analysis, we used an offset domain of [0.1, 1.0] meters, which was discretized using 10 equally spaced bins. For test set generation, we iterated through all the 500 point cloud pairs: for every sample, we chose for each translation bin 0.1 ≤ t_i ≤ 1.0 (i = 1 ... 10) a random rotation value -10° ≤ α_i ≤ 10° and transformed the second cloud P2 using (t_i, α_i) (see Fig. 7 showing the translation (hollow marks) and rotation (solid marks) dependency of the compared methods' performance (F1-score); translation steps are between [0.1, 1.0] meters with a 10 cm step along the x-axis, rotation steps are between 1° and 10° with a 1° step along the x-axis).
With this process, for each offset bin, we generated 500 coarsely registered point cloud pairs with known registration errors. Accordingly, in total, 10 subsets were created for the 10 offset bins, each one containing 500 samples (i.e. we generated 10 small databases with different sizes of transformations in this case and in the case of rotation below).
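A sketch of how such offset-bin subsets can be generated is given below (the uniformly random direction of the planar offset, the parameter names and the bin keys are illustrative assumptions):

```python
import numpy as np

def build_translation_bins(pairs, max_rot_deg=10.0, rng=None):
    """For every offset bin t_i in {0.1, 0.2, ..., 1.0} m, create a copy of
    each registered pair in which the second cloud is shifted by t_i in the
    x-y plane and rotated by a random angle within ±10° around the z-axis."""
    rng = np.random.default_rng() if rng is None else rng
    subsets = {}
    for t in np.linspace(0.1, 1.0, 10):
        samples = []
        for p1, p2 in pairs:
            phi = rng.uniform(0.0, 2.0 * np.pi)          # offset direction
            shift = np.array([t * np.cos(phi), t * np.sin(phi), 0.0])
            alpha = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
            c, s = np.cos(alpha), np.sin(alpha)
            rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
            samples.append((p1, p2 @ rot_z.T + shift))
        subsets[round(float(t), 1)] = samples
    return subsets
```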
Next, we ran our proposed method and the reference techniques on this new set, and we calculated the mean F1-score value (see C. Benedek, "3d people surveillance on range data sequences of a rotating lidar," Pattern Recognition Letters, vol. 50, pp. 149-158, 2014 (Depth Image Analysis) and J. Schauer and A. Nüchter, "Removing non-static objects from 3d laser scan data," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 143, pp. 15-38, 2018.) for each translation bin i, among samples having an offset parameter t_i.
Thus, Fig. 7 displays with hollow marks the average F1-scores as a function of the various t_i values. The proposed method shows only a graceful degradation with increased offsets, and even for a t_i = 1 meter offset, the quality of change detection is significantly better than the nearly constant low values provided by the reference approaches.
For measuring the rotation-dependency of the models, we have performed a similar experiment: here we discretized the -10° ≤ α ≤ 10° rotation domain with 10 bins, and within each bin, we generated 500 sample pairs, with random translation values. Finally, we averaged the measured F1-scores within each rotation bin (see C. Benedek, "3d people surveillance on range data sequences of a rotating lidar," Pattern Recognition Letters, vol. 50, pp. 149-158, 2014 (Depth Image Analysis) and J. Schauer and A. Nüchter, "Removing non-static objects from 3d laser scan data," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 143, pp. 15-38, 2018.). Results shown in Fig. 7 with solid marks confirm again the superiority of the proposed method against the tested references.
As shown in Fig. 7, ChangeNet performs consistently poorly on both offset and rotation. We explained this by saying that ChangeNet is "so bad by default" that it cannot be made worse by increasing the registration error. MRF performs better, but consistently worse than ChangeGAN, and MRF cuts off as a function of rotation. Overall, ChangeGAN only drops to a poor performance level above an 80 cm offset, and even this level corresponds to the best performance achieved by the two prior art approaches (see the beginning of the MRF rotation curve).

Giving conclusions, in this description ChangeGAN (the training method according to the invention and the training set generating method corresponding thereto), a novel, robust and quick change detection method was presented, which is capable of detecting differences (changes) between coarsely registered point cloud pairs. It has been shown that our approach outperforms in effectiveness both a state-of-the-art deep learning method (ChangeNet) trained on range images, and a 3D voxel-level, MRF-based change detection technique. In other words, in this description we introduced a novel change detection approach called ChangeGAN for coarsely registered point clouds in a preferably complex street-level urban environment.
Accordingly, the following points are emphasized in connection with the advantages of the invention:
- We provide a training method for a new task, specifically for the task of detecting change for a coarsely registered pair of 3D information data blocks; the training method is specialized so that the changes in the coarsely registered 3D information data block pair are effectively detected, i.e. so that a change detection system of high efficiency is achieved.
- We provide a complete method (solution) for point-based detection of changed regions due to object displacements between initially unmatched (coarsely registered) and individually very incomplete (limited resolution, inhomogeneous characteristics, limited range of viewing angles/distance) point clouds (in general 3D information data blocks).
- We emphasize the need to solve this task, because in these practical cases reliable registration and therefore change detection cannot be achieved with currently available methods. The task of change detection for 3D information data blocks, in particular point clouds, has not yet been approached this way, but we have provided a solution which, based on our experiments, is workable and more efficient than known approaches.
- The specification and generation of the dataset used to train the neural network is also highly advantageous: it would be almost impossible to annotate the data manually, but starting from real images, we can create a training set (training database) of satisfactory size by artificial perturbation steps, which can be used to perform highly effective recognition on previously unregistered data.
- Accordingly, in the invention, we have identified a new way of posing a problem: we want to describe the differences without exactly matching the available input models, and we have designed the necessary architecture/database/training method.
As touched upon above, our generative adversarial network (GAN) architecture preferably compounds Siamese-style feature extraction, U-net-like use of multiscale features, and STN blocks for optimal transformation estimation. The input point clouds - as typical inputs - are preferably represented by range images, which advantageously enables the use of 2D convolutional neural networks. The result is preferably a pair of binary masks showing the change regions on each input range image, which can be backprojected to the input point clouds (i.e. to the change detection generator module) without loss of information.
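As an illustration of the STN idea mentioned above, a minimal 2D spatial transformer block is sketched below in PyTorch (the layer sizes, the localization network and the identity initialization are assumptions of this sketch and do not describe the exact architecture used):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer2D(nn.Module):
    """Minimal STN-style block: a small localization network predicts a 2x3
    affine matrix, which is used to resample the input feature map so that a
    coarse translation/rotation between the two branches can be compensated."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, 6),
        )
        # initialize the predicted transform to the identity
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```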
As concluded above, we have evaluated the proposed method according to the invention on various challenging scenarios and we have shown its superiority against state-of-the-art change detection methods.
The followings are noted in connection with restrictions and generalization possibilities. During the tests performed so far, the same sensor (Velodyne HDL-64E) was used at different time instants to record the point clouds to be compared. There was an unknown offset difference of maximum 1 meter between the reference points of the point clouds, and their orientations differed by a maximum rotation of 10° (around a vertical axis). Neither the previous restrictions on the type of sensor nor those on the degree of shift/rotation have conceptual relevance; it is only that the testing of the proposed method and the detailed documentation of the results have been done with these parameters so far.
To summarize in connection with the exact details of the method, the following were given above: the object of the method, the most important professional literature background and references, a description of the implementation, the structure and parameters of the exemplary applied neural network, the specification of the data used to train and test the method and the way they were recorded, as well as a comparison of the results of the tests with the literature methods and the conclusions drawn from them. The present invention is not limited to the preferred embodiments presented above, and further variants, modifications, changes, and improvements may also be conceived within the scope defined by the claims.

Claims

1. A training method for training a change detection system for detecting change for a coarsely registered pair (150, 200) of a first 3D information data block (100a, 150a, 200a, 240a) and a second 3D information data block (100b, 150b, 200b, 240b), wherein
- the change detection system comprises a change detection generator module (160, 210) based on machine learning and adapted for generating a change data block (175a, 175b, 215) for a coarsely registered pair (150, 200) of a first 3D information data block (100a, 150a, 200a, 240a) and a second 3D information data block (100b, 150b, 200b, 240b), and
- in course of the training method a discriminator module (220) based on machine learning is applied, and in at least one training cycle of the training method for a plurality of coarsely registered pairs (150, 200) of first 3D information data block (100a, 150a, 200a, 240a) and second 3D information data block (100b, 150b, 200b, 240b) and a plurality of respective target change data blocks (120a, 120b, 205, 260) of a training set
- generating (S180) by means of the change detection generator module (160, 210) change data blocks (175a, 175b, 215) for the plurality of coarsely registered pairs (150, 200) of the training set and generating (S185) a generator loss contribution (225) based on corresponding combinations of a change data block (175a, 175b, 215) and a target change data block (120a, 120b, 205, 260),
- generating (S190) a discriminator loss contribution (230) by applying the discriminator module (220) on a plurality of coarsely registered pairs (150, 200) of the training set, as well as corresponding target change data blocks (120a, 120b, 205, 260) of the training set and corresponding change data blocks (175a, 175b, 215), and
- training (S195) the change detection generator module (160, 210) by a combined loss (235) obtained from a summation of the generator loss contribution (225) and the discriminator loss contribution (230), wherein in the summation at least one of the generator loss contribution (225) and the discriminator loss contribution (230) is multiplied by a respective loss multiplicator (227).
2. The method according to claim 1, characterized in that
- the coarsely registered pair (150, 200) of the first 3D information data block (100a, 150a, 200a, 240a) and the second 3D information data block (100b, 150b, 200b, 240b) is constituted by a coarsely registered range image pair of a first range image and a second range image, or
- a coarsely registered range image pair of a first range image and a second range image is generated from the coarsely registered pair (150, 200) of the first 3D information data block (100a, 150a, 200a, 240a) and the second 3D information data block (100b, 150b, 200b, 240b) before the application of the at least one training cycle.
3. The method according to claim 1 or claim 2, characterized by applying a spatial transformer module (165a, 165b)
- based on machine learning
- adapted for helping in processing any translation and/or rotation between the first 3D information data block (100a, 150a, 200a, 240a) and the second 3D information data block (100b, 150b, 200b, 240b) corresponding to a coarsely registered pair (150, 200), and
- comprised in the change detection generator module (160, 210).
4. The method according to claim 3, characterized by applying
- a downsampling unit (162) having a first row of downsampling subunits (164a, 164b) and interconnected with
- an upsampling unit (170) having a second row of upsampling subunits (172) and corresponding to the downsampling unit (162), wherein the downsampling unit (162) and the upsampling unit (170) is comprised in the change detection generator module (160, 210) and the spatial transformer module (165a, 165b) is arranged in the downsampling unit (162) within the first row of downsampling subunits (164a, 164b).
5. The method according to any of claims 1-4, characterized in that a change mask image having a plurality of first pixels is generated by the change detection generator module (160, 210) as the change data block (175a, 175b, 215) and the target change data block (120a, 120b, 205, 260) is constituted by a target change mask image having a plurality of second pixels, wherein to each of the plurality of first pixels and a plurality of second pixels presence of a change or absence of a change is assigned.
6. A change detection system adapted for detecting change for a coarsely registered pair (150, 200) of a first 3D information data block (100a, 150a, 200a, 240a) and a second 3D information data block (100b, 150b, 200b, 240b), wherein the system
- is trained by the training method according to any of claims 1-5, and
- comprises the change detection generator module (160, 210).
7. A training set generating method for generating a plurality of coarsely registered pairs (150, 200) of a first 3D information data block (100a, 150a, 200a, 240a) and a second 3D information data block (100b, 150b, 200b, 240b) and a plurality of respective target change data blocks (120a, 120b, 205, 260) of a training set for applying in the training method according to any of claims 1-5, in the course of the method
- generating or providing a plurality of registered base pairs (300) of a first base 3D information data block and a second base 3D information data block,
- performing change annotation in a change annotation step (S310) on the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs (300),
- generating, by transforming in a transformation step (S320) at least one of the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs (300), a first resultant 3D information data block and a second resultant 3D information data block for each of the plurality of registered base pairs (300), and
- generating in a training data generation step (S330) the plurality of coarsely registered pairs (150, 200) and the plurality of respective target change data blocks (120a, 120b, 205, 260) of the training set based on a respective first resultant 3D information data block and second resultant 3D information data block.
8. The method according to claim 7, characterized by applying before the change annotation step (S310) one or more artificial change by addition or deletion to any of the first base 3D information data block and the second base 3D information data block of each of the plurality of registered base pairs (300).
9. The method according to claim 7 or claim 8, characterized in that the plurality of registered base pairs (300) of a first base 3D information data block and a second base 3D information data block are constituted by a plurality of registered base point cloud pairs of a first base point cloud having a plurality of first points and a second base point cloud having a plurality of second points, and in the change annotation step (S310)
- applying a 3D voxel grid having a plurality of voxels on the first base point cloud and the second base point cloud,
- for each voxel assigning change to every first point comprised in a voxel and to every second point comprised in that voxel in which
o a first voxel point ratio of those first points of the first base point cloud not having a corresponding second point in the second base point cloud and a first total number of first points is above a predetermined first ratio limit, and
o a second voxel point ratio of those second points of the second base point cloud not having a corresponding first point in the first base point cloud and a second total number of second points is above a predetermined second ratio limit, respectively, and
- a first target change data block (120a) and a second target change data block (120b) are generated based on assigned change for first points and second points, respectively.
10. The method according to claim 9, characterized in that the predetermined first ratio limit and the predetermined second ratio limit is both 0.9.
11. The method according to any of claims 7-10, characterized by applying in the transformation step randomly an up to ±1m translation and/or an up to ±10° rotation transform.
Non-Patent Citations (32)

* Cited by examiner, † Cited by third party

A. Börcs, B. Nagy, C. Benedek: "Fast 3-D urban object detection on streaming point clouds", ECCV 2015 Workshops, 2015
A. Geiger, P. Lenz, C. Stiller, R. Urtasun: "Vision meets robotics: The KITTI dataset", International Journal of Robotics Research, vol. 32, no. 11, pp. 1231-1237, September 2013, XP055674191, DOI: 10.1177/0278364913491297
A. Varghese, J. Gubbi, A. Ramaswamy, P. Balamuralidhar: "ChangeNet: A deep learning architecture for visual change detection", ECCV 2018 Workshops, 2019, pp. 129-145, XP047501655, DOI: 10.1007/978-3-030-11012-3_10
B. Galai, C. Benedek: "Change detection in urban streets by a real time Lidar scanner and MLS reference data", Int. Conf. Image Analysis and Recognition, 2017, pp. 210-220, XP047417521, DOI: 10.1007/978-3-319-59876-5_24
B. Nagy, C. Benedek: "Real-time point cloud alignment for vehicle localization in a high resolution 3d map", ECCV 2018 Workshops, 2019, pp. 226-239, XP047501115, DOI: 10.1007/978-3-030-11009-3_13
Bin Hou et al.: "From W-Net to CDGAN: Bi-temporal Change Detection via Deep Learning Techniques", arXiv.org, Cornell University Library, 14 March 2020, XP081621478, DOI: 10.1109/TGRS.2019.2948659 *
C. Benedek: "3d people surveillance on range data sequences of a rotating lidar", Pattern Recognition Letters, vol. 50, pp. 149-158, 2014, XP029075223, DOI: 10.1016/j.patrec.2014.04.010
C. Benedek, B. Galai, B. Nagy, Z. Janko: "Lidar-based gait analysis and activity recognition in a 4d surveillance system", IEEE Trans. Circuits Syst. Video Techn., vol. 28, no. 1, pp. 101-113, 2018
C. Benedek, X. Descombes, J. Zerubia: "Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics", IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 33-50, 2012
C. E. Metz: "Basic principles of ROC analysis", Seminars in Nuclear Medicine, vol. 8, no. 4, pp. 283-298, 1978, XP055468855, Retrieved from the Internet: <URL:https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient>, DOI: 10.1016/S0001-2998(78)80014-2
C.-C. Wang, C. Thorpe: "Simultaneous localization and mapping with detection and tracking of moving objects", Int. Conf. on Robotics and Automation (ICRA), vol. 3, pp. 2918-2924, 2002, XP010589657
David M. W. Powers: "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation", arXiv:2010.16061v1
E. Guo, X. Fu, J. Zhu, M. Deng, Y. Liu, Q. Zhu, H. Li: "Learning to measure change: Fully convolutional Siamese metric networks for scene change detection", arXiv:1810.09111, 2018
F. Oberti, L. Marcenaro, C. S. Regazzoni: "Real-time change detection methods for video-surveillance systems with mobile camera", European Signal Processing Conference, pp. 1-4, 2002, XP032754042
J. Bromley, J. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, R. Shah: "Signature verification using a "Siamese" time delay neural network", International Journal of Pattern Recognition and Artificial Intelligence, vol. 7, August 1993, p. 25
J. Schauer, A. Nüchter: "Removing non-static objects from 3d laser scan data", ISPRS Journal of Photogrammetry and Remote Sensing, vol. 143, pp. 15-38, 2018, XP085438623, DOI: 10.1016/j.isprsjprs.2018.05.019
K. He et al.: "Deep Residual Learning for Image Recognition", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016, XP055536240, DOI: 10.1109/CVPR.2016.90
M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu: "Spatial transformer networks", Advances in Neural Information Processing Systems (NIPS), 2015
O. Ronneberger, P. Fischer, T. Brox: "U-net: Convolutional networks for biomedical image segmentation", Int. Conf. Medical Image Computing and Comp.-Ass. Intervention, pp. 234-241, 2015
P. Luc, C. Couprie, S. Chintala, J. Verbeek: "Semantic segmentation using adversarial networks", NIPS 2016 Workshop on Adversarial Training, December 2016
P. Xiao, X. Zhang, D. Wang, M. Yuan, X. Feng, M. Kelly: "Change detection of built-up land: A framework of combining pixel-based detection and object-based recognition", ISPRS J. Photogramm. Remote Sens., vol. 119, pp. 402-414, 2016, XP029727832, DOI: 10.1016/j.isprsjprs.2016.07.003
Pauline Luc, C. Couprie, Soumith Chintala, Jakob Verbeek: "Semantic Segmentation using Adversarial Networks", NIPS Workshop on Adversarial Training, December 2016, Barcelona, Spain
Peng Daifeng et al.: "SemiCDNet: A Semisupervised Convolutional Neural Network for Change Detection in High Resolution Remote-Sensing Images", IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 7, pp. 5891-5906, 6 August 2020, XP011862589, ISSN: 0196-2892, DOI: 10.1109/TGRS.2020.3011913 *
R. Qin, A. Gruen: "3D change detection at street level using mobile laser scanning point clouds and terrestrial images", ISPRS J. Photogramm. Remote Sens., vol. 90, pp. 23-35, 2014
R. Qin, J. Tian, P. Reinartz: "3D change detection - Approaches and applications", ISPRS J. Photogramm. Remote Sens., vol. 122, pp. 41-56, 2016, XP029837652, DOI: 10.1016/j.isprsjprs.2016.09.013
S. Ji, Y. Shen, M. Lu, Y. Zhang: "Building instance change detection from large-scale aerial images using convolutional neural networks and simulated samples", Remote Sensing, vol. 11, no. 11, 2019
Shunping Ji et al.: "Building Instance Change Detection from Large-Scale Aerial Images using Convolutional Neural Networks and Simulated Samples", Remote Sensing, vol. 11, no. 11, p. 1343, 4 June 2019, XP055736618, DOI: 10.3390/rs11111343 *
Varghese Ashley et al.: "ChangeNet: A Deep Learning Architecture for Visual Change Detection", Lecture Notes in Computer Science, Springer International Publishing, Cham, pp. 129-145, 29 January 2019, ISBN: 978-3-319-10403-4, XP047501655 *
W. Xiao, B. Vallet, K. Schindler, N. Paparoditis: "Street-side vehicle detection, classification and change detection using mobile laser scanning data", ISPRS J. Photogramm. Remote Sens., vol. 114, pp. 166-178, 2016, XP029473526, DOI: 10.1016/j.isprsjprs.2016.02.007
W. Xiao, B. Vallet, M. Brédif, N. Paparoditis: "Street environment change detection from mobile laser scanning point clouds", ISPRS J. Photogramm. Remote Sens., vol. 107, pp. 38-49, September 2015, XP055785045, DOI: 10.1016/j.isprsjprs.2015.04.011
Y. Wang, Q. Chen, Q. Zhu, L. Liu, C. Li, D. Zheng: "A survey of mobile laser scanning applications and key techniques over urban areas", Remote Sensing, vol. 11, no. 13, pp. 1-20, 2019
Y. Zhan, K. Fu, M. Yan, X. Sun, H. Wang, X. Qiu: "Change Detection Based on Deep Siamese Convolutional Network for Optical Aerial Images", IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1845-1849, 2017, XP011661460, DOI: 10.1109/LGRS.2017.2738149

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116049012B (en) * 2023-02-24 2025-09-02 广西玉柴机器股份有限公司 High-precision map reconstruction software testing method and system based on ADASIS protocol
CN116049012A (en) * 2023-02-24 2023-05-02 广西玉柴机器股份有限公司 High-precision map reconstruction software testing method and system based on ADASIS protocol
CN116311482A (en) * 2023-05-23 2023-06-23 中国科学技术大学 Face forgery detection method, system, device and storage medium
CN116311482B (en) * 2023-05-23 2023-08-29 中国科学技术大学 Face fake detection method, system, equipment and storage medium
CN116452983A (en) * 2023-06-12 2023-07-18 合肥工业大学 Quick discovering method for land landform change based on unmanned aerial vehicle aerial image
CN116452983B (en) * 2023-06-12 2023-10-10 合肥工业大学 A method to quickly detect changes in national landforms based on drone aerial images
CN117456441A (en) * 2023-08-14 2024-01-26 金钱猫科技股份有限公司 Monitoring method and system for rust area expansion by combining change area identification
CN116740669B (en) * 2023-08-16 2023-11-14 之江实验室 Multi-view image detection method, device, computer equipment and storage medium
CN116740669A (en) * 2023-08-16 2023-09-12 之江实验室 Multi-eye image detection method, device, computer equipment and storage medium
CN117237807A (en) * 2023-09-25 2023-12-15 中国矿业大学(北京) A GAN instance-level local enhancement method for remote sensing change detection in open-pit mining areas
CN117574259B (en) * 2023-10-12 2024-05-07 南京工业大学 An explainable diagnostic method for attention twin intelligent transfer for high-end equipment
CN117574259A (en) * 2023-10-12 2024-02-20 南京工业大学 An explainable diagnostic method for attention twin intelligent transfer, suitable for high-end equipment
CN117671437A (en) * 2023-10-19 2024-03-08 中国矿业大学(北京) Open-pit stope identification and change detection method based on multi-task convolutional neural network
CN117152622B (en) * 2023-10-30 2024-02-23 中国科学院空天信息创新研究院 Boundary optimization model training, boundary optimization method, device, equipment and medium
CN117152622A (en) * 2023-10-30 2023-12-01 中国科学院空天信息创新研究院 Boundary optimization model training, boundary optimization method, device, equipment and medium
CN117456349A (en) * 2023-12-03 2024-01-26 西北工业大学 An unsupervised SAR and optical image change detection method based on pseudo-sample learning
CN118037793A (en) * 2024-03-13 2024-05-14 首都医科大学附属北京安贞医院 A registration method and device for intraoperative X-ray and CT images
CN118865100A (en) * 2024-06-24 2024-10-29 南京林业大学 A fast survey method of roadside trees based on street view images
CN118865120A (en) * 2024-07-05 2024-10-29 中北大学 A multi-scale feature fusion deep network with adjustable recall for remote sensing change interpretation
CN119399624A (en) * 2024-10-12 2025-02-07 自然资源部第一航测遥感院(陕西省第五测绘工程院) A method, system, device and storage medium for detecting changes in cultivated land from non-agricultural use
CN119131701A (en) * 2024-11-13 2024-12-13 浙江省测绘科学技术研究院 A method, system, device and medium for detecting a building change area
CN119596356A (en) * 2024-11-15 2025-03-11 哈尔滨工业大学 A method and system for rapid detection and positioning of high-value targets based on ultra-wide-width rotary scanning remote sensing images
CN119672964A (en) * 2025-02-21 2025-03-21 贵州道坦坦科技股份有限公司 High-speed vehicle flow detection method and system based on deep learning
CN120047450A (en) * 2025-04-27 2025-05-27 南京信息工程大学 Remote sensing image change detection method based on double-current time-phase characteristic adapter

Also Published As

Publication number Publication date
EP4377913A1 (en) 2024-06-05

Similar Documents

Publication Publication Date Title
WO2023007198A1 (en) Training method for training a change detection system, training set generating method therefor, and change detection system
Alcantarilla et al. Street-view change detection with deconvolutional networks
CN116229408A (en) A target recognition method based on the fusion of image information and lidar point cloud information
US10043097B2 (en) Image abstraction system
Zhao et al. Road network extraction from airborne LiDAR data using scene context
GB2554481A (en) Autonomous route determination
Lin et al. Planar-based adaptive down-sampling of point clouds
Nagy et al. ChangeGAN: A deep network for change detection in coarsely registered point clouds
Nagy et al. 3D CNN-based semantic labeling approach for mobile laser scanning data
Pan et al. Automatic road markings extraction, classification and vectorization from mobile laser scanning data
Li et al. 3D map system for tree monitoring in hong kong using google street view imagery and deep learning
CN119625279A (en) Multimodal target detection method, device and multimodal recognition system
Huang et al. Overview of LiDAR point cloud target detection methods based on deep learning
Gigli et al. Road segmentation on low resolution lidar point clouds for autonomous vehicles
Parmehr et al. Automatic registration of optical imagery with 3d lidar data using local combined mutual information
Song et al. Automatic detection and classification of road, car, and pedestrian using binocular cameras in traffic scenes with a common framework
Li et al. Fusion strategy of multi-sensor based object detection for self-driving vehicles
Fehr et al. Reshaping our model of the world over time
KR102249380B1 (en) System for generating spatial information of CCTV device using reference image information
Pandey et al. Toward mutual information based place recognition
Zhu et al. Precise spatial transformation mechanism for small-size-aware roadside 3D object detection on traffic surveillance cameras
Iwaszczuk et al. Detection of windows in IR building textures using masked correlation
KR102801241B1 (en) Electronic device for generating a depth map and method of operating the same
Babu et al. Enhanced Point Cloud Object Classification using Convolutional Neural Networks and Bearing Angle Image
de Paz Mouriño et al. Multiview rasterization of street cross-sections acquired with mobile laser scanning for semantic segmentation with convolutional neural networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 22760770
Country of ref document: EP
Kind code of ref document: A1

WWE Wipo information: entry into national phase
Ref document number: 2022760770
Country of ref document: EP

NENP Non-entry into the national phase
Ref country code: DE

ENP Entry into the national phase
Ref document number: 2022760770
Country of ref document: EP
Effective date: 20240227

WWE Wipo information: entry into national phase
Ref document number: 18714473
Country of ref document: US