CN115601661A

CN115601661A - A building change detection method for urban dynamic monitoring

Info

Publication number: CN115601661A
Application number: CN202211344397.7A
Authority: CN
Inventors: 徐川; 叶昭毅; 杨威; 梅礼晔; 张琪; 李迪
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2022-10-31
Filing date: 2022-10-31
Publication date: 2023-01-13
Anticipated expiration: 2042-10-31
Also published as: CN115601661B

Abstract

The invention discloses a building change detection method for urban dynamic monitoring, which receives an urban surface double-time-phase image detected by a remote sensing satellite through a satellite technology, cuts an original image, inputs the cut image into an urban building automatic detection model, and outputs a change detection result of a building in the double-time-phase image; the automatic detection model of the urban building comprises an encoding stage and a decoding stage. In the encoding stage, a twin network shared by weight is adopted to carry out down-sampling operation on the input double-time phase image, rich multi-scale characteristic information is extracted, and meanwhile, the expression of the characteristic information is enhanced by utilizing a twin cross attention mechanism; in the decoding stage, a multi-scale feature fusion module is adopted to carry out progressive fusion on the extracted multi-scale features; and the detection result is pushed to be closer to the actual change condition by using the differential context discrimination module. The method can efficiently judge and fuse multiple features, thereby improving the accuracy of detecting the change of the urban buildings.

Description

A building change detection method for urban dynamic monitoring

技术领域technical field

本发明属于城市动态监测领域，更具体地，涉及一种用于城市动态监测的建筑物变化检测方法。The invention belongs to the field of urban dynamic monitoring, and more particularly relates to a building change detection method for urban dynamic monitoring.

背景技术Background technique

目前大部分的城市建筑物自动化监测系统需要在城市周围铺设大规模的检测设备以及电缆，而设备的供电和维护需要很高的成本，同时，受到信号干扰，拍摄角度变化，光照影响等因素较大，会导致检测系统发生误监和漏监等问题。遥感技术能够以固定的时间间隔获取地球表面的信息，并提取同一表面在多个时间段的动态变化。城市建筑物自动化检测模型基于遥感变化检测技术，其任务是观察同一目标在不同时期的差异变化，并对每个影像像素点进行标签分类，即标签0(不变)和标签1(变化)。到目前为止，研究人员在遥感变化检测的理论和应用方面做了大量的工作。这些贡献对土地资源管理、城市建设和规划、违章施工管理等方面具有重要意义。At present, most of the automatic monitoring systems of urban buildings need to lay large-scale detection equipment and cables around the city, and the power supply and maintenance of the equipment require high costs. If it is large, it will lead to problems such as false monitoring and missing monitoring in the detection system. Remote sensing technology can acquire the information of the earth's surface at fixed time intervals and extract the dynamic changes of the same surface in multiple time periods. The automatic detection model of urban buildings is based on remote sensing change detection technology. Its task is to observe the difference changes of the same target in different periods, and to classify each image pixel, that is, label 0 (unchanged) and label 1 (change). So far, researchers have done a lot of work on the theory and application of remote sensing change detection. These contributions are of great significance to land resource management, urban construction and planning, and illegal construction management.

在过去的几十年里，许多算法被提出用于遥感影像变化检测的模型。这些算法可以大致分为两类：传统方法和基于深度学习的方法。对于传统方法，一开始受限于遥感影像的分辨率，多采用基于像素的方法进行变化检测，利用变化向量分析(change vectoranalysis,CVA)和主成分分析(principal component analysis, PCA)对每个像素点的光谱特征进行分析，从而进行变化检测。随着航空航天和遥感技术的快速发展，获取高分辨率遥感影像的能力得到了增强。学者们在变化检测领域引入对象的概念，主要使用基于对象层次的的光谱、纹理和空间背景信息进行变化检测。虽然这些方法在当时能够取得较好的效果，但传统方法需要人工设计特征且规定阈值来保证最终的检测效果，并且只能提取浅层特征，无法充分表征高分辨率遥感影像中建筑物的变化，因而难以满足现实中对精度的要求。Over the past few decades, many algorithms have been proposed for models of change detection in remote sensing imagery. These algorithms can be roughly divided into two categories: traditional methods and deep learning-based methods. For traditional methods, limited by the resolution of remote sensing images at the beginning, pixel-based methods are mostly used for change detection, and change vector analysis (change vector analysis, CVA) and principal component analysis (principal component analysis, PCA) are used to analyze each pixel The spectral features of the points are analyzed for change detection. With the rapid development of aerospace and remote sensing technology, the ability to obtain high-resolution remote sensing images has been enhanced. Scholars have introduced the concept of objects in the field of change detection, mainly using object-level spectral, texture and spatial background information for change detection. Although these methods were able to achieve good results at the time, the traditional methods required manual design of features and specified thresholds to ensure the final detection effect, and could only extract shallow features, which could not fully represent the changes of buildings in high-resolution remote sensing images. , so it is difficult to meet the accuracy requirements in reality.

另一方面，随着计算能力的发展和海量数据的积累，基于深度学习的变化检测算法因为其强大的性能，已经成为了主流。目前，基于深度学习的变化检测方法大多是在对比学习和分割任务上具有较好效果的网络发展而来。部分学者采用聚焦对比损失进行变化检测，减少了类内方差，增加了类间差异，最终通过阈值限制得到二值化检测结果。分割网络以图像分割的思想进行变化检测，代表性的例子有U形网络(UNet)和全卷积神经网络(Fully Convolutional Networks,FCN) 以及DeepLab系列网络。On the other hand, with the development of computing power and the accumulation of massive data, the change detection algorithm based on deep learning has become the mainstream because of its powerful performance. At present, most of the deep learning-based change detection methods are developed from networks that have a good effect on contrastive learning and segmentation tasks. Some scholars use focused contrast loss for change detection, which reduces intra-class variance and increases inter-class differences, and finally obtains binarized detection results through threshold restrictions. The segmentation network uses the idea of image segmentation for change detection. Representative examples include U-shaped network (UNet), fully convolutional neural network (Fully Convolutional Networks, FCN) and DeepLab series networks.

尽管这些方法也达到了很高的性能，但仍存在以下问题：首先，当有的前后时序影像中存在大量的伪变化时，当前的注意力机制无法高效且有针对性的关注于无变化区域和变化区域，这会导致严重的错误检测现象的发生。其次，现有网络中存在大量的下采样和上采样操作导致前后时序影像中特征信息的丢失，粗暴的融合策略更是加深了这一问题，这使得网络在最后的变化检测时无法很好地恢复影像原有的特征，最终的检测结果会存在漏检和变化边缘不整齐等问题。最终，当前算法无法很好的对上下文信息进行差分处理，因此对于存在很多伪变化的城市建筑物影像的检测效果并不好。Although these methods have also achieved high performance, there are still the following problems: First, when there are a large number of spurious changes in some front and back time-series images, the current attention mechanism cannot efficiently and targetedly focus on the no-change area and changing regions, which can lead to serious false detection phenomena. Secondly, there are a large number of down-sampling and up-sampling operations in the existing network, which leads to the loss of feature information in the front and rear time-series images, and the rough fusion strategy aggravates this problem, which makes the network unable to perform well in the final change detection. To restore the original features of the image, the final detection result will have problems such as missed detection and irregular edges. Finally, the current algorithm cannot perform differential processing on context information well, so the detection effect on urban building images with many spurious changes is not good.

发明内容Contents of the invention

针对现有技术的缺陷以及改进需求，本发明提供了一种用于城市动态检测的建筑物变化检测方法，可以精准地实现城市建筑物自动化检测。包括如下步骤：Aiming at the defects and improvement needs of the prior art, the present invention provides a building change detection method for urban dynamic detection, which can accurately realize automatic detection of urban buildings. Including the following steps:

S1，采用遥感卫星采集到的城市建筑物影像作为数据集，获取数据集中各个建筑物对应的实际变化影像，将实际变化影像与对应的双时相影像划分为训练集和测试集；S1, using the images of urban buildings collected by remote sensing satellites as a data set, obtaining the actual change images corresponding to each building in the data set, and dividing the actual change images and the corresponding dual-temporal images into a training set and a test set;

S2，构建由编码器和解码器组成的建筑物自动化检测模型，所述编码器包括权重共享的双通道孪生网络和孪生交叉注意力模块，所述解码器包括多尺度特征融合以及差分上下文判别模块；S2. Build an automatic building detection model consisting of an encoder and a decoder. The encoder includes a weight-sharing dual-channel Siamese network and a Siamese cross-attention module. The decoder includes a multi-scale feature fusion and differential context discrimination module. ;

所述权重共享的双通道孪生网络包括批归一化层和多个上采样块，输入双时相影像，用于获取不同尺度的特征图；The weight-shared dual-channel Siamese network includes a batch normalization layer and multiple upsampling blocks, inputting bitemporal images for obtaining feature maps of different scales;

所述孪生交叉注意力模块首先对不同尺度的特征图进行嵌入操作，然后利用多头交叉注意力机制提取更深层次的变化特征语义信息，提高对特征信息的全局关注度；The twin cross-attention module first performs embedding operations on feature maps of different scales, and then uses the multi-head cross-attention mechanism to extract deeper semantic information of changing features to improve global attention to feature information;

所述多尺度特征融合模块采用重建与上采样块的双重渐进式融合策略，将提取到的含有丰富多尺度语义信息的特征进行融合；The multi-scale feature fusion module adopts a double progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information;

所述差分上下文判别模块的输入为多尺度融合模块的输出影像以及前后时序差分影像，目的是结合影像中的上下文信息，提高网络的判别能力，使得检测结果影像更加趋近于真实变化影像，从而提高检测准确度；The input of the differential context discrimination module is the output image of the multi-scale fusion module and the front and rear temporal difference images. The purpose is to combine the context information in the image to improve the discrimination ability of the network, so that the detection result image is closer to the real change image, thereby Improve detection accuracy;

S3，利用S1中的训练集对S2中的建筑物自动化检测模型进行训练，利用训练好的模型实现建筑物变化检测。S3, using the training set in S1 to train the building automatic detection model in S2, and using the trained model to realize building change detection.

在一些可选的实施方案中，步骤S1包括：In some optional embodiments, step S1 includes:

采用人工制作城市建筑物变化影像作为数据集，根据数据集中的前后时序影像来制作实际变化影像，其中，实际变化影像是前后时序影像中的变化区域，前后时序影像中的每个像素代表一种类别(未变化或变化)。Using artificially produced city building change images as a data set, the actual change images are produced according to the front and back time series images in the data set, where the actual change images are the changing areas in the front and back time series images, and each pixel in the front and rear time series images represents a Category (unchanged or changed).

将各前后时序影像以及其对应的实际变化影像组成城市建筑物自动化检测影像数据集，在该数据集中按照8∶2的比例划分训练集和测试集。The front and back time series images and their corresponding actual change images are composed of urban building automatic detection image data set, in which the training set and test set are divided according to the ratio of 8:2.

在一些可选的实施方案中，所述编码器包括权重共享的双通道孪生网络和孪生交叉注意力模块，所述解码器包括多尺度特征融合以及差分上下文判别模块。In some optional implementations, the encoder includes a weight-sharing dual-channel Siamese network and a Siamese cross-attention module, and the decoder includes a multi-scale feature fusion and differential context discrimination module.

在本实施方案中，编码器中的权重共享的双通道孪生网络采用多尺度密集连接UNet进行实现，该网络包含跳跃连接，能够充分提取低级特征和高级特征。编码器中的孪生交叉注意力模块结合Transformer多头注意力机制，孪生交叉注意力模块首先独立地对双时相影像进行嵌入操作，获取到对应的多阶段嵌入式令牌。通过多头注意力机制将特征信息进一步划分为查询队列、查询向量以及查询值，Sigmoid函数进一步激活关注到的特征信息，多层感知机块有效地降低网络的时间复杂度，最终使得注意力通道分别关注于影像中的变化区域以及未变化区域，同时将影像信息划分为滑动窗口进行自注意力计算，提高网络对全局信息的建模能力。In this implementation, the weight-sharing dual-channel Siamese network in the encoder is implemented using a multi-scale densely connected UNet, which contains skip connections and can fully extract low-level features and high-level features. The twin cross-attention module in the encoder is combined with the Transformer multi-head attention mechanism. The twin cross-attention module first independently embeds the bitemporal images to obtain the corresponding multi-stage embedded tokens. The feature information is further divided into query queue, query vector and query value through the multi-head attention mechanism, the Sigmoid function further activates the feature information concerned, and the multi-layer perceptron block effectively reduces the time complexity of the network, and finally makes the attention channels respectively Focus on the changed area and the unchanged area in the image, and divide the image information into sliding windows for self-attention calculation to improve the network's ability to model global information.

所述解码器中的多尺度融合模块利用多尺度特征融合技术将编码器中提取到的多阶段嵌入式令牌与富含上下文信息的通道注意力输出进行融合，再使用上采样操作融合特征，这使得网络能够最大程度地还原原始影像信息，降低网络的漏检率。其次，利用多尺度特征融合技术将嵌入式令牌与信道变压器的富上下文输出进行融合。然后，对提取的多尺度信息内容进行上采样融合，最大限度地恢复原始影像信息；所述解码器中的差分上下文判别模块输入为所述解码器中的多尺度融合模块的输出影像以及前后时序差分影像，目的是结合影像中的上下文信息，提高网络的判别能力，使得检测结果影像更加趋近于真实变化影像，从而提高检测准确度。The multi-scale fusion module in the decoder uses multi-scale feature fusion technology to fuse the multi-stage embedded tokens extracted in the encoder with the channel attention output rich in context information, and then uses the upsampling operation to fuse the features, This enables the network to restore the original image information to the greatest extent and reduce the missed detection rate of the network. Second, the embedded tokens are fused with the rich context output of the channel transformer using multi-scale feature fusion technique. Then, the extracted multi-scale information content is up-sampled and fused to maximize the recovery of the original image information; the input of the differential context discrimination module in the decoder is the output image of the multi-scale fusion module in the decoder and the front and back timing The purpose of the difference image is to combine the context information in the image to improve the discriminative ability of the network, so that the detection result image is closer to the real change image, thereby improving the detection accuracy.

在一些可选的实施方案中，步骤S2中权重共享的双通道孪生网络对输入双时相影像进行批处理归一化操作，包括卷积核3、步长为1的二维卷积、二维 BatchNorm和输出通道数为64的ReLU激活函数，然后通过3个下采样块提取特征信息，定义x^i，j为下采样块的输出节点，下采样块的目标函数为：In some optional implementations, the dual-channel Siamese network with weight sharing in step S2 performs batch normalization operations on the input dual-temporal images, including convolution kernel 3, two-dimensional convolution with a step size of 1, two-dimensional Dimension BatchNorm and the ReLU activation function with 64 output channels, then extract feature information through 3 downsampling blocks, define x ^{i, j} as the output nodes of the downsampling block, and the objective function of the downsampling block is:

其中，N(·)表示嵌套的卷积函数，D(·)表示下采样层，U(·)表示上采样层， []表示特征连接函数，x^i，j代表输出特征图，i代表层数，j代表该层的第j个卷积层，k代表第k个连接层；最终，孪生网络通道输出四种多尺度特征信息。Among them, N(·) represents the nested convolution function, D(·) represents the downsampling layer, U(·) represents the upsampling layer, [] represents the feature connection function, x ^{i, j} represents the output feature map, i represents The number of layers, j represents the jth convolutional layer of the layer, and k represents the kth connection layer; finally, the Siamese network channel outputs four kinds of multi-scale feature information.

在一些可选的实施方案中，步骤S2中的孪生交叉注意力模块对双通道孪生网络的四个输出进行嵌入操作，首先进行一次2D卷积提取特征，然后将特征展开成二维序列T₁，T₂，T₃以及T₄，其patch大小分别为32、16、8和4，将T₁-T₄进行合并得到T_∑，然后利用多头交叉注意力机制进行处理，第一阶段的目标函数为：In some optional embodiments, the Siamese cross-attention module in step S2 performs an embedding operation on the four outputs of the dual-channel Siamese network. First, a 2D convolution is performed to extract features, and then the features are expanded into a two-dimensional sequence T ₁ , T ₂ , T ₃ and T ₄ , whose patch sizes are 32, 16, 8 and 4 respectively, merge T ₁ -T ₄ to get T _∑ , and then use the multi-head cross-attention mechanism for processing, the goal of the first stage The function is:

其中，

W_K和W_V为不同输入的权重系数，T_l表示特征信息令牌，l表示第l个尺度的特征信息，T_∑表示四个令牌的特征联合，得到查询向量Q_u，查询键 K，查询值V，l＝1，2，3，4，u＝1，2，3，4；in,

W _K and W _V are the weight coefficients of different inputs, T _l represents the feature information token, l represents the feature information of the lth scale, T _∑ represents the feature union of four tokens, and the query vector _Qu is obtained, and the query key K , query value V, l=1, 2, 3, 4, u=1, 2, 3, 4;

第二阶段的目标函数为：The objective function of the second stage is:

其中，σ(·)和

分别表示softmax函数和实例规范化函数，C_∑表示通道数的总和；Among them, σ(·) and

Represents the softmax function and the instance normalization function, respectively, and C _∑ represents the sum of the number of channels;

多头交叉注意力第三阶段的目标函数为：The objective function of the third stage of multi-head cross attention is:

其中，CA_h表示多头交叉注意力第二阶段的输出，h表示第h个注意力头的输出，N为注意力头的个数；Among them, CA _h represents the output of the second stage of multi-head cross attention, h represents the output of the hth attention head, and N is the number of attention heads;

多头交叉注意力最终阶段的目标函数为：The objective function of the final stage of multi-head cross attention is:

O_r＝MCA_p+MLP(Q_u+MCA_p)O _r ＝MCA _p +MLP(Q _u +MCA _p )

确定多头交叉注意力最终的输出，其中，MCA_p表示多头交叉注意力第三阶段的输出，p表示第p个输出，MLP(·)为多层感知机函数，Q_u表示查询向量， u表示第u个查询向量。Determine the final output of multi-head cross attention, where MCA _p represents the output of the third stage of multi-head cross attention, p represents the pth output, MLP( ) is the multi-layer perceptron function, Qu represents the query vector, _u represents uth query vector.

在一些可选的实施方案中，步骤S2中，多尺度特征融合模块的目标函数为：In some optional embodiments, in step S2, the objective function of the multi-scale feature fusion module is:

M_i＝W₁·V(T_l)+W₂·V(O_r)M _i =W ₁ ·V(T _l )+W ₂ ·V(O _r )

其中，W₁和W₂是两个线性层的权重参数，T_l表示特征信息令牌，l表示第l个尺度的特征信息，O_r表示多头交叉注意力模块的输出，r表示第r个注意力头的输出。Among them, W ₁ and W ₂ are the weight parameters of two linear layers, T _l represents the feature information token, l represents the feature information of the l-th scale, O _r represents the output of the multi-head cross-attention module, r represents the r-th The output of the attention head.

在一些可选的实施方案中，步骤S2中，所述差分上下文判别模块包含生成器和判别器，生成器中接收两个输入，多尺度特征融合模块最后一层的得到的检测影像以及第一和第二时相做差分运算得到的生成影像，计算二者的损失以推动结果更加接近于实际变化影像，在生成器中采用SCAD和最小二乘LSGAN损失函数的加权和作为损失函数，降低模型的误监率；在判别器中采用最小二乘 LSGAN损失函数，改善检测精度，将生成器和判别器的损失函数累加得到最终的概率损失。In some optional implementations, in step S2, the differential context discrimination module includes a generator and a discriminator, and the generator receives two inputs, the detected image obtained from the last layer of the multi-scale feature fusion module and the first The generated image is obtained by difference operation with the second phase, and the loss of the two is calculated to push the result closer to the actual changing image. In the generator, the weighted sum of SCAD and the least squares LSGAN loss function is used as the loss function to reduce the model The false positive rate; the least squares LSGAN loss function is used in the discriminator to improve the detection accuracy, and the loss function of the generator and the discriminator is accumulated to obtain the final probability loss.

在一些可选的实施方案中，步骤S2中，所述差分上下文判别模块的目标函数为：In some optional implementations, in step S2, the objective function of the differential context discrimination module is:

L(P)＝L(D)+L(G)L(P)=L(D)+L(G)

L(D)＝L_LSGAN(D)L(D)＝L _LSGAN (D)

L(G)＝L_LSGAN(D)+αL_SCAD L(G)＝L _LSGAN (D)+αL _SCAD

其中，L(P)表示概率损失，L(D)表示判别器损失，L(G)表示生成器损失， L_LSGAN(D)表示判别器的最小二乘LSGAN损失，L_LSGAN(G)表示生成器的最小二乘LSGAN损失，L_SCAD表示SCAD损失。Among them, L(P) represents the probability loss, L(D) represents the discriminator loss, L(G) represents the generator loss, L _LSGAN (D) represents the least squares LSGAN loss of the discriminator, L _LSGAN (G) represents the generated LSCAD is the least squares LSGAN loss of the filter, and L _SCAD is the SCAD loss.

在一些可选的实施方案中，定义SCAD损失为：In some optional implementations, the SCAD loss is defined as:

其中，C表示检测种类，v(c)表示检测种类的像素误差值，J_C为损失项，ρ为连续优化的参数，v(c)的定义如下：Among them, C represents the detection type, v(c) represents the pixel error value of the detection type, J _C is the loss item, ρ is the parameter for continuous optimization, and v(c) is defined as follows:

其中，y_i为实际变化影像，s_g(c)为检测分数，g表示第g个像素。Among them, y _i is the actual change image, s _g (c) is the detection score, and g represents the gth pixel.

在一些可选的实施方案中，最小二乘LSGAN损失为：In some alternative implementations, the least squares LSGAN loss is:

其中，D(x₁，y)及D(x₁，G(x₁))表示判别器对第一时相影像的输出，G(x₁)表示生成器对第一时相影像的输出，D(x₂，y)及D(x₂，G(x₂))表示判别器对第二时相影像的输出，G(x₂)表示生成器对第二时相影像的输出，

及

表示第一时相影像的检测期望，

及

表示第二时相影像的检测期望，x₁，x₂分别表示判别器输入的第一和第二时相影像，y表示实际变化影像。Among them, D(x ₁ , y) and D(x ₁ , G(x ₁ )) represent the output of the discriminator to the first phase image, G(x ₁ ) represents the output of the generator to the first phase image, D(x ₂ , y) and D(x ₂ , G(x ₂ )) represent the output of the discriminator to the second phase image, G(x ₂ ) represents the output of the generator to the second phase image,

and

Denotes the detection expectation of the first-phase image,

and

Indicates the detection expectation of the second time-phase image, x ₁ and x ₂ respectively represent the first and second time-phase images input by the discriminator, and y represents the actual change image.

其中，

表示第一时相影像的检测期望，

表示第二时相影像的检测期望，D(x₁，G(x₁))表示判别器对第一时相影像的输出，G(x₁)表示生成器对第一时相影像的输出，D(x₂，G(x₂))表示判别器对第二时相影像的输出， G(x₂)表示生成器对第二时相影像的输出，x₁，x₂分别表示判别器输入的第一和第二时相影像。in,

Denotes the detection expectation of the first-phase image,

Denotes the detection expectation of the second phase image, D(x ₁ , G(x ₁ )) represents the output of the discriminator for the first phase image, G(x ₁ ) represents the output of the generator for the first phase image, D(x ₂ , G(x ₂ )) represents the output of the discriminator to the second time-phase image, G(x ₂ ) represents the output of the generator to the second time-phase image, x ₁ , x ₂ represent the input of the discriminator respectively The first and second phase images of .

总体而言，通过本发明所构思的以上技术方案与现有技术相比，能够取得下列有益效果：以深度卷积神经网络为基础，构造由编码器和解码器所组成的建筑物自动化检测模型，能够有效地判别和融合双时相影像中的多尺度特征信息，有效地提高了建筑物变化检测准确度。最终，只需要将双时相影像输入到训练好的模型中，即可自动检测城市建筑物的变化情况。Generally speaking, compared with the prior art, the above technical solutions conceived by the present invention can achieve the following beneficial effects: Based on the deep convolutional neural network, construct an automatic building detection model composed of an encoder and a decoder , which can effectively discriminate and fuse multi-scale feature information in bitemporal images, and effectively improve the accuracy of building change detection. Eventually, only the bitemporal images need to be fed into the trained model to automatically detect changes in urban buildings.

附图说明Description of drawings

图1是本发明实施例提供的一种城市建筑物自动化检测系统及方法的流程示意图；Fig. 1 is a schematic flow diagram of an urban building automatic detection system and method provided by an embodiment of the present invention;

图2是本发明实施例提供的一种建筑物自动化检测模型示意图；Fig. 2 is a schematic diagram of a building automatic detection model provided by an embodiment of the present invention;

图3是本发明实施例提供的一种多头交叉注意力机制网络结构图；FIG. 3 is a network structure diagram of a multi-head cross-attention mechanism provided by an embodiment of the present invention;

图4是本发明实施例提供的一种不同方法的检测对比图。Fig. 4 is a detection comparison diagram of a different method provided by an embodiment of the present invention.

具体实施方式detailed description

下面结合附图和实施例对本发明作进一步详细的说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.

采用多尺度密集连接UNet网络提取双时相影像中的丰富地特征信息；孪生注意力机制分别关注与双时相影像中的变化区域以及未变化区域，加强特征信息的表示，并提升信息的全局关注性；采用多尺度特征融合模块渐进式地融合各个尺度的特征信息；同时，通过差分上下文判别模块计算生成器与判别器的加权和作为概率损失，从而推动检测结果接近于真实变化影像。并采用8个评价指标来评价本发明的性能，包括精确率(Precision)，召回率(Recall)，综合评价指标(F1-score)，交集(IOU)，未变化交集(IOU_0)，变化交集(IOU_1)，总体精度(OA)，Kappa系数(Kappa)作为评价指标。下面结合附图和实施例对本发明作进一步详细的说明。The multi-scale dense connection UNet network is used to extract rich feature information in bitemporal images; the twin attention mechanism pays attention to the changed areas and unchanged areas in bitemporal images, strengthens the representation of feature information, and improves the overall information Attention: the multi-scale feature fusion module is used to gradually fuse the feature information of each scale; at the same time, the weighted sum of the generator and the discriminator is calculated by the differential context discrimination module as the probability loss, so as to promote the detection result close to the real change image. And adopt 8 evaluation indexes to evaluate the performance of the present invention, comprise precision rate (Precision), recall rate (Recall), comprehensive evaluation index (F1-score), intersection (IOU), unchanged intersection (IOU_0), change intersection ( IOU_1), overall accuracy (OA), and Kappa coefficient (Kappa) are used as evaluation indicators. The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.

图1所示是本发明实施例提供的一种用于城市建筑物自动化检测系统及方法的流程示意图，具体包含以下步骤：Figure 1 is a schematic flow diagram of a system and method for automatic detection of urban buildings provided by an embodiment of the present invention, which specifically includes the following steps:

S1：数据集构建：采用遥感卫星采集到的城市建筑物影像构建数据集,获取数据集中各个建筑物对应的实际变化影像，将实际变化影像与对应的双时相影像作为数据集；S1: Dataset construction: use the images of urban buildings collected by remote sensing satellites to construct a data set, obtain the actual change images corresponding to each building in the data set, and use the actual change images and the corresponding dual-temporal images as a data set;

构建合理的建筑物变化检测数据集能够有效提升模型的检测精度。在本发明实施例的实验中，使用LEVIR-CD数据集，其包含来自20个地区的各种各样的建筑影像。每张影像的原始尺寸为1024×1024像素，空间分辨率为0.5m。考虑到GPU内存容量的限制，采用图像分割算法将每张影像切割成16个大小为256 ×256像素的区域影像，最终获得4450幅前后时序影像对。本发明采用专业的计算机视觉标注软件，对城市建筑物影像进行标注。对于每一对前后时序影像，获取其对应的实际变化影像ground truth，实际变化影像中的每一个像素点代表一种类别，其中，实际变化影像中的类标签用0和1表示，0代表未变化区域(可以显示为黑色)，1代表变化区域(可以显示为白色)。Constructing a reasonable building change detection dataset can effectively improve the detection accuracy of the model. In the experiments of the embodiments of the present invention, the LEVIR-CD dataset is used, which contains a wide variety of architectural images from 20 regions. The original size of each image is 1024×1024 pixels, and the spatial resolution is 0.5m. Considering the limitation of GPU memory capacity, an image segmentation algorithm was used to cut each image into 16 regional images with a size of 256 × 256 pixels, and finally obtained 4450 time-series image pairs before and after. The invention adopts professional computer vision labeling software to label images of urban buildings. For each pair of time-series images before and after, the ground truth corresponding to the actual change image is obtained. Each pixel in the actual change image represents a category, where the class labels in the actual change image are represented by 0 and 1, and 0 represents no Change area (can be displayed as black), 1 represents change area (can be displayed as white).

经过以上处理得到前后时序影像以及其对应的实际变化影像，将各前后时序影像以及其对应的实际变化影像组成城市建筑物自动化检测影像数据集，在该数据集中按照8∶2的比例划分训练集(共3560幅影像)和测试集(共890幅影像)。After the above processing, the front and rear time-series images and their corresponding actual change images are obtained, and each front and rear time-series images and their corresponding actual change images are composed of urban building automatic detection image data sets, and the training set is divided into the training set according to the ratio of 8:2. (a total of 3560 images) and a test set (a total of 890 images).

S2：建筑物自动化检测模型构建：构建由编码器和解码器组成的孪生交叉注意力判别网络作为建筑物自动化检测模型；S2: Building automatic detection model construction: Construct a twin cross-attention discriminant network composed of an encoder and a decoder as a building automatic detection model;

如图2所示，本发明实施例的建筑物自动化检测模型包括两个主要模块：编码器和解码器。其中，编码器包含权重共享的双通道孪生网络和孪生交叉注意力模块，解码器包含多尺度特征融合以及差分上下文判别模块。As shown in Fig. 2, the building automation detection model of the embodiment of the present invention includes two main modules: an encoder and a decoder. Among them, the encoder includes a weight-sharing dual-channel Siamese network and a Siamese cross-attention module, and the decoder includes a multi-scale feature fusion and differential context discrimination module.

编码器负责提取输入影像中的多尺度特征信息以及高层语义信息。解码器将提取到多尺度特征进行渐进式融合，并且结合上下文差分信息计算概率损失，不断推进结果图接近于Ground Truth。The encoder is responsible for extracting multi-scale feature information and high-level semantic information in the input image. The decoder will extract the multi-scale features for progressive fusion, and calculate the probability loss in combination with the context difference information, so as to continuously push the result map close to the Ground Truth.

如图2中(a)所示，首先采用权重共享的双通道孪生网络对输入双时相影像进行批处理归一化操作，包括卷积核3、步长为1的二维卷积、二维BatchNorm 和输出通道数为64的ReLU激活函数。然后通过下采样块提取特征信息，定义x^i，j为下采样块的输出节点，下采样块的目标函数为：As shown in (a) of Figure 2, firstly, a weight-sharing dual-channel Siamese network is used to perform batch normalization operations on the input dual-temporal images, including convolution kernel 3, two-dimensional convolution with a step size of 1, two-dimensional Dimension BatchNorm and ReLU activation function with 64 output channels. Then extract feature information through the downsampling block, define x ^{i, j} as the output node of the downsampling block, and the objective function of the downsampling block is:

其中，N(·)表示嵌套的卷积函数，D(·)表示下采样层，U(·)表示上采样层，[]表示特征连接函数，x^i，j代表输出特征图，i代表层数，j代表该层的第j个卷积层， k代表第k个连接层。为了更好地描述网络参数，定义三个下采样块的输出通道数分别为128、256和512。最终，孪生网络通道输出四种多尺度特征信息。Among them, N(·) represents the nested convolution function, D(·) represents the downsampling layer, U(·) represents the upsampling layer, [] represents the feature connection function, x ^{i, j} represents the output feature map, i represents The number of layers, j represents the jth convolutional layer of the layer, and k represents the kth connection layer. In order to better describe the network parameters, the number of output channels of the three downsampling blocks is defined as 128, 256 and 512, respectively. Finally, Siamese network channels output four kinds of multi-scale feature information.

如图2中(b)所示，孪生交叉注意力模块对权重共享的双通道孪生网络的四个输出进行嵌入操作，首先进行一次2D卷积提取特征，然后将特征展开成二维序列T₁，T₂，T₃以及T₄，其patch大小分别为32、16、8和4。将T₁-T₄进行合并得到T_∑。As shown in (b) of Figure 2, the Siamese cross-attention module embeds the four outputs of the weight-shared dual-channel Siamese network. First, a 2D convolution is performed to extract features, and then the features are expanded into a two-dimensional sequence T ₁ , T ₂ , T ₃ and T ₄ , whose patch sizes are 32, 16, 8 and 4, respectively. Combine T ₁ -T ₄ to get T _∑ .

如图3所示，孪生交叉注意力模块模块中利用多头交叉注意力机制提取更深层次的变化特征语义信息，提高对特征信息的全局关注度。多头交叉注意力第一阶段的目标函数为：As shown in Figure 3, the twin cross-attention module uses the multi-head cross-attention mechanism to extract deeper semantic information of changing features and improve the global attention to feature information. The objective function of the first stage of multi-head cross attention is:

Q_u＝T_lW_Qi，K＝T_∑W_K，V＝T_∑W_V Q _u ＝T _l W _Qi , K＝T _∑ W _K , V＝T _∑ W _V

其中，

W_K和W_V为不同输入的权重系数，T_l表示特征信息令牌，l表示第 l个尺度的特征信息，T_∑表示四个令牌的特征联合。得到查询向量Q_u(u＝1，2，3，4)，查询键K，查询值V。四个查询向量的通道数分别为[64，128，256，512]。in,

W _K and W _V are the weight coefficients of different inputs, T _l represents the feature information token, l represents the feature information of the l-th scale, T _∑ represents the feature union of four tokens. A query vector Q _u (u=1, 2, 3, 4), a query key K, and a query value V are obtained. The channel numbers of the four query vectors are [64, 128, 256, 512], respectively.

由于全局注意力机制会导致网络的时间复杂度较大，采用转置注意力机制降低网络的计算量。其中

和V^T分别为查询向量Q_u和查询值V的转置。因此多头交叉注意力第二阶段的目标函数为：Since the global attention mechanism will lead to a large time complexity of the network, the transposed attention mechanism is used to reduce the computational load of the network. in

and V ^T are the transpose of query vector Q _u and query value V respectively. Therefore, the objective function of the second stage of multi-head cross attention is:

确定多头交叉注意力第二阶段的输出，其中，σ(·)和

分别表示softmax 函数和实例规范化函数，

W_K和

为不同输入的权重系数，

表示特征信息令牌，l表示第l个尺度的特征信息，T_∑表示四个令牌的特征联合。C_∑表示通道数的总和。Determine the output of the second stage of multi-head cross attention, where σ( ) and

represent the softmax function and the instance normalization function, respectively,

W _K and

are the weight coefficients for different inputs,

Represents the feature information token, l represents the feature information of the l-th scale, and T _∑ represents the feature union of four tokens. C _Σ represents the sum of the number of channels.

其中，CA_h表示多头交叉注意力第二阶段的输出(h＝1，2，3，4)，h表示第h个注意力头的输出，N为注意力头的个数，经过实验证明，当设置N为4时网络的检测效果最好。Among them, CA _h represents the output of the second stage of multi-head cross attention (h=1, 2, 3, 4), h represents the output of the hth attention head, and N is the number of attention heads. It has been proved by experiments that When N is set to 4, the detection effect of the network is the best.

O_r＝MCA_p+MLP(Q_u+MCA_p)O _r ＝MCA _p +MLP(Q _u +MCA _p )

确定多头交叉注意力最终的输出，其中，MCA_p表示多头交叉注意力第三阶段的输出，p表示第p个输出，MLP(·)为多层感知机函数，Q_u表示查询向量，u表示第u个查询向量(u＝1，2，3，4)。最终得到四个输出O₁，O₂，O₃和O₄。Determine the final output of multi-head cross-attention, where MCA _p represents the output of the third stage of multi-head cross-attention, p represents the pth output, MLP( ) is the multi-layer perceptron function, Qu represents the query vector, and _u represents The uth query vector (u=1, 2, 3, 4). Finally four outputs O ₁ , O ₂ , O ₃ and O ₄ are obtained.

如图2(c)所示，所述多尺度特征融合模块采用重建与上采样块的双重渐进式融合策略，将提取到的含有丰富多尺度语义信息的特征进行融合。重建策略首先将孪生交叉注意力模块中的四个嵌入式令牌T₁，T₂，T₃和T₄以及多头交叉注意力机制中的四个输出O₁，O₂，O₃和O₄进行融合。As shown in Figure 2(c), the multi-scale feature fusion module adopts a double progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information. The reconstruction strategy first combines the four embedded tokens T ₁ , T ₂ , T ₃ and T ₄ in the Siamese cross-attention module and the four outputs O ₁ , O ₂ , O ₃ and O ₄ in the multi-head cross-attention mechanism Perform fusion.

重建策略中的目标函数为：The objective function in the reconstruction strategy is:

M_i＝W₁·V(T_l)+W₂·V(O_r)M _i =W ₁ ·V(T _l )+W ₂ ·V(O _r )

其中，W₁和W₂是两个线性层的权重参数，T_l表示特征信息令牌，l表示第l个尺度的特征信息，O_r表示多头交叉注意力模块的输出，r表示第r个注意力头的输出(r＝1，2，3，4)。得到四个输出M₁，M₂，M₃和M₄。Among them, W ₁ and W ₂ are the weight parameters of two linear layers, T _l represents the feature information token, l represents the feature information of the l-th scale, O _r represents the output of the multi-head cross-attention module, r represents the r-th The output of the attention head (r = 1, 2, 3, 4). Four outputs M ₁ , M ₂ , M ₃ and M ₄ are obtained.

为了更好地融合多尺度特征信息，对上述四个输出进行上采样块操作，四个上采样块的输出通道分别为256、128、64、64。上采样块包含卷积核大小为2 的二维卷积，平均池化层以及激活函数ReLu。最后，我们对第四个上采样块的输出结果进行一次卷积核为1，步长为1的一维卷积，得到检测影像。In order to better integrate multi-scale feature information, the above four outputs are subjected to upsampling block operation, and the output channels of the four upsampling blocks are 256, 128, 64, and 64, respectively. The upsampling block consists of a 2D convolution with a kernel size of 2, an average pooling layer, and an activation function ReLu. Finally, we perform a one-dimensional convolution with a convolution kernel of 1 and a step size of 1 on the output of the fourth upsampling block to obtain the detection image.

如图2(d)所示，所述差分上下文判别模块包含生成器和判别器。生成器中接收两个输入，多尺度特征融合模块最后一层的得到的检测影像以及第一和第二时相做差分运算得到的生成影像。计算二者的损失以推动结果更加接近于实际变化影像。在生成器中采用SCAD和最小二乘LSGAN损失函数的加权和作为损失函数，降低模型的误监率。在判别器中采用最小二乘LSGAN损失函数，改善检测精度。将生成器和判别器的损失函数累加得到最终的概率损失。差分上下文判别模块的目标函数为：As shown in Figure 2(d), the differential context discrimination module includes a generator and a discriminator. The generator receives two inputs, the detected image obtained by the last layer of the multi-scale feature fusion module and the generated image obtained by the differential operation of the first and second phases. The loss of both is calculated to push the result closer to the actual changing image. In the generator, the weighted sum of SCAD and least squares LSGAN loss function is used as the loss function to reduce the false detection rate of the model. The least squares LSGAN loss function is used in the discriminator to improve the detection accuracy. The loss functions of the generator and the discriminator are accumulated to obtain the final probability loss. The objective function of the differential context discrimination module is:

L(P)＝L(D)+L(G)L(P)=L(D)+L(G)

L(D)＝L_LSGAN(D)L(D)＝L _LSGAN (D)

L(G)＝L_LSGAN(D)+αL_SCAD L(G)＝L _LSGAN (D)+αL _SCAD

定义SCAD损失为：Define the SCAD loss as:

确定SCAD损失，其中，C表示检测种类，v(c)表示检测种类的像素误差值， J_C为损失项，ρ为连续优化的参数。v(c)的定义如下：Determine the SCAD loss, where C represents the detection type, v(c) represents the pixel error value of the detection type, J _C is the loss term, and ρ is a parameter for continuous optimization. v(c) is defined as follows:

本发明中的所述的判别器最小二乘LSGAN损失为：The discriminator least squares LSGAN loss described in the present invention is:

确定判别器的最小二乘LSGAN损失，其中，D(x₁，y)及D(x₁，G(x₁))表示判别器对第一时相影像的输出，G(x₁)表示生成器对第一时相影像的输出，D(x₂，y)及 D(x₂，G(x₂))表示判别器对第二时相影像的输出，G(x₂)表示生成器对第二时相影像的输出，

及

表示第一时相影像的检测期望，

及

表示第二时相影像的检测期望，x₁，x₂分别表示判别器输入的第一和第二时相影像，y表示实际变化影像。Determine the least squares LSGAN loss of the discriminator, where D(x ₁ , y) and D(x ₁ , G(x ₁ )) represent the output of the discriminator for the first phase image, and G(x ₁ ) represents the generated D(x ₂ , y) and D(x ₂ , G(x ₂ )) represent the output of the discriminator to the second phase image, and G(x ₂ ) represents the output of the generator to the The output of the second phase image,

and

Denotes the detection expectation of the first-phase image,

and

本发明中的所述的生成器最小二乘LSGAN损失为：The generator least squares LSGAN loss described in the present invention is:

确定生成器的最小二乘LSGAN损失，其中，

表示第一时相影像的检测期望，

表示第二时相影像的检测期望，D(x₁，G(x₁))表示判别器对第一时相影像的输出，G(x₁)表示生成器对第一时相影像的输出，D(x₂，G(x₂)) 表示判别器对第二时相影像的输出，G(x₂)表示生成器对第二时相影像的输出， x₁，x₂分别表示判别器输入的第一和第二时相影像。Determine the least squares LSGAN loss for the generator, where,

Denotes the detection expectation of the first-phase image,

Denotes the detection expectation of the second phase image, D(x ₁ , G(x ₁ )) represents the output of the discriminator for the first phase image, G(x ₁ ) represents the output of the generator for the first phase image, D(x ₂ , G(x ₂ )) represents the output of the discriminator to the second time phase image, G(x ₂ ) represents the output of the generator to the second time phase image, x ₁ , x ₂ represent the input of the discriminator respectively The first and second phase images of .

因此，差分上下文判别模块的目标函数为：Therefore, the objective function of the differential context discriminant module is:

L(P)＝L(D)+L(G)L(P)=L(D)+L(G)

L(D)＝L_LSGAN(D)L(D)＝L _LSGAN (D)

L(G)＝L_LSGAN(D)+αL_SCAD L(G)＝L _LSGAN (D)+αL _SCAD

其中，L(P)表示概率损失，L(D)表示判别器损失，L(G)表示生成器损失，L_LSGAN(D)表示判别器的最小二乘LSGAN损失，L_LSGAN(G)表示生成器的最小二乘LSGAN损失，L_SCAD表示SCAD损失。α为权重参数，控制两种损失之间的相对重要性。在该目标函数的辅助下，生成器和判别器循环迭代产生概率损失，直到概率损失低于设定阈值后输出检测结果。Among them, L(P) represents the probability loss, L(D) represents the discriminator loss, L(G) represents the generator loss, L _LSGAN (D) represents the least squares LSGAN loss of the discriminator, and L _LSGAN (G) represents the generated LSCAD is the least squares LSGAN loss of the filter, and L _SCAD is the SCAD loss. α is a weight parameter that controls the relative importance between the two losses. With the aid of the objective function, the generator and the discriminator iteratively generate probability loss until the probability loss is lower than the set threshold and output the detection result.

S3：利用S1中的训练集对S2中的建筑物自动化检测模型进行训练，利用训练好的模型实现建筑物变化检测，最后利用建筑物自动化检测模型评价指标对检测结果进行评价；S3: Use the training set in S1 to train the building automatic detection model in S2, use the trained model to realize building change detection, and finally use the building automatic detection model evaluation index to evaluate the detection results;

使用本发明所提出的网络结构在所述S1步骤中构建的LEVIR-CD数据集上进行训练，获得模型权重用于模型评价。训练过程基于PyTorch深度学习框架，软件环境为Ubuntu20.04，硬件环境为3090显卡、显存为24GB。将batchsize设置为8，总共训练次数为100epoch。每次输入包含三幅影像：第一时相影像、第二时相影像以及实际变化影像，一次训练后进行一次测试，网络训练过程中不断学习双时相影像以及真实变化影像中的城市建筑物变化信息。循环迭代直到 epoch达到100，则训练结束。Use the network structure proposed by the present invention to train on the LEVIR-CD data set constructed in the S1 step, and obtain model weights for model evaluation. The training process is based on the PyTorch deep learning framework, the software environment is Ubuntu20.04, the hardware environment is a 3090 graphics card, and the video memory is 24GB. Set the batchsize to 8, and the total number of training times is 100epoch. Each input contains three images: the first temporal image, the second temporal image, and the actual change image. One test is performed after one training session. During the network training process, the urban buildings in the dual temporal image and the real change image are continuously learned. change information. The loop iterates until the epoch reaches 100, then the training ends.

选取精确率(Precision)，召回率(Recall)，综合评价指标(F1-score)交集(IOU)，未变化交集(IOU_0)，总体精度(OA)变化交集(IOU_1)，Kappa 系数(Kappa)作为评价指标，及其评价指标计算公式如下：Select precision rate (Precision), recall rate (Recall), comprehensive evaluation index (F1-score) intersection (IOU), unchanged intersection (IOU_0), overall accuracy (OA) change intersection (IOU_1), Kappa coefficient (Kappa) as The evaluation index and its evaluation index calculation formula are as follows:

为了验证本发明所提的建筑物自动化检测模型的性能，本发明给出了最终的实验结果，图4为各种方法的视觉对比图，表1为各种方法的量化指标。In order to verify the performance of the building automation detection model proposed in the present invention, the present invention provides the final experimental results. Figure 4 is a visual comparison diagram of various methods, and Table 1 is the quantitative indicators of various methods.

其中，图4表示各种方法得到的建筑物检测结果影像。(a)为前序影像，(b) 为后序影像，(c)为实际变化影像(Ground Truth,GT)，(d)-(g)为不同方法的检测结果影像。通过比较实际变化影像，黑色代表未变化区域，白色代表变化区域，红色代表误检测区域，绿色代表漏检测区域。Among them, FIG. 4 shows images of building detection results obtained by various methods. (a) is the previous image, (b) is the subsequent image, (c) is the actual change image (Ground Truth, GT), (d)-(g) are the detection result images of different methods. By comparing the actual changed images, black represents the unchanged area, white represents the changed area, red represents the false detection area, and green represents the missing detection area.

表1：各种方法在LEVIR-CD数据集建筑物检测精度Table 1: Building detection accuracy of various methods in the LEVIR-CD dataset

请注意，所有指标的单位均为百分比，且数值越大，效果越好。为了便于观察，将最好的结果进行加粗表示。Note that all metrics are in percentages and that higher numbers are better. For ease of observation, the best results are bolded.

需要指出，根据实施的需要，可将本申请中描述的各个步骤/部件拆分为更多步骤/部件，也可将两个或多个步骤/部件或者步骤/部件的部分操作组合成新的步骤/部件，以实现本发明的目的。It should be pointed out that according to the needs of implementation, each step/component described in this application can be split into more steps/components, and two or more steps/components or part of the operations of steps/components can also be combined into a new Step/component, to realize the object of the present invention.

本领域的技术人员容易理解，以上所述仅为本发明的较佳实施例而以，并不以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本发明的保护范围内。It is easy for those skilled in the art to understand that the above description is only for preferred embodiments of the present invention, and is not intended to limit the present invention, and any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention , are all included in the protection scope of the present invention.

Claims

1. A building change detection method for urban dynamic monitoring, characterized in that, comprising the following steps:

S1, using the images of urban buildings collected by remote sensing satellites as a data set, obtaining the actual change images corresponding to each building in the data set, and dividing the actual change images and the corresponding dual-temporal images into a training set and a test set;

S2. Build an automatic building detection model consisting of an encoder and a decoder. The encoder includes a weight-sharing dual-channel Siamese network and a Siamese cross-attention module. The decoder includes a multi-scale feature fusion and differential context discrimination module. ;

The weight-shared dual-channel Siamese network includes a batch normalization layer and multiple upsampling blocks, inputting bitemporal images for obtaining feature maps of different scales;

The twin cross-attention module first performs embedding operations on feature maps of different scales, and then uses the multi-head cross-attention mechanism to extract deeper semantic information of changing features to improve global attention to feature information;

The multi-scale feature fusion module adopts a double progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information;

The input of the differential context discrimination module is the output image of the multi-scale fusion module and the front and rear temporal difference images. The purpose is to combine the context information in the image to improve the discrimination ability of the network, so that the detection result image is closer to the real change image, thereby Improve detection accuracy;

S3, using the training set in S1 to train the building automatic detection model in S2, and using the trained model to realize building change detection.

2. The method according to claim 1, wherein step S1 comprises:

Using artificially produced urban building change images as a data set, the actual change images are produced according to the bitemporal images in the dataset, where the actual change images are the change areas in the bitemporal images, and each pixel in the actual change images Represents a class, as unchanged or changed;

The before and after time-series images and their corresponding actual change images are composed of urban building automatic detection image data set, in which the training set and test set are divided according to the ratio of 8:2.

3. The method according to claim 1, characterized in that: in step S2, the weight-shared dual-channel Siamese network performs a batch normalization operation on the input dual-temporal images, including a convolution kernel 3 and a step size of 1 Two-dimensional convolution, two-dimensional BatchNorm and ReLU activation function with 64 output channels, then extract feature information through three downsampling blocks, define x ^{i, j} as the output node of the downsampling block, and the objective function of the downsampling block is :

Among them, N(·) represents the nested convolution function, D(·) represents the downsampling layer, U(·) represents the upsampling layer, [] represents the feature connection function, x ^i,j represents the output feature map, and i represents The number of layers, j represents the jth convolutional layer of the layer, and k represents the kth connection layer; finally, the Siamese network channel outputs four kinds of multi-scale feature information.

4. The method according to claim 1, characterized in that: the twin-cross attention module in step S2 performs an embedding operation on the four outputs of the dual-channel twin network, first performs a 2D convolution to extract features, and then expands the features into a two-dimensional sequence T ₁ , T ₂ , T ₃ and T ₄ , whose patch sizes are 32, 16, 8 and 4 respectively, and T ₁ -T ₄ are merged to obtain T _∑ , and then processed by a multi-head cross-attention mechanism , the objective function of the first stage is:

K＝T _∑ W _K , V＝T _∑ W _V

in,

W _K and W _V are the weight coefficients of different inputs, T _l represents the feature information token, l represents the feature information of the lth scale, T _∑ represents the feature union of four tokens, and the query vector _Qu is obtained, and the query key K , query value V, l=1,2,3,4, u=1,2,3,4;

The objective function of the second stage is:

Among them, σ(·) and

The objective function of the third stage of multi-head cross attention is:

Among them, CA _h represents the output of the second stage of multi-head cross attention, h represents the output of the hth attention head, and N is the number of attention heads;

The objective function of the final stage of multi-head cross attention is:

O _r ＝MCA _p +MLP(Q _u +MCA _p )

Determine the final output of multi-head cross-attention, where MCA _p represents the output of the third stage of multi-head cross-attention, p represents the pth output, MLP( ) is the multi-layer perceptron function, Qu represents the query vector, and _u represents uth query vector.

5. The method according to claim 4, characterized in that: in step S2, the objective function of the multi-scale feature fusion module is:

M _i =W ₁ ·V(T _l )+W ₂ ·V(O _r )

Among them, W ₁ and W ₂ are the weight parameters of two linear layers, T _l represents the feature information token, l represents the feature information of the l-th scale, O _r represents the output of the multi-head cross-attention module, r represents the r-th The output of the attention head.

6. The method according to claim 1, characterized in that: in step S2, the differential context discrimination module includes a generator and a discriminator, two inputs are received in the generator, and the last layer of the multi-scale feature fusion module obtains The detection image and the generated image obtained by the difference operation of the first and second phases, calculate the loss of the two to push the result closer to the actual changing image, and use the weighted sum of SCAD and the least squares LSGAN loss function in the generator As a loss function, the false positive rate of the model is reduced; the least squares LSGAN loss function is used in the discriminator to improve the detection accuracy, and the loss functions of the generator and the discriminator are accumulated to obtain the final probability loss.

7. The method according to claim 6, characterized in that: in step S2, the objective function of the differential context discrimination module is:

L(P)=L(D)+L(G)

L(D)＝L _LSGAN (D)

L(G)＝L _LSGAN (D)+αL _SCAD

Among them, L(P) represents the probability loss, L(D) represents the discriminator loss, L(G) represents the generator loss, L _LSGAN (D) represents the least squares LSGAN loss of the discriminator, and L _LSGAN (G) represents the generated LSCAD is the least squares LSGAN loss of the filter, and L _SCAD is the SCAD loss.

8. method according to claim 7, is characterized in that: define SCAD loss as:

Among them, C represents the detection type, v(c) represents the pixel error value of the detection type, J _C is the loss item, ρ is the parameter for continuous optimization, and v(c) is defined as follows:

Among them, y _i is the actual change image, s _g (c) is the detection score, and g represents the gth pixel.

9. The method according to claim 7, characterized in that: the least squares LSGAN loss is:

Among them, D(x ₁ ,y) and D(x ₁ ,G(x ₁ )) represent the output of the discriminator to the first phase image, G(x ₁ ) represents the output of the generator to the first phase image, D(x ₂ ,y) and D(x ₂ ,G(x ₂ )) represent the output of the discriminator to the second phase image, G(x ₂ ) represents the output of the generator to the second phase image,

and

Denotes the detection expectation of the first-phase image,

and

10. The method according to claim 7, characterized in that: the least squares LSGAN loss is:

in,

Denotes the detection expectation of the first-phase image,

Represents the detection expectation of the second phase image, D(x ₁ ,G(x ₁ )) represents the output of the discriminator for the first phase image, G(x ₁ ) represents the output of the generator for the first phase image, D(x ₂ , G(x ₂ )) represents the output of the discriminator to the second time-phase image, G(x ₂ ) represents the output of the generator to the second time-phase image, x ₁ , x ₂ represent the input of the discriminator respectively The first and second phase images of .