
CN117036539A - Virtual viewpoint drawing cavity filling method and device based on deep learning - Google Patents


Info

Publication number
CN117036539A
CN117036539A (application number CN202311168862.0A)
Authority
CN
China
Prior art keywords
image
loss
hole
virtual viewpoint
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311168862.0A
Other languages
Chinese (zh)
Inventor
刘家希
周洋
谭子丰
殷海兵
唐向宏
黄晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311168862.0A priority Critical patent/CN117036539A/en
Publication of CN117036539A publication Critical patent/CN117036539A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/20Drawing from basic elements, e.g. lines or circles
    • G06T11/206Drawing of charts or graphs
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/40Filling a planar surface by adding surface attributes, e.g. colour or texture
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep learning-based hole filling method and device for virtual viewpoint rendering. An image to be repaired is obtained from an original image and its mask; the image to be repaired and the mask are fed into a progressive iterative network, which identifies local holes through partial convolution, fills the holes based on a knowledge-consistent attention mechanism, and constructs a loss function that promotes semantic consistency between the background hole region and the known region. A context feature propagation loss is then constructed and merged into this loss function; the output of the progressive iterative network is similarity-encoded to obtain the similarity between image blocks and the non-hole-region image blocks, and the background holes generate semantically consistent filling blocks based on this similarity. Finally, the outputs of the progressive iterative network are weighted and merged to obtain the final repaired image.

Description

A deep learning-based method and device for hole filling in virtual viewpoint rendering

Technical Field

The invention belongs to the technical fields of deep learning and virtual viewpoint rendering, and specifically relates to a deep learning-based method and device for filling holes in virtual viewpoint rendering.

Background Art

Emerging three-dimensional (3D) multimedia visual services such as free-viewpoint video, stereoscopic television, and virtual reality bring immersive and interactive visual experiences to users and are attracting increasing attention. However, representing interactive 3D video requires a large amount of viewpoint information; because of acquisition cost and bandwidth limitations, only a limited number of viewpoints can be captured and transmitted in practice. The multi-view plus depth (MVD) coding format is currently the mainstream format for compressing 3D video and free-viewpoint video. At the decoder, depth image based rendering (DIBR) is used to synthesize the required virtual viewpoints from the MVD video to make up for the limited number of viewpoints. When rendering virtual viewpoints, however, foreground-background occlusion between viewpoints produces holes, cracks, and other missing regions in the rendered image, so these missing regions must be repaired.

Traditional hole filling techniques for virtual viewpoint rendering fall into two main categories: those based on spatial consistency and those based on temporal consistency. Spatial-consistency techniques mainly include filtering and patch-based methods; temporal-consistency techniques use a background model to reconstruct the hole regions of the background. For example, the patent document CN201310017391.3 proposes a virtual viewpoint synthesis method based on depth-map rendering: a virtual viewpoint depth image is obtained through a 3D image transformation and then optimized; an inverse 3D image transformation is performed on the optimized depth map to obtain the virtual viewpoint color image; finally, holes are filled with a depth-guided image inpainting algorithm. The inverse 3D transformation avoids cracks in the virtual viewpoint color image and improves its quality, and the non-occluded hole regions are filled by image inpainting so that the rendered image achieves the best display effect. That patent, however, still fills holes by searching single-view patches under a spatial-consistency assumption, which easily assigns the same weight to foreground and background and produces severe artifacts.

In the prior art, filters in the spatial-consistency category can only repair cracks and small-baseline holes produced during rendering, while patch-based methods search for and match similar patches; because the search is not accurate, foreground and background easily receive the same weight. Temporal-consistency methods fill the virtual view with a reconstructed background image, which involves many steps, and when the scene contains moving objects the foreground is easily modeled as background, leading to foreground-background pixel aliasing and artifacts.

Summary of the Invention

To overcome the shortcomings of the prior art, the present invention adopts an end-to-end deep learning technique to fill holes in virtual viewpoint rendering. The invention adopts the following technical solution:

A deep learning-based hole filling method for virtual viewpoint rendering, comprising the following steps:

Step S1: obtain the image to be repaired from the original image and its mask, and input the image to be repaired and the mask into a progressive iterative network; the progressive iterative network identifies local holes in the image to be repaired through partial convolution, fills the holes based on a knowledge-consistent attention mechanism, and constructs a loss function to improve the semantic consistency between the background hole region and the known region;

Step S2: construct a context feature propagation loss and merge it into the loss function constructed in step S1 to improve the robustness of feature matching; perform similarity encoding on the output of the progressive iterative network to obtain the similarity between image blocks and the non-hole-region image blocks, and, based on this similarity, generate semantically consistent filling blocks for the background holes;

Step S3: weight and merge the outputs of the progressive iterative network to obtain the final repaired image. When the number of progressive iterations reaches a set threshold, the hole region is completely filled; however, directly using the feature map output at that point suffers from vanishing gradients and loss of intermediately generated features, and with average merging or adaptive merging the missing regions in early reconstructed images degrade the quality of the final output image. To solve these problems, the feature maps generated by each progressive iteration are fused through weighted merging.

Further, in step S1, the partial convolution operates only on the valid pixels of the hole region, and the updated mask is kept throughout an iteration until it is shrunk and updated in the next iteration, which benefits the extraction of valid shallow features.

Further, in the knowledge-consistent attention mechanism of step S1, the attention score of each patch is first obtained by normalizing the inner product between the known feature patch vectors and the generated feature patch vectors; the final attention score is then weighted according to whether the mask pixel values generated in the previous iteration are valid; finally, the attention scores are used to update the features of the missing patches, and a convolutional layer improves the structural consistency between the input features and the reconstructed features. The attention mechanism extracts feature maps that are semantically reasonable and contain more accurate pixel information, and the advantage of the knowledge-consistent attention module is that the attention score is measured by the weighted sum of the current score and the score of the previous iteration, so that patches of consecutive iterations are correlated; this solves the problem of traditional patch-based filling assigning the same weight to foreground and background.

Further, in step S1, the loss function combines the L1 loss, the perceptual loss, the style loss, and the smoothing loss; the L1 loss comprises the loss Lvalid of the non-hole region and the loss Lhole of the hole region:

L1 = λ1·Lvalid + λ2·Lhole

where λ1 and λ2 are the weights of Lvalid and Lhole respectively, M is the binary mask (0 denotes the hole region and 1 the valid region), ⊙ denotes element-wise multiplication, Igt is the original image, the prediction map is the output of the t-th progressive iteration of the network, and C, H, W are the number of channels, the height, and the width of the feature patch, respectively;

The perceptual loss is used to enhance the similarity of high-level feature structures between the filled image and the real image. In its definition, Io is the output image, composed of the hole pixels generated by the prediction of the t-th iteration and the non-hole pixels of Igt; ψm(Igt) and ψm(Io) denote the features of Igt and Io output after the m-th pooling layer of the pre-trained VGG-16 feature extraction network; Hm, Wm, and Cm are the height, width, and number of channels of the extracted m-th feature map; and N2 is the number of pooling layers.

The style loss compensates for the fact that the perceptual loss cannot effectively keep the style of the filled region consistent with that of the surrounding region. It obtains the similarity of the features of each layer in the network by computing the Gram matrix of the feature map.

The smoothing loss is used to maintain the smoothness of the filled image. In its definition, Pi,j denotes a pixel of Io, Pi,j+1 and Pi+1,j denote its vertically and horizontally adjacent pixels, R is the dilated region of the hole area whose pixels equal 1, and Nc is the total number of image blocks of Io.

Further, in step S2, the similarity between the generated image blocks and the non-hole-region image blocks is computed by a similarity-encoding convolution, in which the convolution kernels are obtained by extracting patches from the background feature map itself; the kernel size is set to 4×4 with a stride of 2 to avoid checkerboard artifacts.

Further, in step S2, the image to be repaired is auxiliarily encoded, and the encoding result is auxiliarily decoded together with the similarity to obtain an auxiliary image; guided by the auxiliary image, the background holes generate semantically consistent filling blocks.

Further, in step S2, the context feature propagation loss is defined as the hole filling loss of the auxiliary image, which can be regarded as the sum of the local losses of the auxiliary image blocks, where i denotes the index of the image block with the highest similarity S in the iterative output Iout, ui denotes the auxiliary image block with the highest similarity S, and li(·) is the local hole loss of the i-th auxiliary image block.

Further, in step S3, adaptive merging generates weights by analyzing the masks without a learning process, and comprises the following steps:

First, the feature maps are divided: with t the iteration number, the feature maps obtained after the t-th and the (t−1)-th mask updates are separated, and for the output image generated by each progressive iteration a weight map is constructed, where j is the index of the image block repaired in this pass, N1 is the total number of image blocks to be repaired, the weight map corresponds to the j-th repaired region of the t-th iteration's output image, and n is the total number of image blocks generated during this repair.

Then, a normalization function is used to obtain the weight mapping values wj, and the final output image Io is composed of the weighted sum, at each position (x, y), of the feature maps output by the iterations, where H and W are the height and width of the t-th iteration's output image, δ is the activation function, and the weighted values are taken from the feature values of the t-th iteration's output image.

Further, in step S3, the weighted merging produces an adaptive map obtained through a learning process after concatenating the output feature maps of the soft weight mapping and of the adaptive merging; the concatenation preserves the feature information as a whole without destroying the original feature maps. The soft weight map is obtained by connecting the iterative output Iout with the input feature map Iin:

Ws = σ(Conv([Iin, Iout])) · (1 − M) + M

where σ is the activation function and M is the binary mask (0 denotes the hole region, 1 the valid region); the value Is of the feature map obtained with the soft weights is computed accordingly.

A deep learning-based hole filling device for virtual viewpoint rendering, comprising a memory and one or more processors, wherein executable code is stored in the memory and, when the one or more processors execute the executable code, they implement the above deep learning-based hole filling method for virtual viewpoint rendering.

Advantages and beneficial effects of the invention:

The deep learning-based hole filling method and device for virtual viewpoint rendering of the present invention reduce labor cost compared with traditional techniques by using an end-to-end deep learning technique; the knowledge-consistent attention module and the context feature propagation loss module improve the accuracy of feature matching and alleviate foreground-background aliasing and artifacts; and the weighted merging module helps fuse the iteratively generated feature maps without losing detail.

Brief Description of the Drawings

Figure 1 is a flow chart of the method in an embodiment of the present invention.

Figure 2 is a further detailed flow chart of the method in an embodiment of the present invention.

Figure 3 is a schematic structural diagram of the context feature propagation loss module in an embodiment of the present invention.

Figure 4 is a schematic structural diagram of the weighted merging module in an embodiment of the present invention.

Figure 5a is a flow chart of the adaptive merging method in an embodiment of the present invention.

Figure 5b is a flow chart of the soft weighting method in an embodiment of the present invention.

Figure 6 is a schematic structural diagram of the device in an embodiment of the present invention.

Detailed Description of the Embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to illustrate and explain the present invention and are not intended to limit it.

As shown in Figure 1, a deep learning-based hole filling method for virtual viewpoint rendering comprises the following steps:

Step S1: input the original image Igt and the mask M to obtain the image to be repaired Iin, and feed Iin and M into the progressive iteration network (PINet);

Specifically, the original image Igt and the mask M are input to obtain the image to be repaired Iin, and Iin and M are fed into PINet. The PINet model is a convolutional neural network based on the U-Net architecture and comprises partial convolutions, a knowledge-consistent attention module, and a context feature propagation loss module, as shown in Figure 2. In the initial stage of the network, four 7×7 partial convolutions are introduced for local hole identification, and batch normalization and activation functions are used to accelerate model convergence. The partial convolution operates only on the valid pixels of the hole region, and the updated mask is kept throughout an iteration until it is shrunk and updated in the next iteration, which benefits the extraction of valid shallow features.
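As a non-limiting illustration of the partial-convolution behavior described above, the following PyTorch sketch shows a mask-aware convolution that operates only on valid pixels and returns an updated (shrunk) mask; the class name, layer sizes, and re-normalization details are illustrative assumptions rather than the exact PINet layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Mask-aware convolution: only valid (non-hole) pixels contribute, and an
    updated (shrunk) mask is returned for the next stage."""

    def __init__(self, in_ch, out_ch, kernel_size=7, stride=1, padding=3):
        super().__init__()
        # bias omitted to keep the valid-pixel re-normalization simple
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=False)
        # fixed all-ones kernel used to count valid pixels under each window
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.window = kernel_size * kernel_size

    def forward(self, x, mask):
        # x: (B, C, H, W); mask: (B, 1, H, W) with 1 = valid pixel, 0 = hole
        valid_count = F.conv2d(mask, self.ones,
                               stride=self.conv.stride, padding=self.conv.padding)
        out = self.conv(x * mask)
        # re-normalize by the fraction of valid pixels in each window
        out = out * (self.window / valid_count.clamp(min=1.0))
        out = out * (valid_count > 0).float()
        # a location becomes valid once any input pixel under its window was valid;
        # this updated mask is kept until the next iteration shrinks it further
        new_mask = (valid_count > 0).float()
        return out, new_mask
```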

The attention mechanism extracts feature maps that are semantically reasonable and contain more accurate pixel information. To solve the problem of traditional patch-based filling assigning the same weight to foreground and background, a knowledge-consistent attention module is combined for hole filling. Its advantage is that the attention score is measured by the weighted sum of the current attention score and the score of the previous iteration, so that patches of consecutive iterations are correlated. It first obtains the attention score of each patch by normalizing the inner product between the known feature patch vectors and the generated feature patch vectors; next, the final attention score is weighted according to whether the mask pixel value generated in the previous iteration equals 1 (indicating a valid pixel); finally, the attention scores are used to update the features of the missing patches, and a convolutional layer improves the structural consistency between the input features and the reconstructed features.
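The sketch below illustrates, under assumptions, how such a knowledge-consistent attention score could blend the current normalized inner-product scores with those of the previous iteration; the blending weight lam and the one-token-per-pixel patching are illustrative choices, not the patent's exact module.

```python
import torch
import torch.nn.functional as F

def knowledge_consistent_attention(feat, mask, prev_score=None, lam=0.5):
    """Fill hole features by attending to known features, blending the current
    attention scores with those of the previous iteration.

    feat: (B, C, H, W) features; mask: (B, 1, H, W), 1 = valid, 0 = hole.
    prev_score: (B, H*W, H*W) scores from the previous iteration, or None.
    lam is an assumed blending weight between current and previous scores."""
    B, C, H, W = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)               # (B, H*W, C)
    normed = F.normalize(tokens, dim=-1)
    score = torch.bmm(normed, normed.transpose(1, 2))      # normalized inner products
    # only allow attention towards known (valid) positions
    valid_cols = mask.flatten(2)                           # (B, 1, H*W)
    score = score.masked_fill(valid_cols == 0, float("-inf"))
    score = F.softmax(score, dim=-1)
    # weight the final score with the previous iteration's score
    if prev_score is not None:
        score = lam * score + (1.0 - lam) * prev_score
    # reconstruct hole features as attention-weighted sums of known features
    filled = torch.bmm(score, tokens).transpose(1, 2).view(B, C, H, W)
    out = feat * mask + filled * (1.0 - mask)
    return out, score
```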

In the design of the loss function, the L1 loss, the perceptual loss, the style loss, the smoothing loss, and the context feature propagation loss are combined to improve the semantic consistency between the background hole region and the known region.

The L1 loss is composed of the loss Lvalid of the non-hole region and the loss Lhole of the hole region, i.e.

L1 = λ1·Lvalid + λ2·Lhole    (1)

where λ1 and λ2 are the weights of Lvalid and Lhole respectively, M is the binary mask (0 denotes the hole region, 1 the valid region), and ⊙ denotes element-wise multiplication. Igt is the original reference view, the prediction map is the output of the t-th progressive iteration of the network, and C, H, W are the number of channels, the height, and the width of the feature patch, with C = 256, H = 32, W = 32; the weight parameters are set to λ1 = 1 and λ2 = 6.
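A minimal sketch of this L1 term, assuming PyTorch tensors and a simple per-element mean; the exact normalization over C, H, W used in the patent's formulas is not reproduced here.

```python
import torch

def l1_losses(pred, gt, mask, lam_valid=1.0, lam_hole=6.0):
    """Separate L1 terms over the valid region (M = 1) and the hole region (M = 0),
    combined as L1 = lam_valid * L_valid + lam_hole * L_hole."""
    n = float(pred.numel())
    l_valid = torch.abs(mask * (pred - gt)).sum() / n          # non-hole region
    l_hole = torch.abs((1.0 - mask) * (pred - gt)).sum() / n   # hole region
    return lam_valid * l_valid + lam_hole * l_hole
```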

The perceptual loss is used to enhance the similarity of high-level feature structures between the filled image and the real image. In its definition, Io is the output image, composed of the hole pixels generated by the prediction of the t-th iteration and the non-hole pixels of Igt; ψm(Igt) and ψm(Io) denote the features of Igt and Io output after the m-th pooling layer of the pre-trained network VGG-16 (m = 1, 2, 3); Hm, Wm, and Cm are the height, width, and number of channels of the extracted m-th feature map; and N2 is the number of pooling layers.
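The perceptual loss could be sketched as follows, assuming torchvision's VGG-16 and taking the features after the first three pooling layers (indices 4, 9, 16 of vgg16().features); the layer indices and the averaging are assumptions consistent with m = 1, 2, 3.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """L1 distance between VGG-16 features of the output and ground-truth images,
    taken after the first three pooling layers (m = 1, 2, 3)."""

    def __init__(self, pool_indices=(4, 9, 16)):  # MaxPool positions in vgg16().features
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.pool_indices = set(pool_indices)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.pool_indices:
                feats.append(x)
        return feats

    def forward(self, out_img, gt_img):
        loss = 0.0
        feats_o, feats_g = self._features(out_img), self._features(gt_img)
        for fo, fg in zip(feats_o, feats_g):
            b, c, h, w = fo.shape
            loss = loss + torch.abs(fo - fg).sum() / (b * c * h * w)
        return loss / len(self.pool_indices)
```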

The style loss is used to compensate for the fact that the perceptual loss cannot effectively keep the style of the filled region consistent with that of the surrounding region. It obtains the similarity of the features of each layer in the network by computing the Gram matrix of the feature map.
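A sketch of the Gram-matrix style loss under the same assumptions; the normalization of the Gram matrix is an illustrative choice.

```python
import torch

def gram_matrix(feat):
    """Gram matrix of a (B, C, H, W) feature map, normalized by its size."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_out, feats_gt):
    """L1 distance between Gram matrices of corresponding feature maps
    (e.g. the lists produced by the perceptual-loss extractor above)."""
    loss = 0.0
    for fo, fg in zip(feats_out, feats_gt):
        loss = loss + torch.abs(gram_matrix(fo) - gram_matrix(fg)).mean()
    return loss / max(len(feats_out), 1)
```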

The smoothing loss is used to maintain the smoothness of the filled image. In its definition, Pi,j denotes a pixel of Io, and Pi,j+1 and Pi+1,j denote its vertically and horizontally adjacent pixels. R is the dilated region of the hole area whose pixels equal 1, and Nc is the total number of image blocks of Io.
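A sketch of this smoothing (total-variation style) loss restricted to the dilated hole region; computing the dilation R itself is assumed to happen beforehand.

```python
import torch

def smoothing_loss(out_img, region):
    """Total-variation style smoothness over a region R (the dilated hole area).

    out_img: (B, C, H, W); region: (B, 1, H, W) with 1 inside R, 0 elsewhere."""
    dh = torch.abs(out_img[:, :, :, 1:] - out_img[:, :, :, :-1]) * region[:, :, :, 1:]
    dv = torch.abs(out_img[:, :, 1:, :] - out_img[:, :, :-1, :]) * region[:, :, 1:, :]
    n = region.sum().clamp(min=1.0)
    return (dh.sum() + dv.sum()) / n
```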

In summary, combined with the context feature propagation loss LCFP, the total loss function is expressed as the weighted sum of the above terms, where λper, λstyle, λtv, and λCFP are the weight parameters of the perceptual loss, the style loss, the smoothing loss, and the context feature propagation loss respectively; experiments give λper = 0.05, λstyle = 120, λtv = 0.1, and λCFP = 0.5.
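Putting the terms together with the reported weights might look like the following sketch; whether the L1 term carries an additional weight of its own is an assumption.

```python
def total_loss(l1, l_per, l_style, l_tv, l_cfp,
               lam_per=0.05, lam_style=120.0, lam_tv=0.1, lam_cfp=0.5):
    """Weighted sum of the individual terms with the weights reported in the text."""
    return l1 + lam_per * l_per + lam_style * l_style + lam_tv * l_tv + lam_cfp * l_cfp
```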

Step S2: a context feature propagation loss module is added in the deep stage of the network to improve the robustness of feature matching;

A context feature propagation loss module is added in the deep stage of the network to improve the robustness of feature matching; the flow is shown in Figure 3. The similarity encoder takes as input the output image Iout generated by the progressive iterations, and the auxiliary encoder takes the input feature Iin as input. Before the auxiliary encoder's features pass through the decoder, the similarity S between the generated image blocks and the non-hole-region image blocks is computed by a similarity-encoding convolution, in which the kernels are obtained by extracting patches from the background feature map itself; the kernel size is set to 4×4 with a stride of 2 to avoid checkerboard artifacts. The similarity S is then fed into the auxiliary decoder, and the auxiliary image is reconstructed through the auxiliary decoder's deconvolutions. Guided by the auxiliary image, the background holes can generate semantically consistent filling blocks.
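A sketch of the similarity-encoding step, assuming PyTorch: 4×4 patches extracted from the background (non-hole) feature map are normalized and used as convolution kernels with stride 2, yielding one similarity map per background patch.

```python
import torch
import torch.nn.functional as F

def similarity_encoding(gen_feat, bg_feat, mask, ksize=4, stride=2):
    """Correlate the generated feature map with 4x4 patches extracted from the
    background (non-hole) feature map, used here as convolution kernels.

    gen_feat, bg_feat: (1, C, H, W); mask: (1, 1, H, W), 1 = valid."""
    c = bg_feat.shape[1]
    patches = F.unfold(bg_feat * mask, kernel_size=ksize, stride=stride)  # (1, C*k*k, L)
    kernels = patches.transpose(1, 2).reshape(-1, c, ksize, ksize)        # (L, C, k, k)
    kernels = F.normalize(kernels.flatten(1), dim=1).view_as(kernels)
    # one similarity map per background patch; stride 2 avoids checkerboard artifacts
    sim = F.conv2d(gen_feat, kernels, stride=stride)
    return F.softmax(sim, dim=1)                                          # (1, L, H', W')
```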

Unlike the other loss terms used here, the context feature propagation loss is defined as the hole filling loss of the auxiliary image, which can be regarded as the sum of the local losses of the auxiliary image blocks, where i denotes the index of the image block with the highest similarity S in the iterative output Iout, ui denotes the auxiliary image block with the highest similarity S, and li(·) is the local hole loss of the i-th auxiliary image block.
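A hedged sketch of this loss: per-block L1 losses of the auxiliary image are summed over the blocks with the highest similarity scores. The per-block L1 choice and the top-k selection are assumptions, since the exact local loss li(·) is not reproduced in the extracted text.

```python
import torch
import torch.nn.functional as F

def cfp_loss(aux_img, gt_img, patch_sim, ksize=4, stride=2, top_k=64):
    """Sum of local L1 losses of the auxiliary-image blocks that received the
    highest similarity scores.

    aux_img, gt_img: (1, C, H, W); patch_sim: (L,) one similarity value per block,
    with L equal to the number of blocks produced by the unfold below."""
    diff = F.unfold(torch.abs(aux_img - gt_img), kernel_size=ksize, stride=stride)
    local_l1 = diff.mean(dim=1).squeeze(0)       # (L,) one local loss per block
    k = min(top_k, patch_sim.numel())
    idx = torch.topk(patch_sim, k).indices       # indices of the most similar blocks
    return local_l1[idx].sum()
```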

Step S3: the group of feature maps Iout generated iteratively by the network is fed into the weighted merging module for fusion;

The group of feature maps Iout generated iteratively by the network is fed into the weighted merging module for fusion. When the number of progressive iterations reaches the set threshold, the hole region is completely filled. However, directly using the feature map output at that point suffers from vanishing gradients and loss of intermediately generated features, and with average merging or adaptive merging the missing regions in early reconstructed images degrade the quality of the final output image. The overall flow is shown in Figure 4.

To solve the above problems, a weighted merging method is proposed to fuse the feature maps generated by each progressive iteration. The adaptive merging method generates weights by analyzing the masks without a learning process, as shown in Figure 5a. First, the feature maps are divided: with t the iteration number, the feature maps obtained after the t-th and the (t−1)-th mask updates are separated, and for the output image generated by each progressive iteration a weight map is constructed, where j is the index of the image block repaired in this pass, N1 is the total number of image blocks to be repaired, the weight map corresponds to the j-th repaired region of the t-th iteration's output image, and n is the total number of image blocks generated during this repair.

Then, the softmax function is used for normalization to obtain the weight mapping values wj. The final output image Io is composed of the weighted sum, at each position (x, y), of the feature maps output by the iterations, where H and W are the height and width of the t-th iteration's output image, δ is the Leaky-ReLU activation function, and the weighted values are taken from the feature values of the t-th iteration's output image.
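One plausible reading of the adaptive merging step, sketched below: each iteration contributes the region it newly filled (the difference between consecutive masks), and a softmax over the iteration axis produces the per-iteration weight maps. This is an assumption-laden simplification of the weight-map construction described above, whose exact formulas are not reproduced here.

```python
import torch
import torch.nn.functional as F

def adaptive_merge(outputs, masks):
    """Merge per-iteration outputs: each iteration contributes the region it
    newly filled, weighted by a softmax over the iteration axis.

    outputs: list of T images (B, C, H, W) from the progressive iterations;
    masks:   list of T + 1 masks (B, 1, H, W), masks[0] the input mask, 1 = valid."""
    T = len(outputs)
    newly_filled = [masks[t + 1] - masks[t] for t in range(T)]    # region repaired at step t
    weights = F.softmax(torch.stack(newly_filled, dim=0), dim=0)  # (T, B, 1, H, W)
    return sum(w * o for w, o in zip(weights, outputs))
```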

The weighted merging produces an adaptive map obtained through a learning process after concatenating (concat) the output feature maps of the soft weight mapping and of the adaptive merging; the concatenation preserves the feature information as a whole without destroying the original feature maps. The details of the soft weight method are shown in Figure 5b: Iout and the input feature map Iin are connected to obtain a soft weight map:

Ws = σ(Conv([Iin, Iout])) · (1 − M) + M    (12)

where σ is the sigmoid activation function and M is the binary mask (0 denotes the hole region, 1 the valid region). The value Is of the feature map obtained with the soft weights is computed accordingly.
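A sketch of the soft-weight map of equation (12), assuming PyTorch; how Is finally combines Iout and Iin is not given in the extracted text, so a convex combination is assumed.

```python
import torch
import torch.nn as nn

class SoftWeightMerge(nn.Module):
    """Soft-weight map W_s = sigma(Conv([I_in, I_out])) * (1 - M) + M, forcing
    known regions (M = 1) to keep weight 1."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, i_in, i_out, mask):
        w = torch.sigmoid(self.conv(torch.cat([i_in, i_out], dim=1)))
        w = w * (1.0 - mask) + mask
        # how W_s produces I_s is not reproduced in the extracted text;
        # a convex combination of I_out and I_in is assumed here
        return w * i_out + (1.0 - w) * i_in
```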

Corresponding to the foregoing embodiments of the deep learning-based hole filling method for virtual viewpoint rendering, the present invention also provides embodiments of a deep learning-based hole filling device for virtual viewpoint rendering.

Referring to Figure 6, a deep learning-based hole filling device for virtual viewpoint rendering provided by an embodiment of the present invention comprises a memory and one or more processors; executable code is stored in the memory, and when the one or more processors execute the executable code they implement the deep learning-based hole filling method for virtual viewpoint rendering of the above embodiments.

The embodiments of the deep learning-based hole filling device for virtual viewpoint rendering of the present invention can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of software and hardware. Taking software implementation as an example, the device in a logical sense is formed by the processor of the device in which it resides reading the corresponding computer program instructions from a non-volatile memory into memory and running them. In terms of hardware, Figure 6 shows a hardware structure diagram of a device with data processing capability in which the deep learning-based hole filling device for virtual viewpoint rendering of the present invention resides; in addition to the processor, memory, network interface, and non-volatile memory shown in Figure 6, the device in which the apparatus of the embodiment resides may also include other hardware according to its actual functions, which is not described further here.

For details of the implementation of the functions and effects of the units in the above device, refer to the implementation of the corresponding steps in the above method; they are not repeated here.

Since the device embodiments substantially correspond to the method embodiments, refer to the description of the method embodiments for relevant details. The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, which a person of ordinary skill in the art can understand and implement without creative effort.

An embodiment of the present invention also provides a computer-readable storage medium on which a program is stored; when the program is executed by a processor, it implements the deep learning-based hole filling method for virtual viewpoint rendering of the above embodiments.

The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.

The above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A deep learning-based hole filling method for virtual viewpoint rendering, characterized by comprising the following steps:
step S1: based on an original image and a mask thereof, obtaining an image to be repaired, and inputting the image to be repaired and the mask into a progressive iterative network; the progressive iterative network performs local hole identification on the image to be repaired through partial convolution, performs hole filling based on a knowledge-consistent attention mechanism, and constructs a loss function to promote semantic consistency between a background hole region and a known region;
step S2: constructing a context feature propagation loss and merging it into the loss function constructed in step S1; performing similarity encoding on the output of the progressive iterative network to obtain the similarity between image blocks and non-hole-region image blocks, and generating semantically consistent filling blocks for the background holes based on the similarity;
step S3: performing weighted merging on the outputs of the progressive iterative network to obtain a final repaired image.
2. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S1, the partial convolution uses only the valid pixels of the hole region for its operation, and the updated mask is preserved throughout the iteration until it is shrunk and updated in the next iteration.
3. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in the knowledge-consistent attention mechanism of step S1, the attention score of each patch is first obtained by normalizing the inner product between the known feature patch vectors and the generated feature patch vectors; then the final attention score is weighted according to whether the mask pixel values generated in the previous iteration are valid; finally, the attention scores are used to update the features of the missing patches, and the structural consistency between the input features and the reconstructed features is improved through a convolutional layer.
4. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S1, the loss function combines an L1 loss, a perceptual loss, a style loss, and a smoothing loss;
the L1 loss comprises the loss Lvalid of the non-hole region and the loss Lhole of the hole region:
L1 = λ1·Lvalid + λ2·Lhole
where λ1 and λ2 are the weights of Lvalid and Lhole respectively, M is the binary mask divided into a hole region and a valid region, ⊙ denotes element-wise multiplication, Igt is the original image, and C, H, W are the number of channels, the height, and the width of the feature patch, respectively;
the perceptual loss is defined over the output image Io, composed of the hole pixels generated by the prediction of the t-th iteration and the non-hole pixels of Igt, where ψm(Igt) and ψm(Io) denote the features of Igt and Io output after the m-th pooling layer of the pre-trained feature extraction network, Hm, Wm, and Cm are the height, width, and number of channels of the extracted m-th feature map, and N2 is the number of pooling layers;
the style loss obtains the similarity of the features of each layer in the network by computing the Gram matrix of the feature map;
the smoothing loss is defined over Io, where Pi,j denotes a pixel of Io, Pi,j+1 and Pi+1,j denote its vertically and horizontally adjacent pixels, R is the dilated region of the hole area, and Nc is the total number of image blocks of Io.
5. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S2, the similarity between the generated image blocks and the non-hole-region image blocks is computed by a similarity-encoding convolution, in which the convolution kernels are obtained by extracting patches from the background feature map itself.
6. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S2, the image to be repaired is auxiliarily encoded, and the encoding result is auxiliarily decoded together with the similarity to obtain an auxiliary image; guided by the auxiliary image, the background holes generate semantically consistent filling blocks.
7. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 6, characterized in that: in step S2, the context feature propagation loss is defined as the hole filling loss of the auxiliary image, where i denotes the index of the image block with the highest similarity S in the iterative output Iout, ui denotes the auxiliary image block with the highest similarity S, and li(·) is the local hole loss of the i-th auxiliary image block.
8. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S3, the adaptive merging generates weights by analyzing the masks and comprises the following steps:
first, dividing the feature maps, where t is the iteration number and the feature maps after the t-th and (t−1)-th mask updates are separated, and constructing, for the output image generated by each progressive iteration, a weight map, where j is the index of the image block repaired in this pass, N1 is the total number of image blocks to be repaired, the weight map corresponds to the j-th repaired region of the t-th iteration's output image, and n is the total number of image blocks generated during this repair;
then, obtaining the weight mapping values wj with a normalization function; the final output image Io is composed of the weighted sum, at each position (x, y), of the feature maps output by the iterations, where H and W are the height and width of the t-th iteration's output image, and δ is the activation function.
9. The deep learning-based hole filling method for virtual viewpoint rendering according to claim 1, characterized in that: in step S3, the weighted merging is an adaptive map obtained through a learning process after concatenating the output feature maps of the soft weight mapping and of the adaptive merging, and the soft weight map is obtained by connecting the iterative output Iout with the input feature map Iin:
Ws = σ(Conv([Iin, Iout])) · (1 − M) + M
where σ is the activation function, M is the binary mask divided into a hole region and a valid region, and Is is the value of the feature map obtained with the soft weights.
10. a deep learning-based virtual viewpoint drawing hole filling device, comprising a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors are used for realizing the deep learning-based virtual viewpoint drawing hole filling method according to any one of claims 1-9 when executing the executable code.
CN202311168862.0A 2023-09-12 2023-09-12 Virtual viewpoint drawing cavity filling method and device based on deep learning Pending CN117036539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311168862.0A CN117036539A (en) 2023-09-12 2023-09-12 Virtual viewpoint drawing cavity filling method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311168862.0A CN117036539A (en) 2023-09-12 2023-09-12 Virtual viewpoint drawing cavity filling method and device based on deep learning

Publications (1)

Publication Number Publication Date
CN117036539A true CN117036539A (en) 2023-11-10

Family

ID=88602553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311168862.0A Pending CN117036539A (en) 2023-09-12 2023-09-12 Virtual viewpoint drawing cavity filling method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN117036539A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118691510A (en) * 2024-07-02 2024-09-24 北京四象爱数科技有限公司 A method, device and medium for repairing holes in SAR images
CN120166308A (en) * 2025-03-20 2025-06-17 北京航空航天大学 A method for generating light field 3D source for real scenes

Similar Documents

Publication Publication Date Title
Cheng et al. Gaussianpro: 3d gaussian splatting with progressive propagation
He et al. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline
CN113240613B (en) Image restoration method based on edge information reconstruction
Yang et al. Lego: Learning edge with geometry all at once by watching videos
Jaritz et al. Sparse and dense data with cnns: Depth completion and semantic segmentation
CN111784602B (en) Method for generating countermeasure network for image restoration
CN110689599B (en) 3D visual saliency prediction method based on non-local enhancement generation countermeasure network
CN113112416A (en) Semantic-guided face image restoration method
CN117036181A (en) Training method and device for image processing model, electronic equipment and storage medium
CN113850900B (en) Method and system for recovering depth map based on image and geometric clues in 3D reconstruction
CN112862949B (en) Object 3D shape reconstruction method based on multiple views
CN104954780A (en) DIBR (depth image-based rendering) virtual image restoration method applicable to high-definition 2D/3D (two-dimensional/three-dimensional) conversion
CN117036539A (en) Virtual viewpoint drawing cavity filling method and device based on deep learning
Vitoria et al. Semantic image inpainting through improved wasserstein generative adversarial networks
Atapour-Abarghouei et al. Generative adversarial framework for depth filling via wasserstein metric, cosine transform and domain transfer
Kang et al. Competitive learning of facial fitting and synthesis using uv energy
CN115035298A (en) City streetscape semantic segmentation enhancement method based on multi-dimensional attention mechanism
CN116703719A (en) A face super-resolution reconstruction device and method based on face 3D prior information
CN118967714B (en) Medical image segmentation model establishment method based on harmonic attention and medical image segmentation method
CN117372461A (en) Global and local multiscale dynamic virtual viewpoint cavity filling method
CN117237636A (en) A colorectal polyp image segmentation method based on regional self-attention
CN119048399A (en) Image restoration method, system, equipment and medium integrating cross attention
CN118015142A (en) Face image processing method, device, computer equipment and storage medium
CN114663315B (en) Image bit enhancement method and device for generating countermeasure network based on semantic fusion
CN114638868B (en) A method for generating high-resolution 3D face texture in natural scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination