CN103942095B

CN103942095B - Two-dimensional phase position unwrapping method based on heterogeneous accelerating platform

Info

Publication number: CN103942095B
Application number: CN201410101628.0A
Authority: CN
Inventors: 吴振华; 马文静; 龙国平; 李玉成
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2014-03-18
Filing date: 2014-03-18
Publication date: 2017-02-15
Anticipated expiration: 2034-03-18
Also published as: CN103942095A

Abstract

The invention discloses a two-dimensional phase position unwrapping method based on a heterogeneous accelerating platform. Local matching is added in the Branch cut step, and the parallel realization bottleneck is overcome; the Block dynamic organizing method is adopted in the FloodFill step of the algorithm, and the data dependence problem is solved; the executing speed is increased, and the hardware resource is utilized to the maximum extent through the optimizing methods such as merging, compression storing, and data fake boundary creating. The real-time performance requirement of the two-dimensional phase unwrapping algorithm is met, and certain guiding significance is provided for parallel programming and data dependence overcoming.

Description

A 2D Phase Unwrapping Method Based on Heterogeneous Acceleration Platform

技术领域technical field

本发明涉及计算机视觉和图像处理领域，具体来说，本发明涉及一种基于异构加速平台的加速二维解缠绕方法。The invention relates to the fields of computer vision and image processing, in particular, the invention relates to an accelerated two-dimensional unwrapping method based on a heterogeneous acceleration platform.

背景技术Background technique

许多应用需要用到相位数据，例如显微干涉、合成孔径雷达（synthetic apertureradar，SAR）、磁共振成像（magnetic resonance imaging，MRI）和自适应光学(adaptiveoptics)等。计算相位一般使用反正切函数，其值域是(-π,π)，因而计算出来的相位被限定在(-π,π)之内，直接计算得到的相位被“截断”了(也称为被“包裹”或“缠绕”，wrapped)。实际过程中必须将被截断(包裹)的相位连接起来，即解除相位“截断”，把被包裹的相位恢复到实际的相位值，这个过程称为相位解缠绕(unwrap)、相位解包裹或者相位展开等。Many applications require phase data, such as microscopic interferometry, synthetic aperture radar (SAR), magnetic resonance imaging (MRI), and adaptive optics. The arctangent function is generally used to calculate the phase, and its value range is (-π, π), so the calculated phase is limited to (-π, π), and the directly calculated phase is "truncated" (also known as To be "wrapped" or "wrapped", wrapped). In the actual process, the truncated (wrapped) phases must be connected, that is, the phase "truncation" is released, and the wrapped phases are restored to the actual phase value. This process is called phase unwrap (unwrap), phase unwrapping or phase Expand etc.

二维相位解缠绕(Phase unwrapping，PU)是数字图像处理上一个常用的方法，是一个计算量很大，很耗时的过程。有几十种算法来解决二维解缠绕问题，主要分为两类：Path following方法和Minimum-Norm方法。Path following方法主要包括Goldstein’s算法、Quality-guided算法、Flynns最小不一致算法、PCG算法和Multigrid算法。而Minimum-Norm方法则有Minimum L_p-norm算法、Unweighted least-squares算法和Weighted least-squares算法等。在这些算法中，Goldstein算法运行速度快、输入简单、易于实现，通常作为二维解缠绕问题的第一选择。Two-dimensional phase unwrapping (Phase unwrapping, PU) is a commonly used method in digital image processing, which is a computationally intensive and time-consuming process. There are dozens of algorithms to solve the two-dimensional unwrapping problem, mainly divided into two categories: Path following methods and Minimum-Norm methods. Path following methods mainly include Goldstein's algorithm, Quality-guided algorithm, Flynns minimum inconsistency algorithm, PCG algorithm and Multigrid algorithm. The Minimum-Norm method includes the Minimum L _p -norm algorithm, the Unweighted least-squares algorithm, and the Weighted least-squares algorithm. Among these algorithms, the Goldstein algorithm is fast in operation, simple in input, and easy to implement, and is usually used as the first choice for two-dimensional unwrapping problems.

Goldstein算法分为三个过程。首先定义出相位数据中(通常是矩阵)的不一致点，表示可能的不连续位置的起始和结束，用Residue值来标记，用+/-1的Residue值来表示这些不一致点和其属性。然后用一个Branch cut过程连接相反极性的点，构成可能的不一致区域，该操作在图像中构成了一道道屏障，阻止在解缠绕操作时传播路径的通过，防止局部不一致的传播。最后，用一个FloodFill过程来实现解缠绕，从而得到连续的相位数据。The Goldstein algorithm is divided into three processes. First define the inconsistency points in the phase data (usually a matrix), indicating the start and end of possible discontinuous positions, marked with Residue values, and using +/-1 Residue values to represent these inconsistencies and their attributes. Then use a branch cut process to connect the points of opposite polarity to form possible inconsistencies. This operation constitutes a barrier in the image to prevent the passage of the propagation path during the unwrapping operation and prevent the propagation of local inconsistencies. Finally, a FloodFill process is used to achieve unwrapping to obtain continuous phase data.

(1)定义不一致点：图2和公式1～公式4描述了定义不一致点的过程。图2中，a、b、c、d是图像数据中的邻接四个相位点，Δi表示相位梯度，它是相邻之间的相位差“解缠”到(-π，π)的值。首先求出四个相位差(公式1，δ表示相位差，P表示相位)，然后解缠到(-π，+π)的范围内(公式2，公式3，W为解缠绕操作)，称为相位梯度：Δ，最后累加四个相位梯度(公式4)。相位梯度和可以为{-2π,0,2π}，即公式4中的n可能为{-1,0,1}，若n=1或n=-1表示该点存在不连续现象,为Residue值非0的点。(1) Definition of inconsistencies: Figure 2 and Equations 1 to 4 describe the process of defining inconsistencies. In Figure 2, a, b, c, and d are four adjacent phase points in the image data, and Δi represents the phase gradient, which is the value of "unwrapping" the phase difference between adjacent phases to (-π, π). First find out the four phase differences (Formula 1, δ represents the phase difference, P represents the phase), and then unwrap to the range of (-π, +π) (Formula 2, Formula 3, W is the unwrapping operation), called is the phase gradient: Δ, and finally accumulate four phase gradients (Equation 4). The phase gradient sum can be {-2π,0,2π}, that is, n in formula 4 may be {-1,0,1}, if n=1 or n=-1 means that there is a discontinuity at this point, which is Residue Points with a value other than 0.

δ_i＝Ρ_i-Ρ_i-1 公式1δ _i =Ρ _i -Ρ _i-1 Formula 1

W(Ρ_i)＝Ρ_i+2k_iπk_i∈Z 公式2W(Ρ _i )＝Ρ _i +2k _i πk _i ∈ Z Formula 2

Δ_i＝W(δ_i) 公式3Δ _i =W(δ _i ) Formula 3

公式4 Formula 4

(2)Branch Cut：图3描述了Branch cut过程中形成的“积分栅栏”，它能阻止积分路径的通过，其中黑色的圈代表Residue值为+1的点，灰色的圈代表Residue值为-1的点，箭头表示可能的正确积分方向。在第一步中的每一个Residue值为+1的点，在其周围找到最近的-1的点或者边界，然后这两个点及其连线上的所有点都被Branch cut掉，形成一条“栅栏”。接下来则由Residue值为-1的点向周围寻找+1的点(或边界)。(2) Branch Cut: Figure 3 describes the "integral fence" formed during the branch cut process, which can prevent the passage of the integral path, where the black circle represents the point with a Residue value of +1, and the gray circle represents a Residue value of - 1, the arrow indicates the possible correct integration direction. For each point with a Residue value of +1 in the first step, find the nearest -1 point or boundary around it, and then these two points and all points on their connection line are cut by Branch to form a line "fence". Next, look for +1 points (or borders) around the point with a Residue value of -1.

而在Branch cut之前可以加一步预处理过程：Dipole cut，提高处理速度。相邻并且有着相反极性的Residue值的点称为偶极子(Dipole)。在求出Residue值后随即寻找偶极子，将这两个点标记为BranchCut点。Before the Branch cut, a preprocessing step can be added: Dipole cut to improve the processing speed. Adjacent points with Residue values of opposite polarity are called dipoles. After obtaining the Residue value, search for dipoles immediately, and mark these two points as BranchCut points.

(3)FloodFill过程是一个比较成熟的算法，被包含在很多函数库中。图4描述了FloodFill的过程。首先选一个种子点，这个点必须为一个“好”点(不能被branch cut掉)，一般选相位矩阵的中心。随即这个种子点周围的四个点(上下左右)执行解缠绕工作，然后这四个点加入种子点的列表中去。迭代执行，直到所有的点要么已经被解缠绕要么是BranchCut点。图4中，黑色的点表示BranchCut的点，灰色的点表示种子点或已经被解缠的点。(3) The FloodFill process is a relatively mature algorithm, which is included in many function libraries. Figure 4 describes the process of FloodFill. First select a seed point, this point must be a "good" point (cannot be cut by branch), generally choose the center of the phase matrix. Immediately, the four points (up, down, left, and right) around the seed point perform unwrapping work, and then these four points are added to the list of seed points. This is performed iteratively until all points have either been unwrapped or are BranchCut points. In Figure 4, the black points represent BranchCut points, and the gray points represent seed points or points that have been unwrapped.

近年来，以GPGPU为代表的异构加速平台成为日益重要的计算趋势，其平台涵盖了高性能计算、桌面计算和移动计算等宽阔的领域。相应日益丰富的应用领域和异构加速平台融合能大幅提高系统的整体能效，特别的，若待开发的程序符合大规模并行结构，利用GPU-CUDA技术能大幅提升运行效率。Goldstein算法虽然是解决二维解缠绕最快的算法，但是仍然有着很大的计算量，需要很长的运行时间，在很多情况下成为了系统性能的瓶颈。利用GPGPU技术可以提高Goldstein算法的性能。但是其中Branch cut步骤和FloodFill步骤有着严重的数据依赖，制约着算法的并行化和并行效率。因此，本发明正是针对Branch cut步骤和FloodFill步骤的创新。In recent years, the heterogeneous acceleration platform represented by GPGPU has become an increasingly important computing trend, and its platform covers a wide range of high-performance computing, desktop computing and mobile computing. The integration of increasingly rich application fields and heterogeneous acceleration platforms can greatly improve the overall energy efficiency of the system. In particular, if the program to be developed conforms to a large-scale parallel structure, using GPU-CUDA technology can greatly improve operating efficiency. Although the Goldstein algorithm is the fastest algorithm for solving two-dimensional unwrapping, it still has a large amount of calculation and requires a long running time, which has become the bottleneck of system performance in many cases. Using GPGPU technology can improve the performance of Goldstein algorithm. However, the Branch cut step and the FloodFill step have serious data dependence, which restricts the parallelization and parallel efficiency of the algorithm. Therefore, the present invention is just an innovation aimed at the Branch cut step and the FloodFill step.

发明内容Contents of the invention

本发明的技术解决问题：克服现有技术的不足，提供一种基于异构加速平台的二维相位解缠绕方法，利用异构平台并行加速处理技术在不降低精度的情况下提高算法的实时性，并具有良好的可扩展性。The technology of the present invention solves the problem: overcomes the shortcomings of the existing technology, provides a two-dimensional phase unwrapping method based on a heterogeneous acceleration platform, and uses the heterogeneous platform parallel acceleration processing technology to improve the real-time performance of the algorithm without reducing the accuracy , and has good scalability.

本发明的技术方案：一种基于异构加速平台的二维相位解缠绕方法，其各个步骤都在异构平台上实现，其步骤如下：The technical solution of the present invention: a two-dimensional phase unwrapping method based on a heterogeneous acceleration platform, each step of which is implemented on the heterogeneous platform, and the steps are as follows:

（1）异构加速平台上定义相位数据中不一致的点，即Residue值为+/-1的点。(1) Define inconsistent points in the phase data on the heterogeneous acceleration platform, that is, points with a Residue value of +/-1.

（11）把相位数据分块，读入共享内存，共享内存中线程组(Block)中的每个线程计算一个点的Residue值，并保存；为了每个Block的最后一行和最后一列能按照一样的规则计算，每个Block需要多读入一行和一列数据；(11) Divide the phase data into blocks and read them into the shared memory. Each thread in the thread group (Block) in the shared memory calculates the Residue value of a point and saves it; for the last row and the last column of each Block can follow the same According to the rule calculation, each block needs to read one more row and one column of data;

（12）在求出Residue值之后，进行寻找并匹配偶极子操作(Dipole cut)，所述偶极子是指相邻并且有着相反极性的Residue值的点，如果该点的Residue值为+1，在该点周围寻找极性相反的点，如果有极性相反的点，则这两个点标记为BranchCut点，并分别把这两个对应的Residue值标记为+2或者-2，得到Residue数据；(12) After calculating the Residue value, search and match the dipole operation (Dipole cut), the dipole refers to the point adjacent to the Residue value with the opposite polarity, if the Residue value of the point is +1, look for points with opposite polarities around this point, if there are points with opposite polarities, mark these two points as BranchCut points, and mark the two corresponding Residue values as +2 or -2 respectively, Get Residue data;

（2）异构加速平台上匹配不一致点的过程，即Branch cut步骤，该步骤分为两个过程：局部匹配和全局匹配。(2) The process of matching inconsistent points on the heterogeneous acceleration platform, that is, the branch cut step, which is divided into two processes: local matching and global matching.

（21）首先进行局部匹配，把Residue数据分块后读入共享内存，在每一个Block内进行Branch cut操作，先由Residue值为+1的去寻找最近Residue值极性为负(-1、-2)的点或者数据边界，直到超过搜索范围；同步操作之后再由-1的点去需找极性为正的点，每一个Block多读N圈的Residue输入数据，即上下左右各多读N行，N为一个正整数参数，提高精度；如果该Block有Residue值非0的点未匹配，则把对各Block的标识(BlockFlag)写为+1或者-1，其极性跟Residue值极性相同；(21) First perform partial matching, divide the Residue data into blocks and read it into the shared memory, perform a Branch cut operation in each Block, and first find the nearest Residue value whose polarity is negative (-1, -2) point or data boundary, until it exceeds the search range; after the synchronization operation, go to the point with positive polarity from the point of -1, and each block reads more than N circles of Residue input data, that is, up, down, left, and right respectively. Read N lines, where N is a positive integer parameter to improve precision; if the Block has a non-zero Residue point that does not match, write the block flag (BlockFlag) as +1 or -1, and its polarity is the same as Residue The values have the same polarity;

（22）然后进行全局匹配，全局匹配与局部匹配一样的线程配置，采用一个线程读取BlockFlag标志，如果该BlockFlag标志为0，则结束，如果该BlockFlag标志不为0就说明该Block需要进行全局匹配；全局匹配过程时，若BlockFlag值为1，则找到最近的BlockFlag为-1或者边界上的Block，计算它们之间的最远距离和最近距离作为搜索范围，在该范围中肯定能找到匹配的点，得到匹配后的BranchCut数据，BranchCut即各个点经过Branch cut操作后的值；(22) Then perform global matching. Global matching and local matching have the same thread configuration. Use one thread to read the BlockFlag flag. If the BlockFlag flag is 0, it will end. If the BlockFlag flag is not 0, it means that the Block needs to be globally Matching; during the global matching process, if the BlockFlag value is 1, find the nearest BlockFlag of -1 or the Block on the boundary, calculate the farthest distance and the shortest distance between them as the search range, and definitely find a match in this range points, get the matched BranchCut data, and BranchCut is the value of each point after the Branch cut operation;

（3）在异构加速平台上实现FloodFill步骤，输出解缠绕后的相位数据，具体实现如下：(3) Implement the FloodFill step on the heterogeneous acceleration platform, and output the unwrapped phase data. The specific implementation is as follows:

输入匹配后的BranchCut数据，BranchCut数据中选取种子点（什么是种子点）作为起始点，标记为已经被解缠绕，种子点不能是被Branch cut的点，通过有限次重启设备函数Kernel，以一种动态的Block的组织方式把种子状态从中心往四周扩散；其中在Block内部，每个线程一次处理一行或者一列数据，然后四个方向，即从左到右、从上到下、从右到左、从下到上依次进行解缠绕处理，一个线程处理一行或一列后进行解缠绕处理需要一次同步，直到整个Block中没有需要被解缠绕的点时停止迭代，输出解缠绕后的相位数据。Input the matched BranchCut data, select the seed point (what is the seed point) as the starting point in the BranchCut data, mark it as unwound, the seed point cannot be the point cut by Branch, restart the device function Kernel for a limited number of times, and use a A dynamic block organization spreads the seed state from the center to the surroundings; inside the block, each thread processes one row or one column of data at a time, and then four directions, namely, from left to right, from top to bottom, from right to Left, and from bottom to top, the unwrapping process is performed sequentially. One thread needs to be synchronized once after processing a row or a column, and the iteration stops when there are no points that need to be unwrapped in the entire block, and the phase data after unwrapping is output.

在各个步骤中，对计算中使用和得到的各种数据均采用压缩存储，以提高运算速度。In each step, all kinds of data used and obtained in the calculation are compressed and stored to improve the operation speed.

（11）和（12）两个步骤合并，即在同一个设备函数Kenel内执行完成。The two steps (11) and (12) are combined, that is, they are executed in the same device function Kenel.

在步骤（21）的局部匹配中，把Residue数据分块后读入共享内存之前，先给Residue数据伪造一个边界，称为伪边界，即对行或者列的数目不是16的倍数时，扩充为16的倍数，用一个Residue值域外的值填充补全的部分；每个Block中，用同样的方法给Residue数据补N圈。In the partial matching of step (21), before the Residue data is divided into blocks and read into the shared memory, a boundary is forged for the Residue data, which is called a pseudo boundary, that is, when the number of rows or columns is not a multiple of 16, it is expanded to A multiple of 16, fill the completed part with a value outside the Residue value range; in each Block, use the same method to fill N circles for the Residue data.

在步骤（3）中，以一种动态的Block的组织方式把种子状态从中心往四周扩散式的实现过程为：将输入匹配后的BranchCut数据分为固定大小的块，称为small_block，一个Block处理4个邻接的small_Block组成的数据，称为一个big_Block；small_Block在big_Block中的分布呈“田”字型；big_Block有两种不同的组织方式，连续两次重启Kernel，对应位置的Block处理的small_Blcok不完全相同，即big_Block的组织方式会发生改变，这样small_Block会在不同的big_Block之间轮换，有些small_Block刚被缠绕后随即作为种子Block传播进另一个big_Block，这样就能使种子状态从中心往四周扩散；在上述操作中需要两个标志：Block_flag和Pixel_flag分别记录Block和每一个点的状态，Block_flag标识该small_Block是否已经被解缠，而Pixel_flag标识该点是否被解缠。In step (3), the implementation process of spreading the seed state from the center to the surroundings in a dynamic block organization is as follows: divide the input matched BranchCut data into blocks of fixed size, called small_block, a Block Processing data composed of 4 adjacent small_Blocks is called a big_Block; the distribution of small_Block in big_Block is in the shape of "field"; big_Block has two different organization methods, restarting the Kernel twice in a row, and small_Block processed by the Block at the corresponding position Not exactly the same, that is, the organization of big_Block will change, so that small_Block will rotate between different big_Blocks, and some small_Blocks will be spread into another big_Block as a seed block after being wound, so that the seed state can be made from the center to the surrounding Diffusion; two flags are required in the above operation: Block_flag and Pixel_flag respectively record the state of the Block and each point, Block_flag indicates whether the small_Block has been unwrapped, and Pixel_flag indicates whether the point is unwrapped.

在步骤（3）中的Block内部，解缠绕操作是按照从右到左、从上到下、从左到右和从下到上四个方向按行执行的，读取四个small_Block的Block_flag，不仅可以确定该Block是否要执行解缠绕操作，还可以确定种子点是从big_Block的哪一个角传播进来的，然后确定四个方向的执行顺序。Inside the Block in step (3), the unwinding operation is performed in four directions from right to left, from top to bottom, from left to right and from bottom to top, and the Block_flags of the four small_Blocks are read. Not only can it be determined whether the Block is to be unwound, but also it can be determined from which corner of the big_Block the seed point is propagated, and then the order of execution in the four directions can be determined.

本发明与现有技术相比的有益效果在于：The beneficial effect of the present invention compared with prior art is:

（1）本发明是在异构平台上完整实现用Goldstein算法进行二维解缠绕的方法，显著提高了二维解缠绕算法的执行速度，在输入数据大小为8192*8192的相位数据时，达到了超过500倍的加速比。(1) The present invention completely realizes the two-dimensional unwrapping method using the Goldstein algorithm on a heterogeneous platform, which significantly improves the execution speed of the two-dimensional unwrapping algorithm. When the input data size is 8192*8192 phase data, it reaches A speedup of more than 500 times was achieved.

（2）本发明在Branch cut步骤中局部匹配，把匹配过程局限在一个范围内，使整个二维解缠绕过程能在异构平台上完成并行计算，其思想对并行程序解决数据依赖有一定的参考意义。(2) The present invention performs partial matching in the Branch cut step, confining the matching process within a certain range, so that the entire two-dimensional unwrapping process can complete parallel computing on a heterogeneous platform. D.

（3）本发明中的FloodFill步骤达到了近800倍的加速比，其本身也是一个常用的算法，包含在很多函数库中；在其的实现过程中的动态组织方法也对并行编程和克服数据依赖有一定的指导价值。(3) The FloodFill step in the present invention has reached nearly 800 times of speedup, and it is also a commonly used algorithm in itself, which is included in many function libraries; the dynamic organization method in its implementation process is also very important for parallel programming and overcoming data Dependence has certain guiding value.

附图说明Description of drawings

图1为本发明实现流程图；Fig. 1 is the realization flowchart of the present invention;

图2为本发明中定义Residue示意图；Fig. 2 is the schematic diagram of defining Residue in the present invention;

图3为本发明中Branch Cut过程示意图；Fig. 3 is the Branch Cut process schematic diagram among the present invention;

图4为本发明中FloodFill过程示意图，其中黑色的点代表位于BranchCut的点，白色的点代表需要被解缠绕的点，而灰色的点代表已经解缠绕的点；从左往右数，第一幅图是起始状态，第二幅图是选取种子点后的状态，第三幅图是解缠绕中的状态，第四幅图是全部解缠绕后的状态；Fig. 4 is a schematic diagram of the FloodFill process in the present invention, wherein the black points represent the points located in BranchCut, the white points represent the points that need to be unwound, and the gray points represent the points that have been unwound; counting from left to right, the first The first picture is the initial state, the second picture is the state after selecting the seed point, the third picture is the state of unwrapping, and the fourth picture is the state after all unwrapping;

图5为本发明中FloodFill过程Block的动态组织结构示意图，其中左图是选取种子点后所在Block的状态，有图是第一次完成Kernel后的状态；Fig. 5 is a schematic diagram of the dynamic organizational structure of the FloodFill process Block in the present invention, wherein the left figure is the state of the Block after selecting the seed point, and the figure is the state after completing the Kernel for the first time;

图6为本发明中FloodFill过程中某个Block的状态示意图；Fig. 6 is a schematic diagram of the state of a certain Block in the FloodFill process in the present invention;

图7为使用不同压缩存储方法后共享内存的使用量，其中“无”表示不压缩存储，“A”表示合并输入输出数据，“B”表示合并BranchCut和PixelFlag，“C”表示1个字节存储中间变量；Figure 7 shows the usage of shared memory after using different compression storage methods, where "None" means no compressed storage, "A" means merging input and output data, "B" means merging BranchCut and PixelFlag, and "C" means 1 byte Store intermediate variables;

图8为使用不同压缩存储方法后同时激活的Block数量。Figure 8 shows the number of simultaneously activated blocks after using different compression storage methods.

具体实施方式detailed description

在介绍本发明之前，先对一些基本概念和术语进行一下说明。Before introducing the present invention, some basic concepts and terms are explained.

技术术语和专有名词说明：Description of technical terms and proper nouns:

GPGPU:图形处理器通用计算技术GPGPU: Graphics Processing Unit General Computing Technology

Block:CUDA专用术语，表示一个线程组Block: CUDA-specific term, representing a thread group

Kernel：CUDA专用术语，表示一个设备函数Kernel: CUDA-specific term, representing a device function

Phase unwrap:相位解缠绕Phase unwrap: Phase unwrapping

Residue：相位数据中的不一致点Residue: Inconsistent points in phase data

Dipole：偶极子，邻接且具有相反Residue极性的点对。Dipole: Dipole, a pair of points that are adjacent and have opposite Residue polarity.

Dipole cut:偶极子匹配的过程Dipole cut: the process of dipole matching

Branch cut:匹配不一致点的过程(两个单词)Branch cut: the process of matching inconsistent points (two words)

BranchCut：Branch cut步骤中得到的结果(一个单词)BranchCut: the result obtained in the Branch cut step (one word)

FloodFill:水漫填充，在相位解缠绕算法中用来解缠过程FloodFill: flood filling, used in the unwrapping process in the phase unwrapping algorithm

下面结合附图及实施例对本发明进一步详细说明。The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.

如图1所示，本发明具体实现步骤如下：As shown in Figure 1, the specific implementation steps of the present invention are as follows:

本发明的分为三个部分，都在异构平台上并行实现：定义Residue与Dipole cut、Branch cut和FloodFill。The present invention is divided into three parts, all implemented in parallel on heterogeneous platforms: defining Residue and Dipole cut, Branch cut and FloodFill.

（1）定义Residue与Dipole cut。(1) Define Residue and Dipole cut.

这两个过程放在一个Kernel中，可以减少读全局存储器的次数，提高性能。Block配置为16*16个线程(可调整)。其具体步骤为：These two processes are placed in one Kernel, which can reduce the number of times to read the global memory and improve performance. Block configuration is 16*16 threads (adjustable). The specific steps are:

（11）读入相位数据到共享内存。由于每个点需要用到本身、下、右下、右边四个Phase数据，为了处理边界的方便，每个Block需要读入17*17的Phase数据，也就是多读一行和一列。相位数据只读不写，不会发生冲突。(11) Read the phase data into the shared memory. Since each point needs four Phase data of itself, bottom, bottom right, and right, for the convenience of processing the boundary, each Block needs to read 17*17 Phase data, that is, read one more row and one column. Phase data can only be read but not written, and no conflicts will occur.

（12）每个线程求一个点的Residue值，根据图2和公式1～4定义Residue。(12) Calculate the Residue value of a point for each thread, and define the Residue according to Figure 2 and formulas 1-4.

（13）需要一个同步操作，确保Block内所有点都求出了Residue的值。(13) A synchronous operation is required to ensure that all points in the Block have obtained the value of Residue.

（14）接着执行Dipole cut操作：值为+1的点到其周围的八个点寻找是否有-1的点，若有，则匹配，分别标志其Residue值为2或-2。在Dipole cut匹配的过程中，在Branchcut步骤中可能会再次用到，能够提高精度。每寻找一个方向则同步一次，可以避免Block内的冲突。相位数据和Residue数据都读/写入设备的共享内存中，大大提高了执行速度。(14) Then execute the Dipole cut operation: from the point with a value of +1 to the eight points around it, find whether there is a point of -1, if there is, match it, and mark its Residue value as 2 or -2 respectively. In the process of Dipole cut matching, it may be used again in the Branchcut step, which can improve the accuracy. Synchronize every time you look for a direction, which can avoid conflicts in the Block. Both phase data and residue data are read/written in the shared memory of the device, which greatly improves the execution speed.

（2）Branch cut步骤。(2) Branch cut step.

该过程输入Residue数据，并读入共享内存。为了解决数据依赖并获得很好的性能，本发明把这一步分为了两个过程：局部匹配和全局匹配。划分Block后，Block内线程的个数和数据的个数相等(例如都为16*16个)。The process enters Residue data and reads it into shared memory. In order to solve data dependence and obtain good performance, the present invention divides this step into two processes: local matching and global matching. After the block is divided, the number of threads in the block is equal to the number of data (for example, both are 16*16).

（21）局部匹配。把Branch cut操作限定在Block内，使得该步能并行执行。步骤为：(21) PARTIAL MATCHING. Limit the Branch cut operation to the Block so that this step can be executed in parallel. The steps are:

（211）制造“伪边界”。相位数据的大小可能并不是Block大小的整数倍，例如相位数据大小为135*135(行*列)，而Block为16*16。由于所有Block执行一样的操作，相位数据边界上的Block对数据边界的控制逻辑比较复杂，所有的Block都需要执行边界判断操作。制造伪边界后不需要边界判断逻辑：(211) Create a "pseudo boundary". The size of the phase data may not be an integer multiple of the block size, for example, the phase data size is 135*135 (row*column), while the block is 16*16. Since all blocks perform the same operation, the control logic of the block on the phase data boundary to the data boundary is relatively complicated, and all blocks need to perform boundary judgment operations. No boundary judgment logic is required after creating a pseudo-boundary:

假设Block的线程配置为16*16，相位数据的尺寸为Row*Column，用一个未用到的值(如9)把Residue数据补全为：(16*k_r+N)*(16*k_c+N)，即创造了一个伪边界，从而省略边界的控制逻辑，简化程序。满足：Row<16*k_r+0N<Row+16，Column<16*k_c+N<Column+16，其中k_r、k_c为正整数。这一步在共享内存时直接填充就可以做到，不增加使用设备存储器的大小，从而也不增加多余的操作。BranchCut按照同样的方法补全。Assuming that the thread configuration of Block is 16*16, and the size of phase data is Row*Column, use an unused value (such as 9) to complete the Residue data as: (16*k _r +N)*(16*k _c + N), which creates a pseudo boundary, thereby omitting the control logic of the boundary and simplifying the program. Satisfy: Row<16*k _r +0N<Row+16, Column<16*k _c +N<Column+16, where k _r and k _c are positive integers. This step can be done by directly filling in the shared memory, without increasing the size of the used device memory, and thus not increasing redundant operations. BranchCut completes in the same way.

（212）读Residue数据进共享内存。把匹配范围局限在Block内之后，可能产生错误的导向而去匹配“错误”的点(其最佳匹配可能在旁边的Block内)，所以跟上一步类似的思想，每个Block多读N圈数据，N为小于4的整数。实验中N为2就能达到很好的效果。处于数据边界Block，直接用上一步“伪边界的方法”在共享内存中补全。(212) Read Residue data into shared memory. After limiting the matching range to the Block, it may generate a wrong orientation to match the "wrong" point (the best match may be in the next Block), so it is similar to the previous step, read N circles more for each Block Data, N is an integer less than 4. In the experiment, when N is 2, good results can be achieved. If it is at the data boundary Block, directly use the "pseudo-boundary method" in the previous step to complete it in the shared memory.

（213）每一个线程读取指定位置的Residue值，若为+1，则到周围去寻找-1的点或者数据边界(不是Block边界，超过Block边界后则不处理)，若为0则结束。(213) Each thread reads the Residue value at the specified position. If it is +1, then go around to find a -1 point or data boundary (not the Block boundary, and it will not be processed after exceeding the Block boundary), and if it is 0, it will end .

（214）若找到，在存放BranchCut数据的矩阵中，把这两个点连线上的点对应的位置上写为1，表示这些点为Branch cut点。并分别把这两个点的Residue值设为+2和-2。(214) If found, in the matrix storing the BranchCut data, write 1 at the position corresponding to the point on the connecting line between these two points, indicating that these points are Branch cut points. And set the Residue values of these two points to +2 and -2 respectively.

（215）若未找到，则把该Block的BlockFlag置为+1，表示该Block未完全匹配。(215) If not found, set the BlockFlag of the Block to +1, indicating that the Block is not completely matched.

（216）同步后，对Residue值为-1的情况执行2.1.3-2.1.5类似的操作。(216) After synchronization, perform operations similar to 2.1.3-2.1.5 for the case where the Residue value is -1.

（217）写回数据。但可能的存在写冲突：在多读入的N圈数据的对应位置写Branchcut数据时，由于相邻的Block可能同时写这个值，会带来写冲突。但注意到Branch cut的值只为0和1让值为1时才写进Branch cut，可以解决避免冲突。另一方面由于Block之间的快慢关系，导致写Residue值时，在相应位置出现同样的冲突。同样，只对改变了的值进行写操作可以避免冲突。(217) Write back data. However, there may be write conflicts: when writing Branchcut data at the corresponding position of the multi-read N circle data, since adjacent blocks may write this value at the same time, it will cause write conflicts. But notice that the value of Branch cut is only written into Branch cut when the value is 0 and 1 so that the value is 1, which can solve the problem of avoiding conflicts. On the other hand, due to the speed relationship between blocks, the same conflict occurs at the corresponding position when writing the Residue value. Also, writing only to values that have changed avoids conflicts.

（22）全局匹配。如果Local matching中留下未匹配的点（例如读入该Block所以的Residue点都为+1，这些点都没有被匹配），则会让全局匹配变得很必要，即在整个相位数据中寻找能匹配的点。详细步骤为：(22) GLOBAL MATCHING. If there are unmatched points left in Local matching (for example, the Residue points read into the Block are all +1, and these points are not matched), it will make global matching necessary, that is, to find in the entire phase data points that can match. The detailed steps are:

（221）由一个线程读取BlockFlag标志。在局部匹配的2.1.5和2.1.6步骤时写入了BlockFlag的值，它能判断该Block是否需要进行全局匹配操作。若BlockFlag为0，则结束。(221) The BlockFlag flag is read by a thread. In steps 2.1.5 and 2.1.6 of local matching, the value of BlockFlag is written, which can determine whether the Block needs to perform global matching operations. If BlockFlag is 0, then end.

（222）若BlockFlag值为1，则找到最近的BlockFlag为-1的Block，或者边界上的Block，计算它们之间的最远距离和最近距离作为搜索范围，在该范围中肯定能找到匹配的点。(222) If the BlockFlag value is 1, then find the nearest Block with a BlockFlag of -1, or a Block on the boundary, and calculate the farthest and shortest distances between them as the search range, in which a match can definitely be found point.

（223）若BlockFlag值为-1，操作跟上一步类似。(223) If the BlockFlag value is -1, the operation is similar to the previous step.

（3）FloodFill步骤。(3) FloodFill step.

该过程就像一块石头投入湖中，然后波纹一圈一圈向四周传播，直到湖岸。整个湖面就像相位数据，石头就像种子点，波纹的传播方式就是FloodFill的作用了。图4简单了描述了该步骤。The process is like throwing a stone into a lake, and then the ripples spread round and round until the lake shore. The entire lake surface is like phase data, stones are like seed points, and the propagation of ripples is the function of FloodFill. Figure 4 briefly describes this step.

（31）动态的Block的组织方式：(31) Organization of dynamic blocks:

把整个相位数据划分为一个个块(例如16*16)，根据算法，需要选取一个种子点，本发明首先在最中间的Block中选择种子点。当某个Block中有已经有被解缠的点，这个Block可以被激活执行解缠操作。这样有一个重要问题是怎样在Block之间传播种子状态(已经被解缠的点)，激活其他的Block进行计算。The entire phase data is divided into blocks (for example, 16*16), according to the algorithm, a seed point needs to be selected, and the present invention first selects the seed point in the most middle Block. When there are already unwrapped points in a Block, this Block can be activated to perform the unwrapping operation. In this way, an important issue is how to propagate the seed state (points that have been disentangled) between Blocks and activate other Blocks for calculation.

为了解决种子状态的传播，本发明创造了一种Block的动态组织方式。图5中描述了这种传播方式。称上述提到的块为small_Block，一个Block处理4个邻接的small_Block组成的数据，不妨称为一个big_Block，如在图5中，细线划分成small_Block，粗线划分的为big_Block，small_Block在big_Block中的分布呈“田”字型。big_Block有两种不同的组织方式，连续两次重启Kernel，对应位置的Block处理的small_Blcok不完全相同，即big_Block的组织方式会发生改变。这样small_Block会在不同的big_Block之间轮换，有些small_Block刚被unwrap后随即作为种子Block传播进另一个big_Block，这样就能使种子状态从中心往四周扩散。In order to solve the propagation of the seed state, the present invention creates a dynamic organization method of Block. This mode of propagation is depicted in Figure 5. The above-mentioned block is called small_Block. A Block processes data composed of 4 adjacent small_Blocks. It may be called a big_Block. As shown in Figure 5, the thin line is divided into small_Block, and the thick line is divided into big_Block. Small_Block is in big_Block The distribution is in the shape of "Tian". big_Block has two different organization methods. If the Kernel is restarted twice in a row, the small_Block processed by the corresponding block is not exactly the same, that is, the organization method of big_Block will change. In this way, the small_Block will rotate between different big_Blocks. Some small_Blocks will be spread into another big_Block as a seed block after being unwrapped, so that the seed state can spread from the center to the surrounding.

需要两个标志：Block_flag和Pixel_flag分别记录Block和每一个点的状态，Block_flag标识该small_Block是否已经被解缠，而Pixel_flag标识该点是否被解缠。Two flags are required: Block_flag and Pixel_flag respectively record the status of Block and each point, Block_flag indicates whether the small_Block has been unwrapped, and Pixel_flag indicates whether the point is unwrapped.

（32）Block内部：(32) Inside the Block:

原算法中，为四邻接法，即某个点解缠之后，激活上下左右四个点，然后这四个点可以进行解缠绕过程，但这样不利于并行处理。为了能够顺利并行化并取得不错的效果，可以让一个线程一次处理一行(或者一列)数据，例如该行中前一个点已经被解缠，且待处理的点本身不是BranchCut点，则该点可以被解缠。一个线程从左到右处理一行后，再从上到下、从右到左、从下到上四个方向依次处理，直到所有的点都被解缠。每处理一行之后需要一次同步操作。In the original algorithm, it is the four-neighbor-joining method, that is, after a certain point is unwrapped, activate the upper, lower, left, and right four points, and then these four points can be unwound, but this is not conducive to parallel processing. In order to achieve smooth parallelization and achieve good results, one thread can process one row (or one column) of data at a time. For example, the previous point in the row has been unwrapped, and the point to be processed is not a BranchCut point, then the point can be Unwrapped. After a thread processes a line from left to right, it then processes it sequentially from top to bottom, right to left, and bottom to top until all points are unwrapped. A synchronization operation is required after each row is processed.

若在一个Block中，如果每种情况下都是按照从左到右，再从上到下，从右到左，最后从下到上这种方式，然后不断迭代，可能会导致产生“空”的迭代(即处理某行时，没有新增加被解缠的点)。如图6中所示的情况下，需要三次迭代才能使整个Block被解缠。但是如果按照先右到左，再从上到下，接着从左到右，最后从下到上，这样只需要两次迭代就能完成。从Block level可以看到，种子点只能从矩阵的四个角中传进来，所以用一个标志记录种子点传播进来的位置，根据标志选择解缠起始的方向，能大大减少迭代次数，提高性能。If in a Block, if each case follows the method from left to right, then from top to bottom, from right to left, and finally from bottom to top, and then iterates continuously, it may result in "empty" iterations (that is, when a row is processed, no new unwrapped points are added). In the case shown in Figure 6, three iterations are required to unwrap the entire block. But if you go from right to left, then from top to bottom, then from left to right, and finally from bottom to top, it only takes two iterations to complete. As can be seen from the Block level, the seed point can only be passed in from the four corners of the matrix, so use a flag to record the position where the seed point propagates, and select the starting direction of unwrapping according to the flag, which can greatly reduce the number of iterations and improve performance.

(4)压缩存储(4) Compressed storage

压缩存储能节约存储空间，特别是节约共享存储的空间从而增加同时激活的Block数量；而且压缩存储还能减少访存、计算的次数，大大提高程序的性能。Compressed storage can save storage space, especially the space of shared storage to increase the number of blocks activated at the same time; and compressed storage can also reduce the number of memory accesses and calculations, greatly improving the performance of the program.

（41）本发明中，输入数据(待解缠的相位数据)和输出数据(已解缠的相位数据)使用相同的一块存储，即直接在输入数据上进行解缠的操作。(41) In the present invention, the input data (phase data to be unwrapped) and the output data (phase data that has been unwrapped) use the same piece of storage, that is, the unwrapping operation is performed directly on the input data.

（42）对中间数据和一些标志只使用1个字节来存储，如用char类型来存储BranchCut数据、BlockFlag数据等。(42) Use only 1 byte to store intermediate data and some flags, such as using char type to store BranchCut data, BlockFlag data, etc.

（43）把Branch cut数据和PixelFlag合并，在Branch cut基础上增加一个值为2来表示PixelFlag为1时的意义。(43) Merge the Branch cut data with PixelFlag, and add a value of 2 on the basis of Branch cut to indicate the meaning when PixelFlag is 1.

由于这些数据在使用时都要读入到共享内存中，通过压缩存储，减少了共享内存，增加了同时激活的Block数量。图7和图8列出了FloodFill过程中使用不同压缩方法后共享内存的使用情况和同时激活的Block数量。Since these data must be read into the shared memory when they are used, the shared memory is reduced and the number of simultaneously activated Blocks is increased by compressing the storage. Figure 7 and Figure 8 list the usage of shared memory and the number of simultaneously activated blocks after using different compression methods in the FloodFill process.

本发明未详细阐述部分属于本领域技术人员的公知技术。Parts not described in detail in the present invention belong to the known techniques of those skilled in the art.

以上虽然描述了本发明的具体实施方法，但是本领域的技术人员应当理解，这些仅是举例说明，在不背离本发明原理和实现的前提下，可以对这些实施方案做出多种变更或修改，因此，本发明的保护范围由所附权利要求书限定。Although the specific implementation methods of the present invention have been described above, those skilled in the art should understand that these are only examples, and various changes or modifications can be made to these embodiments without departing from the principle and realization of the present invention. Therefore, the protection scope of the present invention is defined by the appended claims.

Claims

1. a kind of based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Each step of methods described is equal Accelerate to realize on platform in isomery, realize step as follows：

(1) isomery accelerates to define inconsistent point in phase data on platform, and that is, Residue value is the point of +/- 1；

(11) phase data piecemeal, read in shared drive, each thread in sets of threads Block calculates a point Residue value, and preserve；Last column in order to ensure each Block can be according to the same rule calculating, often with last string Individual Block needs many reading a line and a column data；

(12) after obtaining Residue value, found and mated dipole child-operation (Dipole cut), described dipole is Refer to the point of Residue value that is adjacent and having opposite polarity, if the Residue value of this point is+1, find around this point Opposite polarity, if there are opposite polarity point, then this two points are labeled as BranchCut point, and a Residue value It is labeled as+2, another Residue value is labeled as -2, obtains Residue data；

(2) isomery accelerates to mate the process of inconsistent point, i.e. Branch cut step on platform, and this step is divided into two processes： Local matching and global registration；

(21) carry out local matching first, reading in shared drive after Residue deblocking, carry out in each Block Branch cut operates, first by Residue value for+1 looking for nearest Residue value polarity be bear (- 1, -2) point or Data boundary, until exceeding hunting zone；Looking for polarity by -1 point again after simultaneously operating is positive point, each Reading the Residue input data of N circle Block, i.e. each up and down many reading N row, N is a positive integer parameter, improves precision more； If the point that this Block has Residue value non-zero does not mate ,+1 or -1 is written as to mark BlockFlag of each Block, Its polarity is identical with Residue value polarity；

(22) thread configuration as and then carry out global registration, global registration is used with local matching, is read using a thread Take BlockFlag to indicate, if this BlockFlag is masked as 0, terminate；Explanation if this BlockFlag mark is not 0 This Block needs to carry out global registration；During global registration process, if BlockFlag value is+1, find nearest BlockFlag is -1 or borderline Block, calculates maximum distance between them and minimum distance as hunting zone, The sure point finding coupling in this range, the BranchCut data after being mated, BranchCut is that each point passes through Value after Branch cut operation；

(3) accelerate to realize FloodFill step, the phase data after output unwrapping on platform in isomery, be implemented as follows：

BranchCut data after input coupling, in BranchCut data, selected seed point, as starting point, is labeled as By unwrapping；Described seed point can not be by the point of Branch cut, by limited number of time restarting equipment function Kernel, with one kind The organizational form of dynamic Block spreads from center sub-states toward surrounding；Wherein inside Block, each thread is once Process a line or a column data, then four direction, that is, from left to right, from top to bottom, from right to left, enter successively from top to bottom Row unwrapping is processed, and carries out unwrapping process needs one subsynchronous, until whole Block after a thread process a row or column In do not need by when uncoiled stop iteration, output unwrapping after phase data.

2. according to claim 1 based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Institute State in each step of method, the various data using and obtain all are stored using compression, to improve arithmetic speed in calculating.

3. according to claim 1 based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Institute State (11) and (12) two steps merge, complete in same Kernel.

4. according to claim 1 based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Institute State in the local matching of step (21), before reading in shared drive after Residue deblocking, first give Residue data pseudo- Make a border, referred to as pseudo- border, to the number of row or column be not 16 multiple when, be extended for 16 multiple, with one Value outside Residue codomain fills the part of completion；In each Block, mend N circle with same method to Residue data.

5. according to claim 1 based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Described In step (3), process is realized from center toward surrounding diffusion type sub-states with a kind of organizational form of dynamic Block For：BranchCut data after input coupling is divided into the block of fixed size, referred to as small_block, a Block processes 4 The data of individual adjacent small_Block composition, referred to as one big_Block；Small_Block in big_Block point Cloth is in sphere of movements for the elephants type；Big_Block has two kinds of different organizational forms, double restarts Kernel, the Block of correspondence position The small_Blcok processing is incomplete same, and that is, the organizational form of big_Block can change, such small_Block meeting Rotation between different big_Block, some small_Block are just propagated into as seed Block after unwrapping immediately Another big_Block, thus can make sub-states spread toward surrounding from center；Two marks are needed in aforesaid operations： The state that Block_flag and Pixel_flag records Block respectively and each is put, Block_flag identifies this small_ Whether Block is twined by solution, and Pixel_flag identifies whether this point is twined by solution.

6. according to claim 5 based on isomery accelerate platform two-dimensional phase unwrapping method it is characterised in that：Described In step (3), described Block_flag can not only determine whether this Block will execute unwrapping operation and plant additionally it is possible to determine Son point is from which corner of big_Block to propagate, it is then determined that the execution sequence of four direction.