WO2014032008A2

WO2014032008A2 - Method and system for generating mesh from images

Info

Publication number: WO2014032008A2
Application number: PCT/US2013/056473
Authority: WO
Inventors: Panagiotis FOTEINOS; Nikos CHRISOCHOIDES
Original assignee: Old Dominion University
Current assignee: Old Dominion University
Priority date: 2012-08-23
Filing date: 2013-08-23
Publication date: 2014-02-27
Anticipated expiration: 2015-02-23
Also published as: WO2014032008A3

Description

METHOD AND SYSTEM FOR GENERATING MESH FROM IMAGES

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 61 /692,538, filed August 23, 2012, which is hereby incorporated by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with Government support under contract CCF-1 139864, CCF- 1 136538, and CSi-1 136536 awarded by the U.S. National Science Foundation. The Government has certain rights in the invention.

TECHNICAL FI ELD

The present invention relates to solid modeling through computer graphics,

BACKGROUND ART

image-to-mesh (1.2 ) conversion enables patient-specific finite Element modeling in image guided diagnosis and therapy. The ability to tessellate a medical image of multiple tissues into teirahedra enables Finite Element (FE) Analysis on patient-specific models This has significant implications in many areas, such as imaged-guided therapy, development of advanced patient- specific blood flow mathematical models for the prevention and treatment of stroke, patient-specific interactive surgery simulation for training young clinicians, and study of biomechanical properties of collagen nano-straws of patients with chest wail deformities, to name just a few.

There is a large body of work on constructing guaranteed quality meshes for Computer Aided Design (CAD) models. The specificity of CAD-oriented approaches is that the meshes have to match exactly to the boundaries of the models. The most widely used guaranteed-quality CAD-oriented approach is based on Delaunay refinement, Delaunay meshing is a popular technique for generating tetrahedral meshes, since it is able to mesh various domains such as: polyhedral domains, domains bounded by surfaces, or multi-labeled images, offering at the same time mathematical guarantees on the quality and the fidelity of the final mesh. In the literature, Delaunay refinement techniques have been employed to mesh objects whose surface consists of piecewise linear elements which are explicitly given,

Delaunay refinement is a popular automatic mesh generation and refinement method which generates and refines meshes by present algorithms and at the same time ensures termination and good grading. Mesh generation by Delaunay refinement is a widely used technique for constructing guaranteed quality triangular and tetrahedral meshes. The quality guarantees are usually provided in terms of the bounds on circumradius-to-shortest edge ratio and on the grading of the resulting mesh. Traditionally circumcenters of skinny elements and middle points of boundary faces and edges are used for the positions of inserted points.

Delaunay refinement is very useful for parallel and/or time critical applications when human intervention and reruns are prohibitively time consuming.

Delaunay refinement algorithms are based on the method of inserting new points into the mesh to improve the aggregate quality of elements (triangles in two dimensions or tetrahedra in three dimensions). Quality is traditionally defined as the ratio of the circumradius of the element to the length of its shortest edge, The use of this measure leads to the improvement of the minimum angle in two dimensions which helps to improve the conditioning of the stiffness matrix used by a field solver, in three dimensions this measure does not yield such direct benefits, however, it has been shown that bounded circumradius-to-shortest edge ratio of mesh elements is sufficient to obtain optimal convergence rates for the solution of Poisson equation using the control volume method.

As described, for example, in related U.S. Application Publication Nos. 2012/01 4397 and 2012/0200566, the entirety of each of which is incorporated by reference herein, one of the problems with Delaunay refinement in 3D is that it allows only for a bound on circumradius-to- shortest edge ratio of tetrahedra, which does not help to improve the dihedral angles. A bound represents a criterion for evaluating the quality of an element in the mesh. As a result, almost flat tetrahedra called slivers can survive. There are a number of post-processing techniques to eliminate slivers, as referenced in U.S. Application Publication No. 2012/0154397. While some of them have been shown to produce very good dihedral angles in practice, no implementation is available that can guarantee significant ( and above) dihedral angle bounds.

As further noted in U.S. Application Publication No. 2012/0154397, other work describes a guaranteed quality tetrahedral meshing algorithm for general surfaces. Such work offers a one-sided fidelity guarantee (from the mesh to the model) in terms of the Hausdorff distance, and, provided the surface is sufficiently smooth, also the guarantee in the other direction (from the model to the mesh). The referenced algorithm first constructs an octree that covers the model, then fills the octree leaves with high quality template elements, and finally warps the mesh vertices onto the model surface, or inserts vertices on the surface, and locally modifies the mesh. Using interval arithmetic, others have shown that new elements have dihedral angles above a certain threshold. However, images are not smooth surfaces and this technique has not been extended to mesh images. One approach could be to interpolate or approximate the boundary pixels by a smooth surface, but it would be complicated by the need to control the maximum approximation (interpolation) error. On the other hand, as described in U.S. Application

Publication No. 2012/0154397, an I2M solution can benefit from the fact that images provide more information on their structure than general surfaces. Heuristic solutions to the Ϊ2Μ problem fall into two categories: (I) first coarsen the boundary of the image, and then apply CAD-based algorithms to construct the final mesh, (2) construct the mesh which covers the image, and then warp some of the mesh vertices onto the image surface. The first approach tries to address the fidelity before the quality requirements, while the second approach does it in reverse order, Neither of these approaches is able to guarantee the quality of elements in terms of dihedral angles. As a result, the output of one step does not produce an optimal input for the other step.

As further noted in related U.S. Application Publication No. 2012/0200566, one of the central questions in the Delaunay refinement method has been the choice of the positions for the new points. The traditional approach uses circumcenters of mesh elements, however, a number of other locations have been used to achieve various mesh optimizations.

SUMMARY OF INVENTION

According to an embodiment, the present disclosure is directed to a system and method therefor for generating mesh from an image. According to another embodiment, the present disclosure is directed to a recording medium storing a program that instructs a computer to generate a mesh from an image according to the above claimed methods. The method includes processing a parallel real-time image-to- Mesh conversion using an algorithm (PI2M) for processing an image; wherein the system is configured to process the image without surface extraction pre-processing, The method includes recovering boundaries for an image and generating meshes through a sequence of dynamic insertion and deletion of points in a Delauna fashion. The image can be a three- dimensional medical image. The system implementing the method can be configured to process images tor at least one of finite element analysis, simulation, interpolation, computer vision, or robotics, Embodiments include a method for generating mesh from an image, the method being implemented by a computer system operative!}' connected to at least one data storage device for storing image data for images, at least one computer, and at least one computer readable medium storing thereon computer code which when executed by the a least one computer performs the method, in some cases, the method will include (a) receiving a digital image of an object © having a volume and a surface 9©;(b) applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral t; (e) processing a parallel real-time Image-to- Mesh conversion using an isosurface based algorithm (PI2M) for processing the digital image; and (d) wherein the system is configured to process the digital image without surface extraction preprocessing. This method may include the steps of recovering boundaries for the digital image and generating meshes through a sequence of dynamic insertion and deletion of points in a DeSaunay fashion until circumceniers of ali teirahedra i lie inside ©. Meshing may comprise the generation of a volume mesh and an isosurface mesh of the object © derived from the image and not from a polyhedral domain. For example, without limitation, the image may be a three-dimensional medical image.

As detailed below, the method is adaptable to processing of the parallel real-time Image-to- Mesh conversion is by providing a plurality of parallel processing threads 7j, each of threads 7} having a poor element list of tetrahedrons t requiring refining.

Where δ is a predetermined sampling density, the meshes include a plurality of tetrahedrons i, /is a facet, and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes. This refining can include the application of refining rules, such as: inserting z \fz is at a distance not closer than δ to any other isosurface vertex, for intersecting tetrahedron and wherein z is equal to cfp (c (t));

inserting c(t) if radius r(t) > 2 * 5 for intersecting tetrahedron / and wherein z is equal to cfp (c

(0);

inserting c _surf ( ) wherein / is a restricted facet and either a smallest planar angle of / is less than 30° or a vertex of / is not a feature vertex;

inserting c(t) for interior tetrahedron t having a radius-edge ratio larger than 2;

inserting c(t) for interior tetrahedron t if r(i) > a user defined size function sf (c( ); and deleting free vertices closer than 2 · 6 to z whenever a Steiner vertex z is inserted into a mesh. Additional refining rules may include: (a) deleting, for any illegal facet, all vertices of the illegal facet that do not lie on the surface, and inserting a point p on Vor(f) f) 5©; and (b) rejecting, any sliver circumcenter that eliminates a facet/ that is legal, and inserting a point/; on Vor(f) Π dO.

As noted above, the processing of a parallel real-time Image-to-Mesh conversion may be by providing a plurality of parallel processing threads T>, each of threads Ί) having a poor element list of tetrahedrons t requiring refining. Optionally, the method may include the step of finding and triangulating cavities C of insertions. n one embodiment, the method may include the balancing processing load among threads 7) by one or more of random work stealing, hierarchical work stealing and static box partitioning, in another embodiment, the method may include managing contention in one of a variety of means, including by pausing at least one of the plurality of processing threads without livelocks in the remaining threads, For example, contention may be managed locally by designating one of the plurality of processing threads T>, as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads T. without livelocks.

The present approach extends to a non-transitory recording medium storing a program that instructs a computer to generate a mesh from an image, the method comprising receiving a digital image of an object © having a volume and a surface d ©; applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral /; processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm (ΡΪ.2Μ) for processing the digital image; wherein the system is configured to process the digital image without surface extraction pre-processing. Such embodiments may include the variations common to such methods, including the above described optional embodiments.

The present approach extends to a system for performing a process on an image, the process being implemented by a computer system operatively connected to at least one data storage device in which is stored image data, and comprising at least one computer and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs a method. In a system embodiment, the method may include ail variations disclosed herein for such method. For example, the method may include the steps of: (a) receiving a digital image of an object © having a volume and a surface 3 O; (b) applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral t; (c) processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm (PI2M) for processing the digital image: and (d) wherein the system is configured to process the digital image without surface extraction preprocessing. BRIEF DESCRIPTION OF D AWINGS

To the accomplishment of the foregoing and related ends, certain illustrative

embodiments of the invention are described herein in connection with the following description and the annexed drawings. These embodiments are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages, embodiments and novel features of the invention may become apparent from the following description of the invention when considered in conjunction with the drawings. The following description, given by way of example, but not intended to limit the invention solely to the specific embodiments described, may best be understood in conjunction with the accompanying drawings, in which:

FIG, 1 illustrates aspects of system and storage or media embodiments.

FIG. 2(a) shows an image of a sample set of a liver's surface according to an embodiment.

FIG. 2(b) shows the Deiaunay trianguiation of the samples.

FIG. 2(c) shows the restricted trianguiation D_j0 (V) according to an embodiment.

FIG. 3(a) shows an image of a sample set of a liver's surface according to an embodiment.

F IG. 3(b) shows the Deiaunay trianguiation of the samples.

FIG, 3(c) shows the restricted trianguiation D_jC, (V) according to an embodiment.

FIG. 4 shows an implementation design,

FIGS. 5(a) and (b) show two-dimensional illustrations with re-triangulated areas,

FIG. 5 shows a graph representing the number of threads against non-local PEL removals.

FIG, 6 includes tables of specification and optimization.

FIG. 7 is a table of mesh statistics.

FIG. 8 is a chart of PEL removals,

FIG. 9 illustrates a mesh with statistics.

FIG, 10 is a table of contention manager performance.

FIG. l 1 shows scaling performance.

FIGS, 12(a) and (b) show exemplary communication patterns for embodiments of the present invention,

FIG. 13 shows data about scaling performance of Hierarchical Work. Stealing.

FIG. 14 illustrates a thread about to busy-wait.

FIG. 15 is a comparison among contention managers. FIG. 16 is a chart of specifications.

FIG. 17 shows data about scaling performance of Random Work Stealing.

FIG. 1 8 shows data about weak scaling performance.

FIG. 19 is an overhead data chart.

FIG. 20 is a chart of weak scaling performance.

FIGS. 21(a) and (b) illustrate aspects of point rejections strategy.

FIG. 22(a) illustrates a virtual box meshed into 6 tetrahedra.-

FIG, 22(b) illustrates a final mesh being carved by rules.

FIG. 22(c) illustrates an end set with tetrahedrai whose circumcenter lies inside the object as a geometrically and topological!}' correct mesh.

FIG. 23 shows aspects of an embodiment of a parallel mesh generation algorithm.

FIGS. 24(a)-(c) show aspects of an embodiment of a parallel mesh generation algorithm.

FIG. 25 illustrates a contention manager time steps,

FIG. 26 is a table of thread execution data.

FIG. 27 is a table of input image data.

FIGS. 28(a)-(b) show output meshes.

FIG. 29 is a table of single thread performance data,

DESCRIPTION OF EMBODIMENTS

Introduction

It is noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as "comprises," "comprised," "comprising," and the like can have the meaning attributed to it in U.S. patent law; that is, they can mean "includes," "included," "including," "including, but not limited to" and the like, and allow for elements not explicitly recited. Terms such as "consisting essentially of and "consists essentially of have the meaning ascribed to them in U.S. patent law; that is, they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the invention. Embodiments of the present invention are disclosed or are apparent from and encompassed by, the following description.

Embodiments of the invention may be implemented by a programmable digital computer or a system using one or more programmable digital computers and computer readable storage media, including those designed for processing CAD-based algorithms as known to those of ordinary skill in the art. In one embodiment, Figure 1 depicts an example of one such computer system 100. which includes at least one processor 1 0, such as, e.g., an Intel or Advanced Micro Devices microprocessor, coupled to a communications channel or bus 1 12. The computer system 100 further includes at least one input device 1 14 such as, e.g., a keyboard, mouse, touch pad or screen, or other selection or pointing device, at least one output device 1 16 such as, e.g., an electronic display device, at least one communications interface 1 18. at least one computer readable medium or data storage device 1.20 such as a magnetic disk or an optical disk and memory 122 such as Random- Access Memory (RAM), each coupled to the communications channel 1 12, The communications interface 1 18 may be coupled to a network 142.

One skilled in the art will recognize that many variations of the system 100 are possible, e.g., the system 100 may include multiple channels or buses 12, various arrangements of storage devices 120 and memory 122, as different units or combined units, one or more computer-readable storage medium (CRSM) readers 136, such as, e.g., a magnetic disk drive, magneto-optical drive, optical disk drive, or flash drive, multiple components of a given type, e.g., processors 1 10, input devices 1 14, communications interfaces 1 18, etc.

In one or more embodiments, computer system 100 communicates over the network 142 with at least one computer 144, which may comprise one or more host computers and/or server computers and/or one or more other computers, e.g. computer system 100, performing host and/or server functions including web server and/or application server functions. In one or more embodiments, a database 146 is accessed by the at least one computer 144. The at least one computer 144 may include components as described for computer system 100, and other components as is well known in the computer arts. Network 142 may comprise one or more LANS, WANS, intranets, the Internet, and other networks known in the art. in one or more embodiments, computer system 100 is configured as a workstation that communicates with the at least one computer 144 over the network 1 2. In one or more embodiments, computer system 100 is configured as a client in a client-server system in which the at least one other computer comprises one or more servers. Additional computer systems 100, any of which may be configured as a work station and/or client computer, may communicate with the at least, one computer 144 and/or another computer system 100 over the network 142.

Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may be found in U.S. Patent Application Serial No. 13/31 1 ,335, filed on December 5, 201 1 , which claims priority to Provisional Application No.

61/419,632 filed on December 3, 2010, and to Provisional Application No, 61/482,797 filed on May 5, 201 1 , the entirety of each of which is incorporated herein by reference.

Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may also be found in U.S. Patent Application Serial No, 13/31 1 ,2.89, filed on December 5, 201 1 , which claims priority to Provisional Application No. 61/419,603 filed on December 3, 2010, the entirety of each of which is incorporated herein by reference.

SECTION I

Sequential Aspect

in a specific aspect of embodiments of the present invention, a parallel Image-to-Mesh Conversion (12M) algorithm with quality and fidelity guarantees achieved by dynamic point insertions and removals is provided, Starting directly from an image, it is able to recover the surface and mesh the volume with tetrahedra of good shape, The tightly-coupled shared-memory parallel speculative execution paradigm employs carefully designed memory and contention managers, load balancing, synchronization and optimization schemes, while it maintains high single-threaded performance: th present single-threaded performance is faster than CGAL, the state of the art sequential Ϊ2Μ software. The meshes according to the present invention come also with theoretical guarantees: the radius-edge ratio is less than 2 and the planar angles of the boundary triangles are more than 30 degrees.

The effectiveness of the present invention is shown on Blackiight, the large cache-coherent NUMA machine of Pittsburgh Supercomputing Center. The present invention observes a more than 90% strong scaling efficiency and a super!inear weak scalirsg efficiency for up to 32 cores (2 nodes connected via one or more network switches). Due to the inherent irregularity of the application of the present invention and the latency involved in inter-node communication, the performance on more nodes deteriorates, but a strong scaling efficiency of more than 50% and a weak scaling efficiency of more than 70% for up to 64 cores (4 nodes) can be obtained.

Image-to-mesh (12 M) conversion enables patient-specific Finite Element modeling in image guided diagnosis and therapy. The ability to tessellate a medical image of multiple tissues into tetrahedra enables Finite Element (FE) Analysis on patient-specific models. This has significant implications in many areas, such as imaged-guided therapy, development of advanced patient-specific blood flow mathematical models for the prevention and treatment of stroke, patient-specific interactive surgery simulation tor training young clinicians, and study of biomechanical properties of collagen nano-straws of patients with chest wall deformities, to name just a few. A parallel high quality Delaunay Image-to-Mesh Conversion (P12M) algorithm is disclosed herein which, starting from a segmented (perhaps multi-labeled) image, recovers the isosurface 60 of the object © and meshes the volume at the same time, offering quality and fidelity guarantees, in the parallel mesh generation literature, either O is given as an initial mesh or 5© is already represented as a polyhedral domain. On the contrary, the present approach meshes both the volume and the isosurface directly from an image and not from a polyhedral domain. Therefore, the algorithm of the present invention exploits parallelism from the beginning of mesh refinement: starting directly from an image, and recovers the underlying surface and meshes the volume at the same time,

PI2M recovers the tissues' boundaries and generates quality meshes through a sequence of dynamic insertion and deletion of points in a Delaunay fashion. None of the parallel Delaunay refinement algorithms are known to support point removals. Point removal, however, offers new and rich refinement schemes which can be very effective in practice. For example, a parallel Triangulator is able to support fully dynamic insertions and removals. Therein, however, is dealt only with the conve hull of point sets ignoring quality or surface conformity. High quality and fidelity I2M is a very dynamic process that increases parallel complexity even more.

As described herein, the present invention builds on top of the Triangulator and implements a complete parallel Ϊ2Μ scheme able to recover the surfaces of the tissues and mesh the underlying volume with high quality elements,

While some have implemented a tightly-coupled method, and the inventors have also implemented a partially-coupled method for distributed systems. But both partitioned an initial mesh (using METIS or Zoltan, for instance) and then they distributed the work among processors for refinement. However, decomposition does not apply in the present invention, simply because there is no initial mesh, In fact, as described herein, the construction of such an initial mesh can be done in parallel with good speedup.

Similarly, some have started from a polygonal surface (PSLG) and offered decoupled and tightly coupled refinement schemes in 2D, respectively.

One method of the present invention is able to produce millions of high-quality elements within seconds respecting at the same time the exterior and interior boundaries of tissues. And this is possible because the input of the algorithm according to the present invention is not a given polyhedral domain. In contrast, on the fly, the zero-surface and the volume are meshed

simultaneously which offers great flexibility on controlling the fidelity-quality trade-off. A parallel Ϊ2Μ tool is not known which, starting directly from a segmented image, is able to perform the operations above at once.

The functionality of the present invention renders the described method suitable for both model visualization and real-time finite element (FE) analysis required in medical simulators, FE analysis is sensitive to the quality of tetrahedral meshes. n the literature, there are several quality metrics used to improve mesh quality such as aspect ratio, size, scaled Jacobian, dihedral angles and others, as referenced herein. As metrics of interest, the present invention chooses aspect ratio, radius- edge ratio, size, and dihedral angles, because they are shown to improve the speed and robustness of medical application solvers dealing with isotropic materials.

For the parallel implementation, the present invention follows a tightly-coupled shared memory algorithmic model. Tightly-coupled, in order to develop a parallel framework able to accept various refinement schemes with minimal code change. The shared memory programming model fits best the forthcoming many-core chips demonstrating tens of cores. In one aspect, the present invention targets distributed shared-memory (DSM) machines equipped with both few and many more cores.

Delaunay refinement is a highly irregular and latency-bound application and is by far more dynamic in terms of resource management than dense matrix multiplication, FFT, structured adaptive mesh refinement, or N-body applications. The present invention shows the effectiveness of PI2M on Pittsburgh Supercomputing Center's Blaekiight Blacklight is a 2,046 hardware enabled cache- coherent NUMA (cc-NUMA) machine.

The parallel implementation according to the present invention employs suitable memory management schemes, light-weight contention policies, custom load balancing techniques, and well- designed optimizations, as justified by extensive case studies and hardware event counting. The single-threaded version of the algorithm according to one embodiment of the present invention, although it introduces extra synchronization and communication overhead (that is needed in order to accommodate more than one thread), it is highly optimized. In one aspect, this version of the algorithm of the present invention is consistently more than 30% faster than CGAL, the fastest open source mesh generation software known.

in summary, the method presented here (or P12M) exhibits fast single-threaded performance, supports parallel Delaunay-based insertions and removals, conforms to user-specified quality metrics, recovers and meshes the isosurface with topological and geometric guarantees from the beginning of mesh refinement, and thus exploits more parallelism, allows the integration of any parallel refinement strategy, and fills the volume of the domain O with millions of high-quality tetrahedra within seconds, i.e., in real-time,

There is extensive previous work on parallel mesh generation, including various techniques, such as: Delaunay, Octree, or Advancing Front meshing. One of the main differences between the method of the present invention and those techniques is that they either have the surface of the domain given as a polyhedron, or the extraction of the polyhedron is done sequentially, or they start by an initial background octree. As explained above, the method according to the present invention constructs the polyhedral representation of the object's surface in parallel, together with the volume meshing, thus taking advantage of another degree of parallelism.

Delaunay meshing is a popular technique for generating tetrahedral meshes, since it is able to mesh various domains such as: polyhedral domains, domains bounded by surfaces, or multi-labeled images, offering at the same time mathematical guarantees on the quality and the fidelity of the final mesh. In the literature, Delaunay refinement techniques have been employed to mesh objects whose surface consists of piecewise linear elements which are explicitly given. As discussed herein, the present approach deals with objects whose surface is a smooth 2-manifold and is not explicitly given; and it is the present invention's algorithm's responsibility to mesh both the surface and the interior of the object such that the mesh boundary describes the object surface well.

For proving geometric and topological fidelity, the present invention makes use of a sample theory. Before introducing the sample theory, some terminology is necessary,

Sample theory assumes that. © is represented as a cut function /:R³→ , such that its surface dO is defined by the set {/(p) - 0). This assumption covers a wide range of inputs, including domains enclosed by curved surfaces as well, although this paper focuses on images. Clearly, from a segmented image, the zero-surface {/ (p) ~ 0} can be easily computed by interpolating the voxel values.

Let VC R³ be a set of vertices and (V) their Delaunay triangulation. Any Delaunay triangulation satisfies the empty hall property: the circumscribing open ball (a.k.a circiimhali) of each tetrahedron in V (V) does not contain any vertex.

The voronoi point of a tetrahedron t e 2? (V) is defined as the center (a.k.a circwncenter) of fs circiimhali. The voronoi edge of a triangle /€ V (V ) is the segment containing those points of IV such that (a) they are equidistant from / 's vertices and (b) they are closer to s vertices than to any other vertex in V.

Let © be the multi-label domain to be meshed. The surface of © is denoted with 9© The restriction oft⁾ (V ) to O (denoted with £^» _{j 0} ( V)) is defined as the set of tetrahedra in the trianguiation whose circumventer lies inside O.

it can be shown that if V samples 60 sufficiently densely, then the set of boundary triangles (also referred to as restricted facets) of2?_{) 0} (V ) is a good approximation of 50 in a both topological and geometric sense. The approximation guarantees hold as long as SO does not have sharp comers. This is a reasonable assumption, since biological tissues do not exhibit sharp features on their surface. See Figures 2 & 3 for a single-tissue example, with 3(a) a sample set V of a liver's surfaec,3(b) the Delaunay trianguiation V ( V) of the samples, and 3(c) the restricted trianguiation 2>_{! 0} ( V). The same idea extends to more than one tissue as well.

As an interesting consequence of the way Z> _{t 0} ( V) is defined, only the voronoi edges of the restricted facets intersect the surface 50, a property that will be exploited below to improve quality and fidelity. For a restricted facet /, the intersection between / 's voronoi edge and 50 is denoted by

Finally, the sample theorem could be stated as follows:

Theorem 1. Let V he samples of 30, If at any point p e SO, there is a sample v e Vsuch that \v - P I < δ, then the boundary triangles oft>._Q (V) is a provably good topological and geometric approximation ofdO.

Typical values for δ are usually fractions of the local, feature size of 6 . In the present application, δ values equal to multiples of the voxel size is sufficient.

In the sequel, C (t) denotes the circumventer of tetrahedron's t circumball and } '{t) its radius. For an arbitrary point p, the function cfp (p) returns the surface point which is the closest to p.

Given a segmented image, a present algorithm starts with a literally empty mesh. It initially inserts the corners (8 vertices) of a box containing the object, and it triangulates them. The computation of such a box can be done easily by a parallel image traversal Then, it dynamically computes new points to be inserted into or removed from the mesh maintaining the Delaunay property, This process continues, until certain Rules are satisfied. These Rules enforce fidelity and quality.

Specifically, the vertices removed or inserted are divided into 3 groups: free vertices, feature vertices, and box vertices. Box vertices are those lying exactly on the faces of the bounding box. Feature vertices lie on dO. One of the goals of the refinement is to insert many feature vertices in order to conform to Theorem 1 , Parameter 6 is specified by the user and controls the density of the sampling. The smaller δ is, the better the boundary of the final mesh approximates <¾^s). Users, who desire to represent certain parts of the domain in more detail, can pass to the present algorithm their own size function sf (^»).

During the cthe presentse of refinement, one may keep track of two kinds of tetrahedra:

intersecting tetrahedra and interior tetrahedra. Intersecting elements are those whose cireutnball intersects dO and interior are those whose circumcenter lies inside the object, A refinement scheme may obeys to the following 5 Rules:

Rl : Let t be an intersecting tetrahedron and Z be equal to efp (c (t)). If Z is at a distance not closer than δ to any other feature vertex, then z is inserted.

R2: Let t he an intersecting tetrahedron and Z be equal to efp (C (t))< If >"( ./ > 2 ⁴ 6, then c (t) is inserted.

R3: Let / be a restricted facet. If either its smallest planar angle is less than 30° or a vertex of / is not. a feature vertex, then C_5Urf (/)-«s inserted.

4: If t is an interior tetrahedron whose radius-edge ratio is larger than 2, then c (t) is inserted.

R5; Lei t be an interior tetrahedron, if r(t) > sf (c (t)}, then c (t) h inserted.

Every time a Steiner vertex z, inserted into the mesh as dictated by the five rules above, happens to be a feature vertex, then all the free vertices closer than 26 to Z are deleted. Whenever there is no simplex for which Rl , R2, R3, R4, or R5 apply, the refinement process terminates. The final mesh reported is the set of tetrahedra whose circuracenters lie inside © (i.e., interior tetrahedra whose union is the restriction to ©) as explained elsewhere herein. It has been shown that the algorithm always terminates generating tetrahedra with radius-edge ratio less than two and boundary triangles with planar angles less than thirty degrees approximating the surface in a topological and geometric sense, it is the first time Delaunay refinement is shown to terminate with such a low radius- edge ratio.

The dihedral angle improvement is performed via special point rejection strategies, as described below. These strategies remove points close to surface and insert others on the surface and are shown to work efficiently in practice, in all the present experiments, all the dihedral angles were larger than four degrees,

Multithreaded Aspect

The design in Figure 4 illustrates the basic building blocks of the present multi-threaded design. The Poor Element List (PEL) contains the poor eleiBents classified by the rule that applies to them. Based on the rule that applies to a poor element, a point is computed for insertion or removal that refines t. and is passed to the Triangulator. The new and deleted cells have to be updated and the new poor elements are pushed to a PEL according to the load balancing scheme. In case of a rollback, the contention manager is invoked.

1) Poor Element. List (PEL); Each thread '/'.maintains a poor element (mulit) list (PEL.), A tetrahedron t can only reside in at most one PEL. Depending on the rule that applies for t, a specific point p is computed for insertion or removal and is passed to the Triangulator. When a certain bad element t in PEL. is examined, T. acquires it by locking its vertices, If t is already locked (by the Triangulator, see below), then T moves on to the next bad element in PEL .. If locking was successful, then a point p for insertion or removal is computed, according to the Rule that applies for t.

2) Triangulator: If the operation is an insertion, then the cavity C (p) needs to be computed and re-triangulated. The cavity is defined as the set of those elements whose eircumball contains p, i.e., they violate the Delaunay property. Computing the cavity involves two steps: (a) locating the first tetrahedron in C (p), and (b) delete the tetrahedra in C (p) and create new ones that satisfy the Delaunay property. For finding the location of the first element in the cavity, one may make use of the visibility walk. Specifically, starting from the poor element t, orientations tests are performed with respect to J?, until the first element that contains ? is reached. Having found that first element, the second step is performed via the Bowyer- Watson kernel: point p is connected with the boundary triangles that are left in the cavity. It can be shown that the new elements do not intersect each others' interior and also satisfy the Delaunay property. See Figure 5 for an illustration in two dimensions: (a) the cavity C p) of p consists of elements whose circumcircle contains p It is re-triangulated by connecting j? with the vertices of the cavity; (b) the ball (p) of J? contains the elements incident to p . This area is re-trianguialed in Delaunay fashion by the "hail elements,"

Removing p involves the computation and re-triangulation oi p ^'S ball 2 (p). This is a more challenging operation than insertion, because the re-triangulation of the hole in degenerate cases is not unique which implies the creation of illegal elements, i.e., elements that cannot be connected with the corresponding elements outside the ball. One may overcome this difficulty by computing a local

Delaunay triangulation ,_p)(or 2? for brevity) of the vertices incident to p, such that the vertices inserted earlier in the shared triangulation are inserted intot>„ first. It can be shown that these new

B

locally created ball teirahedra can always be connected back to the shared triangulation without introducing invalid elements. See Figure 5 for the illustration in two dimensions.

In order to avoid races associated with writing, reading, and deleting vertices/cells from a PEL, or the shared triangulation, any vertex touched during the operation of element location, cavity expansion, or ball filling need to be locked. One may utilize GCC's atomic built-in functions for this goal, in the case a vertex is already locked by another thread, then there is a rollback: the operation is stopped and the changes are discarded. When a rollback occurs, the thread moves on to the next bad element in the list,

3) Update Cells: When an operation is successfully completed, the new and deleted cells have to be updated. Specifically, deleted elements already in J"s PEL have to be removed from the list and the newly created elements have to be checked in order to determine whether or not a Rule applies for them and pushed back into a Poor Element List, The PEE that the new poor elements are pushed into is determined by the Load Balancer.

4) Load Balancer: When the algorithm starts, only one thread (the main thread) has elements in its PEL, as a result of the triangulation of the initial box. Clearly, the work needs to be distributed to other threads as well from the very early stage of refinement. Implemented and compared were three load balancing techniques that naturally matched the present implementation design: random work stealing, hierarchical work stealing, and a static box partitioning. Load balancing is described in detail herein.

5) Contention Manager (CM): In order to eliminate live-locks caused by repeated rollbacks, threads talk to a Contention Manager (CM), its purpose is to pause the execution of some threads making sure that at least one will do useful work so that system throughput can never get stuck, See Section VII for two approaches that not only guarantee absence of livelocks, but they are able to greatly reduce the number of rollbacks and yield a considerable speedup, even in the generation of ver small meshes. One may also show that CMs also decrease the amount of idle time, which implies a higher FLOPS/ Watt ratio. The contention manager is invoked every time there is a rollback, Evaluation

Optimizations were implemented justifying their selection with extensive experimental analysis, as described herein, along with performance evaluation between the single-threaded algorithm of the present invention and CGAL, the state-of-the art open source mesh generation tool

Optimization

A ΡΑΡΪ library was used to count TLB+LLC misses and inter-socket QPI remote accesses. Used here are both CRTC (the present cc-NUMA machine) and Blacklight. See Figure 6 for the specifications of these systems. The intra-node communication is achieved through Intel's Quick Path interconnect (QPI). QPI also enforces cache-coherence (hardware enabled) via MESIF snooping. Accesses through QPI is an order of magnitude higher than accesses to the socket's local memory (as a result of L3 cache misses). The real overhead, however, is when a socket has to talk outside its node. Blacklight's inter-blade commu ication is achieved via an imaging too, uch as the NUMAiink® 5 fat tree network available from Silicon Graphics International, Inc. Despite the network's sufficient bandwidth, there is a 2,000 cycle latency per hop, which is by far the bottleneck in communication. A goal, therefore, was to: (a) decrease TLB+LLC misses for improved intra-socket communication, and (b) to decrease the inter-socket accesses as a result of inira-blade or (worse) inter-blade

communication. Hereafter, inter-blade accesses may be referred to as remote accesses.

The main optimizations implemented include: (1) vector reuse, (2) custom memory pools, (3) class member packing, (4) local PEL cell removals, (5) replication, and (6) light-threaded memory pool interactions. The first two optimizations are extensions of the optimizations suggested by Antonopoulos, et al, so that dynamic removals can benefit as well. Figure 6, Table if summarizes the findings.

For the first the present optimizations (see Figure 6, Table 11(a)), the algorithm of the present approach ran on one and on six cores of CRTC, For each execution, the present invention reports the savings on the execution time and on the TLB+L3 misses that were observed when the optimization was turned on. This embodiment used six threads for the multi-threaded experiments, because it was desired to pin them on a single socket (of CRTC). Therefore, inter-socket communication via the Intel QPI link do not affect the timings, and thus, only measured was the impact of TLB and LLC misses. The goal of the last two optimizations is to decrease the inter-socket accesses (see Figure 6, Table 11(b)). For this experiment, four and eight sockets of Blacklight (or equivaSently were used, two and four blades) to demonstrate the improvements yielded by these optimizations, When a certain optimization is evaluated, the previously evaluated optimizations are kept on.

Note that different runs in Blaeklight might be assigned to a different CPU mask, which might be associated with a higher number of hops. Therefore, in order to get consistent results, each related run was pinned on the same physical cores.

1) Vector reuse: During refinement, certain vectors are accessed very often. There are three types of such vectors: (a) the vector that stores the ceils in a cavity to be deleted, (b) the vector that stores the ball ceils in V to be connected back to the shared triangulation, and (c) the vector that

Έ

stores the incident vertices of a vertex to be removed.

in the early stages of implementation, these vectors allocated new space every time and after the completion of the operation this space was freed. In order to decrease the number of system calls on d e shared memory (which introduces also extra overhead for synchronizing the global memory resources), each vector reserved space with size equal to a certain number of elements. When the operation completes, only the elements are discarded; the memory allocated is reused to store elements tor the next operation. When there is not enough space to accommodate the desired number of elements, the space is expanded (doubled). In order to minimize the frequency of vector expansions without introducing intense memory segmentation, some tests were performed on various images to determine the length of these vectors, Figure 7, Table 111 provides measurements for the length of the vectors frequently used during the course of refinement.

The first column corresponds to a mesh of 42,049 elements and the second to a much larger mesh of 833,860 elements. Both meshes were created for the knee atlas, Although the ma length values strongly depend on the size of the mesh, the average values are consistent and seemingly independent of the size of the mesh. (Slight change in the average number of cells in a cavity agrees with the findings in as well). Therefore, for each vector, space was initially reserved that was able to hold its average number of elements. Vector reuse boosts the speed by a little (see first row of Figure 6, Ί able 11(a)), which contradicts the finding in where a 36% improvement is reported. This difference was attributed to the fact that in 3D refinement: a) the computation involved is three times more than in 2D and b) the pointer chasing is dominating over the vector traversals, Indeed, the average size of a 3D cavity is more than 3 times larger than the size of a 2D cavity, which implies that finding the elements in a 3D cavity involves more pointer chasing. For these reasons, the improvement of vector re-use does not show.

2) Custom memory manager: Rather than allocating and de-allocating vertices or cells one at a time, the present invention maintains a memory pool for ceils and vertices for each thread. When a new element is created, it requests space from its (private) pool. If an element is deleted, its memory returns back to the pool, and therefore the memory footprint is decreased which improves both the single and the multi-threaded application, Case studies show that the number of cells is roughly 3 times more than the number of vertices created during the refinement. The present invention sets the chunk size tor ceils three times larger than that for the vertices. Setting the chunk size for the vertices o

yields the best results. See the second rows of Figure 6, Table 11(a). In contrast to vector reuse, the memory pools paid off. This is also confirmed by the misses decrease.

3) Class Member Packing: The classes that represent the cell and the vertex need to store boolean variables used for checking the status during a traversal, the invalidations, and the rule that applies for them, Instead of declaring them as boolean, they were packed in BitSets. so that every variable occupies now a single bit. This decreases the memory occupied per ceil and it also results in a more compact alignment, The size of the cell and vertex class were now reduced by 20% (88 bytes totally) and 40% (48 bytes totally), respectively, and therefore, caching is likely to be improved. Indeed, the savings in TLB+LLC misses (see third row of Figure 6, Table 11(a)) are more than those of vector reuse. Although the memory manager is on, there was still a small improvement in time,

4) Local PEL cell removals: Extra care is required when 7 deletes a cell c-j already in the PEL,; of another thread T, A first attempt was to let T . emove C from PEL_j,, This was called removal of a non-local PEL removal to distinguish it from a local PEL removal where T removes a cell from its own PEL.. This, however, although easier to implement, requires extra synchronization and communication between T. and T Specifically, T has to lock 3 PEL ceils each time it wants to visit a node, because a removal might he in progress nearby. Figure 8, however, shows that even in the case of a high number of threads (128), non-local PEL removals account for about 33%,

For this reason, T . is not allowed to remove C. from PEL., Rather, it raises c.'s invalidation j · j

flag, When T , reaches Cj before anything else, it cheeks whether or not Cj is invalidated, if yes, then it removes Cj from its own PEL and it moves on to the next bad element, The advantage of this approach is that there is no need for synchronization. The disadvantage is a possible increase in the memory footprint, since now Cj does not return its address to the memory pool for re-use immediately after it is not needed any more, but only when T finds out that it is invalidated. However, a radical increase in memory footprint was not expected as the graph in Figure 8 suggests. Indeed, the last row of Figure 6, Table 11(a) shows that local PEL removal is superior. In fact, better caching was observed which contradicts intuition, This is attributed to the fact that the memory pools are large enough to cover the need for more cells even if they are not deleted right away,

5) Replica turn: n order to decrease the number of QPI accesses, read-only data structures (the segmented image for example) were replicated per socket, taking advantage of the "distributed part" of NUMA machines' architecture. Therefore, threads of the same socket have their own copy stored in their local memory hank. As shown in Figure 6, Table 11(b), much faster executions were obtained, as a pure result of a 99% remote accesses decrease in both 32 and 64 threads (the TLB+LLC misses remained almost intact).

6) Light-threaded pool interactioss: As explained elsewhere herein, when Ί deletes an element e, e's address is returned to 27s private pool. However, it might be the case that e is allocated by a remote thread. As a result, Γ re~uses remote space. To resolve this issue, T.retums e's address to the remote thread's memory pool. This introduces an extra pool synchronization, but as Figure 6, Table 11(b) suggests, when the number of threads increases, this optimization decreased the QPI accessed and as a result the overall performance, For lower core counts, the optimization yielded worse results (see the negative percentages), because the extra intra-blade communication associated with the pool synchronization dominates. Nevertheless, since large number of cores are of interest as well, this optimization was kept on,

Single-threaded evaluation

The first sequential prototypes of these refiner techniques were initially built on top of CGAL's triangulator. CGAL is the state-of-the-art mesh generation too! widely accepted by the community, it is a well-tested and robust library, and is believed to be the fastest sequential Delaunay mesh generator which operates directly on images, CGAL incorporates many optimizations that speeds up the performance such as: Delaunay hierarchies for locating, ray shooting, occasional switch from the BW kernel to edge flipping, object counters, and so forth. Although these choices improve the performance on a single core, their paraileiization is quite difficult and might not. pay off. (For example, CGAL's reference counters render even the read accesses thread unsafe.)

See Figure 9(c), in order for the evaluation to make sense, the sizing parameters of CGAL were set to values that produced meshes of similar size with the present invention. Observe that the resulting meshes are of similar quality and that the single-threaded PI2M is 35% faster than CGAL. Notice, that since the refiner removes points (CGAL performs only insertions for refinement), the present actual ^seconc^^s |_s larger than what it. is reported in Figure 9(c),

The input image used in this study is a 48-tissue MR! knee atlas freely available in outside references. Other input images yielded the same results: PI2M generates meshes of similar quality to CGAL's and 30% faster. Figure 9(a) illustrates the mesh generated by PI2M in accordance with the present invention, and Figure 9(b) illustrates a cross-section thereof. In one embodiment, CRTC (see- Figure 6. Table Ϊ for its speci ications) was used for this case study.

The role of the Contention Manager (CM) is twofold: (a) to guarantee correctness in a multithreaded algorithm and (b) to reduce the number of rollbacks. A common pitfall of non-blocking parallel algorithms is the lack of liveloek prevention. A livelock is a system-wide, global starvation, where there is no progress: threads compute their cavity/hole, lock the con-esponding vertices, only to find out that they have to roll back again. Some efforts overlook that problem and, for that reason, invalid Delaunay elements are reported. Others have tried to solve the problem by bootstrapping: i.e., they insert the first 500, 000 vertices sequentially (they do not report the exact number of elements, but based on experience and the fact that these points tbiiowed the uniform distribution, there must be more than a million tetrahedra). Similarly, some construct first an initial mesh of 100,0(50 tetrahedra sequentially and then the work is distributed among the workers, it is believed that even in the beginning (i.e., when there are not many elements present) there is parallelism that can be exploited. Therefore, instead of bootstrapping, it was decided to follow a more dynamic (yet simple to implement and with little overhead) contention policy.

Two contention techniques were implemented: a global Contention Manager and a local Contention Manager,

A. globai-CM

Each threads T_:. tracks down its progress by computing its corresponding progress ratio prs defined as: ρη = Sopsj/Aops,, where Sopsj is the number of successful operations and Aopsj is the number of attempted operations, i.e., completed operations plus the number of operations that started but were cancelled because of a rollback, A high value of pr; means that Ί does not encounter many rollbacks and thus does useful work, if ρ drops below a threshold pr^", then the execution of Γ, is suspended (unless ail the other threads have been already suspended). If pr; exceeds a threshold pr^÷, then it randomly wakes an already suspended thread. The awakened thread sets its progress ratio to a predefined value pt. The computation of the progress ratio is performed locally by each thread without synchronization. Although the user is required to specify the values of three parameters (p , pr, pr*), timings were not observed to vary considerably for different triplets. For this study, the triplet is assigned to (0.1 , 0.2, 0.9).

B. iocal-CM

One drawback of globai-CM is that there are global variables that are accessed by all threads. For example, the suspend/awake function is implemented as a global condition variable. Also, the number of "active" threads should be tracked down in order to prevent the suspension of the last active- thread. This cannot scale well for large number of cores, Therefore, global~CM can be replaced by locai-CM: a manager with zero communication and synchronization among threads. Thread T; now counts the number of rollbacks ri it encounters that were not interrupted by any successfully operation. That is, if 7) performs a successful operation ΐ- is set to 0. When a rollback occurs, T, sleeps for r,- x f seconds (f is a parameter). Lastly, the main thread is not allowed to sleep at all. This policy is locally performed by each thread and it also guarantees the absence of livelocks. The reason is that if all threads get stuck, then they will eventually sleep for a long time, letting at least one thread (the main thread) to do some useful work, in one embodiment, setting r to the duration of 5,000 clock cycles yielded the best results.

C. Comparison

A comparison is presented in Figure 10, Table IV.

In this case study, the final mesh consists of about 80,000 tetrahedra and the number of threads is twelve. Even in this case of a very small mesh (in the literature, speedups are reported for multi-million element meshes), considerable improvements are obtained, Generating small meshes (i.e., highly contented meshes) in real time is an application best suitable to many-core chips; they can drastically benefit from contention managers. This technique speeds up also the generation of larger meshes (i.e., the ease of more cores), since the reduction of rollbacks implies more useful FLOPS earlier in the refinement.

The single threaded execution time was 1.19 seconds, in one embodiment. Launching twelve- threads without any contention policy caused a livelock, Globai-CM solved the problem immediately yielding a speedup of 1 ,9. The 8-fold speed up of locai-CM reduced not only the number of rollbacks (as compared to globai-CM), but also the number of idle seconds. This improvement may be due to three reasons: (a) there is extra overhead associated to a condition variable call than a simple sleep command, (b) threads are given the chance to sleep less time since the time they go to sleep is a multiple of //sees proportional to their failure rate, and (c) iocal-CM requires no

communication/synchronization . Other increment ratios for f (for example exponential counter) did not yield considerably different results, The strong scaling speedup ^*·*"^»* and the weak scaling efficiency ^"'^WM* achieved by PI2 of the present invention are described herein, The input image used is a CT abdominal atias obtained by IRCAD Laparoscopic Center,

Load Balancing

Recall that initially there are only six elements in the mesh (initial box elements) which are likely to belong to the main thread's PEL, Thus, load balancing is needed from the beginning of refinement.

Load imbalance is a challenging problem associated with both regular and irregular applications, and may involve work stealing, multi-list, diffusive, gradient, and re-partitioning techniques, it is believed thai these techniques behave very well (some better than others) for unsiructured mesh generation, assuming that the tasks are independent. In other referenced works, however, it is also shown that equi-distributing the tasks among processors would not yield good results for mesh generators characterized by intensive inter-task dependencies.

For the reasons above, the present approach may implement the traditional work stealing (WS) as a base load balancer. Specifically, if a poor element list (PEL) of a thread 2} is empty of elements, Ti writes its ID to the begging list, a global array that tracks down threads without work. A running thread 7}, every time it completes an operation, it gathers the newly created threads and places the ones that are poor to the PEL of the first thread 7) found in the begging list. The classification of whether or not a newly created cell is poor or not. is done by 7}.

In the lines of Figure 3 1 , the speed-up, the number of QPI accesses, and the percentage of the total number of rollbacks with respect to the total number of attempted operations are plotted.

Specifically, Figure 1 1 is a graph showing scaling performance on Blackiight achieved by Work Stealing (WS), scaling performance on Blackiight achieved by Hierarchical Work Stealing (HWS), and scaling performance on Blackiight achieved by HWS with Static image decomposition

(HWS+Static).

Optimizations

In order to decrease the communication overhead associated with remote memory accesses, the present approach implements a Hierarchical Work Stealing scheme (BWS) by taking advantage of the architecture of Blackiight, The present approach can split the begging list in three levels.

Specifically, there is a Level ! begging list per socket (abbreviated by B l ), a Leve!2 begging list per blade (abbreviated by B2), and a global LeveB begging list B3 for the whole system. When the poor element list of thread Ti is empty, it throws itself to its Bl . Bl is accessed only by the S cores (i.e., up to 16 threads if hyper-threading is enabled) of a specific socket. The last thread asking for work, however, does not use Bl , but instead, it promotes itself to the second level begging list B2. Hence, B2 is accessed by only two threads per blade (one thread per socket) at a given time. Similarly, the second thread of the "B2 group" is promoted to the last level begging list B3, in the worst case, B3 will be accessed by a thread of each blade. A running thread Ti, which completed an operation and gathered the newly poor elements, sends new work to the begging threads giving priority to the lower level begging lists first. Only when Bl and B2 of T_j are empty, T_j accesses the global begging list See the lines of Figure 1 l(a)-(c). HWS greatly reduces the number of remote accesses and the associated latency, thus achieving a better speed-up than the plain WS.

Note that threads do not ask for work until they completely run out of bad elements, H WS can be modified such that a thread asks for work, when the number of its bad elements drops below a certain threshold. This change not only did not improve performance, but it actually increased the execution time by 2 to 10 seconds, potentially for two reasons. First, extra synchronization and communication for the Poor Element Lists (PELs) is now introduced: a thread might add elements to another PEL while it is being traversed by thread 7}, Second, the improvement of the time that a thread is waiting (idle time) is diminishing, Before the modification with the threshold, no thread is busy- waiting for more than 1 % of the total execution time.

Figure 12(a) depicts a communication pattern for Hierarchical Work Stealing (HWS) and figure 12(b) depicts a communication pattern for Hierarchical Work Stealing with Decomposition (HWS+Static). Figure 12(a) depicts the cells created by each thread in different shading, including areas where intensive thread communication and synchronization take place. In certain areas, three or even four present threads meet, in order to decrease thi communication, the image is divided into structured blocks. Each block is assigned to a specific thread, The goal is to force the threads to operate only on their dedicated block as much as possible. For this reason, if a newly created element t of thread Tjs classified as poor, 7\ determines which block contains the point that has to be inserted/removed for t. Then 7\ places t to the Poor Element List of the thread of the corresponding block. Clearly, this static image decomposition is possible, only if one is given some extra knowledge about the orientation of the object, in certain respects, this is the best achievable, since the actual polyhedral surface of the represented object is not known a priori, and therefore, other more elaborate partitioning schemes cannot be applied without introducing overheads. Also, note that this decomposition does not replace HWS; rather, it augments it, since again if a thread runs out of work, it takes some from any of the three begging lists. Figure 12(b) shows that this scheme (referred to as HWS+Statie) decreased thread contention. However, its perforrnance is not better than HWS when the core count increases (see the lines of Figure 1 1 ). This result can be attributed to the fact that with many blades, the communication and synchronization overhead, associated with Tj placing t to a remote PEL, dominates. As Figure 12(b) illustrates, after two blades, the remote accesses of

HWS-t-Static are more than HWS's,

Figure 13, Table V depicts the weak scaling performance for HWS, (HWS is fester than HWS+Statie by 1 % up to 50%.) The algorithm of the present invention ran on two configurations: one thread per core (non hyper threaded configuration, or n-FIT) and two threads per core (hyper- threaded configuration or HT), Note that the size of the problem increases with respect to the number of threads, Also, the problem size increases slower than it should.

For the hyper-threading configuration (HT), the algorithm of the present invention is applied on the same problem sizes, but each core accommodates two threads. Therefore, the HT configuration involves half the blades of the corresponding n-FIT configuration, and hence, it is expected to reduce the QPI accesses. However, the 32-thread (1 blade) HT configuration involves 70% more QPI accesses than the corresponding n-HT configuration (2 blades). This is counter-intuitive (since the HT configuration has fewer blades and therefore less communication), but it can be explained by the fact that there are more threads per socket and the possibility to talk to another socket increases, in the case where the QPI accesses do decrease (by 21 % on 256 threads and by 100% on 16), the resource sharing still deteriorates performance.

The present approach thereby provides PI2M as a novel parallel Image-to-Mesh (I2JV1) Conversion tool Starting directly from a segmented 3D image, the present approach is able to recover and mesh both the isosurface of the domain with geometric and topological guarantees and the underlying volume with quality elements.

This approach shows excellent strong and weak scaling performance given the dynamic nature of its application. Specifically, the present invention observes a 91 % strong sealing efficiency and a superlinear weak scaling efficiency for up to 32 cores (2 nodes connected via network switch). For a higher number of nodes, the overhead of the irregular thread communication dominates deteriorating performance. The approach shows that 28% (16 threads) up to 33% ( 128 threads) of the total number of poor elements appeared in the mesh were not refined by the same threads that created them.

Nevertheless, it still obtained a 71% weak sealing performance on 64 cores with the non-hyper threaded Hierarchical Work Stealing configuration. Extensive case studies show that 3D Delaunay meshing involves three times more computation than 2D on the average, it is three times more memory intensive (pointer chasing), and if the point removals are taken into account, eighteen times more. This difference renders some optimizations used in 2D (for example vector re-use) redundant in the 3D setting and the need for decrease in communication more critical, In a multi-threaded environment, the present invention provides evidence that: the use of custom memory managers per thread, the decrease of the cell and vertex size (member packing), and the reduced synchronization of the threads' Poor Element Lists (local PEL cell removal) yielded a 46% total reduction of TLB and LLC misses, and as a result a 30% total reduction of the execution time. For large core counts, the present invention shows that:

replicating structures across sockets and minimizing remote memory re-use (lightweight pools) achieved a 99,3% total reduction of Intel QFI accesses, and as a result a 73% total reduction of the execution time. The QPI remote accesses are also decreased by up to 60% when it used the

Hierarchical Work Stealing (HWS) approach, an approach that takes advantage of the unique NUMA architecture of Blackligh Moreover, HWS decreased the number of rollbacks by up to 63%, reducing in this way the communication between threads. Great decrease in the number of rollbacks is also achieved by the present invention's specially designed Contention Managers (CM). The local-CM can reduce the rollbacks by 85%, and the idle time by 96%, making in this way full use of the dissipated power,

It is worth noting that the PI2M according to the present invention exhibits excellent single- threaded performance, Despite the extra overhead associated with synchronization, contention management, and load balancing, the present invention generates meshes 30% faster than CGAL and with similar quality,

The present invention thus provides a flexible 12M infrastructure able to accept any refinement strategies with minimal code change.

SECTION 2 OPTIONAL EMBODIMENT

Dihedral angle improvement

For practical purposes, the issue of sliver removal and dihedral angle improvement may be addressed. One might apply the sliver exudation technique in order to improve the dihedral angles. However, it has been shown that in most cases sliver exudation does not remove all poor tetrahedra; elements with dihedral angles less than 5 ^ϋ survive. The random perturbation technique offers very small guarantees and sometimes requires many (random) trials for the elimination of a single sliver.

A straightforward and inexpensive way to eliminate slivers is to try to split them by inserting their circumcenter, it has been shown that this technique works when the silvers are far away from the mesh boundary. However, when slivers are close to the mesh boundary, the newly inserted points alter the boundary triangles, In fact, the boundary triangles might not have their vertices on the surface any more, or might not even belong to the restricted triangulation, In this section, a point rejection strategy is described that prevents the insertion of points which hurt fidelity.

The present approach first tries to convert illegal facets to legal ones, with legal facets defined to be those restricted facets whose vertices He precisely on δΩ,. Conversely, a restricted facet with at least one vertex not lying on 3Ω is called an illegal facet.

Let/^'be an illegal facet and e its voronoi edge (see Figure 21 (a)). Recall that e has to intersect di^'l (see Section 2) at a point p. Any vertex v of/ which do not lie precisely on d is deleted from the triangulation, while point p is inserted,

in addition, the present approach tries to keep in the Delaunay triangulation as many legal facets as possible. Let c be the circumcenter of a sliver considered for insertion, if the insertion of c eliminates a legal facet / (see Figure 21(b)), then c is not inserted, instead, a point p on the intersection of di and f ^'s voronoi edge e is inserted.

in summary, one may cope with slivers by optionally augmenting the method with the following two rules:

R6: If an illegal facet ^"' ppears, then all its vertices that do not lie on the surface are deleted, and a point p on Vor if) Γβ'Ω is inserted (See Figure 2 1 ( a)).

R7: Let i be a sliver and c its circumcenter, if c eliminates a legal facet f, then c is rejected. Instead , a point p on Not if) Π6Ώ is inserted (See Figure 12(b)).

Shown in Figure 21 are the point rejection strategies: (a) /is an illegal facet; (b)/is a legal facet.

Slivers are defined via the optimization metric ?/. Specifically, for a tetrahedron t, η(ί)= , where v is the volume of t, and I,· are the lengths of is edges, η was chosen because its computation is robust even when t is an almost flat element, It has been proved that 0 < η (£) < 1 . Moreover, η is 0 for a flat element, and I for a regular tetrahedron.

Tetrahedron t is considered to be a sliver if η(ί) is less than 0.06, The reason this value was chosen is because: (a) it introduces a small size increase (about 15%) over the mesh obtained without the present silver removal heuristic (he,, without rules R6 and R7), and (b) it introduces a negligible time overhead. In the Experimental section it is shown that this 0,06 bound corresponds to meshes consisting of tetrahedra with dihedral angles between 4.6 " and 171 ° ,

Note that. R6 and R7 never remove feature vertices: on the contrary, they might insert more to "protect" the surface. Hence, they do not violate Theorem 4: the mesh boundary continues being equal to the restricted Delaunay triangulation ϋϊδΩ (V ΓΊ ¾ ), and therefore a good approximation of the surface. In order not to compromise termination (and the guarantees are given for the radius-edge ratio and the boundary planar angles), if R6 or R.7 introduce an edge shorter than the shortest edge already present in the mesh, then the operation is rejected and the sliver in question is ignored.

Experimentally it. was found that the point rejection strategies were able to generate tetrahedra with angles more than 5⁰ and iess than 170° . Neither fidelity (see Theorem 4) nor termination (see Theorem 2) was compromised with this heuristic.

SECTION 3

An additional embodiment of the approach for sequential Delaunay refinement for smooth surfaces is presented herein. In this Section are outlined some of the aspects. Figure 22(a) shows a virtual box meshed into six tetrahedra, Figure 22(b) shows during refinement, the final mesh being gradually refine according to the Rules, Figure 22(e) shows, at the end, the set of tetrahedra! whose circumcenter lies inside © as the geometrically and topological!}' correct mesh M.

As is usually the case, it is assumed that the surface of the object dO to be meshed is a closed smooth 2-manifold, To prove that the boundary 3M of the final mesh M is geometrically and topo logically equivalent with dO, one may make use of a sample theory. Omitting the details, it can be proved that the Delaunay triangulation of a dense polntset lying precisely on the isosurface dO contains (as a subset) the correct mesh . That mesh consists of the tetrahedra t whose circumcenter c(r) lies inside 0. Formally, the sample theorem could be stated as follows:

Theorem .1. Lei V be samples of <3©. If for any point p £ dO„ there is a sample v 6 V

such that I v— p\≤ 8, then the boundary triangles of Dio (V) is a provably good topological approximation of SO. Also, the 2-sided Hausdorff distance between the mesh and dO is 0(δ").

Typical values for δ are usually fractions of the local feature size of dO. Other literature identifies well-defined 5 parameters. In this approach, δ values equal to multiples of the voxel size is sufficient.

Therefore, one of the goals of the refinement is to sample the isosurface densely enough. To achieve that, an algorithm may first constructs a virtual box which encloses O, The box is then triangulated into 6 tetrahedra, as shown in Figure 22, This is the only sequential part of the method. Next, it dynamically computes new points to he inserted into or removed from the mesh maintaining the Delaunay property, This process continues, until certain fidelity and quality criteria are met. Specifically, the vertices removed or inserted are divided into 3 groups: isosurface vertices, circumcenters, and surface-centers.

The isosurface vertices will eventually form the sampling of the surface so that Theorem 1 holds together with its theoretical guarantees about the fidelity of the mesh boundary. Let c(r) be the eircumcenter of a tetrahedron t, In order to guarantee termination, the present algorithm inserts the isosurface vertex which is the closest to c(t). in the sequel, the Closest isoSurface vertex of a point p is referred to as p E BO. The isosurface vertices (like the circumcenters) are computed during the refinement dynamically with the help of a parallel Euclidean Distance Transformation (EDT), Specifically, the EDT returns the surface voxel p which is closest to p. A surface-voxel is a voxel that lies inside the foreground and has at least one neighbor of different label. Then, one may traverse the ray pp'on small intervals and compute p dO by interpolating the positions of different labels. The density of the inserted isosurface vertices is defined by the user by a parameter 0^' > 0, A low value for 5 implies a denser sampling of the surface, and therefore, according to Theorem L a better approximation of dO.

The eircumcenter c(t) of a tetrahedron t is inserted when t has low quality (in terms of its radius-edge ratio) or because its circumradius r(t) is larger than a user-defined size function sf (·)■ Circumcenters might also be chosen to be removed, when they lie close to an isosurface vertex, because in this case termination is compromised,

Consider a facet /^' of a tetrahedron. The Voronoi edge V (f ^"} of / is the segment connecting the circumcenters of the two tetrahedra that contain / . The intersection V if ) ft dO is called a surface-center and is denoted by c_surf(f). During refinement, surface-centers are computed similarly to the isosurfaces (i.e., by traversing V (f) on small intervals and interpolating positions of different labels) and inserted into the mesh to improve the planar angles of the boundary mesh triangles and to ensure that the vertices of the boundary mesh triangles lie precisely on the isosurface.

In summary, tetrahedra and faces are refined according to the following Refinement Rules:

^» R.I : Let t be a tetrahedron whose cireumball intersects DO. Compute the closest

isosurface point z = c ft), if z is at a distance not closer

than δ to any other isosurface vertex, then z is inserted,

⁸ R2: Let t be a tetrahedron whose circumbail intersects dO. Compute the closest

isosurface point z = c ft). If its radius r(t) is larger than

2^■ δ, then c (t) is inserted.

* R3: Let f be a facet whose Voronoi edge V (f ) intersects dO at csurf (f ).

If either its smallest planar angle is less than 30° or a vertex of f is not

an isosurface vertex, then csurf (f ) is inserted.

^» R4: Let t be a tetrahedron whose circumeenter lies inside O, if its radius-edge ratio is larger than 2, then c (t) is inserted.

* R5: Let t be a tetrahedron whose circumeenter lies inside O, If its radius r(t) is larger than sf (c (t)), then c (t) is inserted.

« R.6: Let t be incident to an isosurface vertex z, All the already inserted circumcenters closer than 2δ to z are deleted.

Rules Rl and R2 are responsible for creating the appropriate dense sample so that the boundary triangles of the resulting mesh satisfies Theorem 1 and thus the fidelity guarantees. R3 and R4 deal with the quality guarantees, while R.5 imposes the size constraints of the users. R6 is needed so termination can be guaranteed. When none of the above rules applies, then refinement is complete. In previous work, the inventors showed that termination is guaranteed, the radius-edge ratio of ell elements in the mesh is less than 2, and the planar angles of the boundary mesh triangles is less than 30°.

Parallel Delaunay Refinement for Smooth Surfaces

As explained herein, before the mesh generation starts, the Euclidean Distance Transform (EDT) of the image is needed for the on-the-ily computation of the appropriate isosurface vertices. For this pre-processing step, one may make use of the publicly availabie parallel Maurer filter. It can be shown that this parallel EDT scales linearly with the respect to the number of threads. The rest of this section describes the main aspects of the present parallel code, Algorithm 1 (Figure 23) illustrates the basic building blocks of the present multi-threaded mesh-generation design. Note that the tightly-coupled paraileiization does not alter the fidelity (Theorem 1) and the quality guarantees described in the previous section.

Poor Element List (PEL)

Each thread Tt may maintain its own Poor Element List (PEL) PEL,, PEL* contains the tetrahedra that violate the Refinement Rules and need to be refined by thread Τ,· accordingly.

Operation

An operation that refines an element can be either an insertion of a point p or the removal of a vertex p. In the case of insertion, the cavity C (p) needs to be found and re-triangulated according to the well known Bovvyer- Watson kernel Specifically, C (p) consists of the elements whose circumsphere contains p. These elements are deleted (because they violate the Delaunay property) and p is connected to the vertices of the boundary of C (p) , in the case of a removal, the ball B (p) needs to be re-triangulated. This is a more challenging operation than insertion, because the re- triangulation of the bail in degenerate cases is not unique which implies the creation of illegal elements, i.e., elements that cannot be connected with the corresponding elements outside the ball. This difficulty is overcome by computing a local Delaunay triangulation (or D_s for brevity) of the vertices incident to p, such that the vertices inserted earlier in the shared triangulation are inserted into D₈ first. In order to avoid races associated with writing, reading, and deleting vertices/cells from a PEL or the shared mesh, any vertex touched during the operation of cavity expansion, or bail filling needs to be locked, GCC's atomic built-in functions are used for this goal, since they perform faster than the conventional pthread try__locks. Indeed, replacing p thread locks (the present first implementation) with GCC's atomic built-ins (current implementation) decreased the execution time by 3.6% on 1 core and by 4.2% on 12 cores.

In the case a vertex is already locked by another thread, then there is a rollback: the operation is stopped and the changes are discarded. When a rollback occurs, the thread moves on to the next bad element in its PEL.

Update new and deleted cells

After a thread T, completes an operation, new cells are created and some cells are invalidated, The new cells are those thai re-triangulate the cavity (in case of an insertion) or the bail (in case of a removal) of a point p and the invalidated cells are those that used to form the cavity or the ball of p right before the operation. T₍ determines whether a newly created element violates a rule, if it does, then Γ; pushes it back to PELi (or to another thread's PEL, see below) for future refinement, Also, Γ,· removes the invalidated elements from the PEL they have been residing in so far, which might be the PEL of another thread, To decrease the synchronization involved for the concurrent access to the PELs, if the invalidated cell c resides in another thread Tj 's PEL, , then T_s removes c from PELj only if Ί) belongs to the same socket with T, . Otherwise, 7^ raises cell c's invalidation flag, so that 7} can remove it when 7} examines c.

Load Balancer

Right after the tri angulation of the virtual box and the sequential creation of the first 6 tetrahedra, only the main thread might have a non-empty PEL, Clearly, Load Balancing is a fundamental aspect of the present implementation, A base (i.e., not optimized) Load Balancer is the classic Random Work Stealing RHW) technique, since it best fits the present implementation design, Asi optimized work stealing balancer has been implemented that takes advantage of the N^'UMA architecture and achieves an excellent performance,

if the poor element list PEL, of a thread T, is empty of elements, Γ,· "pushes back" its ID to the Begging List, a global array that tracks down threads without work, Then, 7,· is busy-waiting and can be awaken by a thread Ί) right after 7} gives some work to 2} . A running thread 7) , every time it completes an operation (i.e., a Delaunay insertion or a Delaunay removal), it gathers the newly created elements and places the ones that are poor to the PEL of the first thread , found in the begging list, The classification of whether or not a newly created cell is poor or not is done by Ύ. , Ί) also removes T_t from the Begging List,

To decrease unnecessary communication, a thread is not allowed to give work to threads, if it does not have enough poor elements in its PEL. Hence, each thread Τ_έ maintains a counter that keeps track of all the poor and valid cells that reside in PEL,- . 7'.· is forbidden to give work to a thread, if the counter is less than a threshold. That threshold is set equal to 5, since it yielded the best results.

When T. invalidates an element c or when it makes a poor element c not to be poor anymore, it decreases accordingly the counter of the thread that contains c in its PEL, Similarly, when T- gives extra poor elements to a thread, '; increases the counter of the corresponding thread, Contention Manager (CM)

In order to eliminate Iiveiocks caused by repeated rollbacks, threads talk to a Contention Manager (CM), Its purpose is to pause on run-time the execution of some threads making sure that at least one will do useful work so that system throughput can never get stuck. See Section 5 for approaches able to greatly reduce the number of rollbacks and yield a considerable speedup, even in the absence of enough parallelism. Contention managers avoid energy waste because of rollbacks and reduce dynamic power consumption, by throttling the number of threads that contend, thereby providing an opportunity for the runtime system to place some cores in deep low power states.

Contention Manager

The goal of the Contention Manager (CM) is to reduce the number of rollbacks and guarantee the absence of iiveiocks, if possible.

In this embodiment, fthe present contention techniques were implemented: the Aggressive Contention Manager (Aggressive-CM), the Random Contention Manager (Random-CM), the Global Contention Manager (Global-CM), and the Local Contention Manager (Local-CM).

The Aggressive-CM and Random~CM are non-blocking schemes, As is usually the ease for non-blocking schemes, the absence of Iiveiocks was not proven for these techniques. Nevertheless, they are useful for comparison purposes as Aggressive-CM is the simplest to implement, and

Random-CM has already been presented in the mesh generation literature,

The Global-CM is a blocking scheme and it is proven that it does not introduce any deadlock, (Blocking schemes are guaranteed not to introduce Iiveiocks),

The last one, Local-CM, is semi-blocking, that is, it has both blocking and non-blocking parts. Because of its (partial) non-blocking nature, it was found difficult to prove starvation-freedom, but it could guarantee absence of deadlocks and Iiveiocks, It should be noted, however, that the inventors have never experienced any thread starvation when using Local-CM: all threads in ail case studies made progress concurrently for about the same period of time .

Note that none of the earlier Transactional Memory techniques and the Random Contention Managers presented in the past solve the liveiock problem. In this section, it is shown that if Iiveiocks are not provably eliminated in the present application, then termination is compromised on h igh core counts.

For the next of this Section assume that (without loss of generality) each thread always finds elements to refine in its Poor Element List (PEL). This assumption simplifies the presentation of this Section, since it hides several details that are mainly related to Load Balancing. The interaction between the Load Balancing and the Contention Manager techniques does not invalidate the proofs of this section.

Aggressive-CM

The Aggressive-CM is a brute-force technique, since there is no special treatment, Threads greedily attempt to apply the operation, and in case of a rollback, they just discard the changes, and move on to the next poor element to refine (if there is any). The purpose of this technique is to show that reducing the number of rollbacks is not just a matter of performance, but a matter of correctness. Indeed, experimental evaluation (see discussion elsewhere herein) shows that Aggressive-CM very often suffers from liveiocks. andom-CM

Randon -CM has already been presented (with minor differences) in the literature and worked fairly well, i.e, no liveiocks were observed in practice. This scheme lets "randomness" choose the execution scenario that would eliminate liveiocks. This technique is implemented as well to show that the present application needs considerably more elaborate CMs. Indeed, recall that in this case, there is no much paralielism in the beginning of refinement and therefore, there is no much randomness that can be used to break the liveiock.

Each thread . counts the number of consecutive rollbacks r_f- . If r₍ exceeds a specified upper value r⁺. then F, sleeps for a random time interval t_t , If the consecutive rollbacks break because an operation was successfully finished then n is reset to 0. The time interval t_t is in milliseconds and is a randomly generated number between 1 and r⁺ , The value of r⁺ is set to 5 , Other values yielded similar results. Note that lower values for r⁺ do not necessarily imply faster executions. A low r⁺ decreases the number of rollbacks much more, but increases the number of times that a contented thread goes to sleep (for ί_έ milliseconds). On the other hand, a high r^÷ increases the number of rollbacks, but randomness is given more chance to avoid liveiocks; that is, a contented thread has now more chances to find other elements to refine before it goes to sleep (for f _έ milliseconds),

Random-CM cannot guarantee the absence of liveiocks. This randomness can rarely lead to liveiocks, but it should be rejected as it is not a valid solution. It was also experimentally verified that liveiocks are not that rare (see discussion elsewhere herein).

GIobal-CM GIobai-CM maintains a global Contention List (CL), If a thread '/'; en-counters a rollback, then it writes its id in CL and it busy waits (i.e., it. blocks). Threads waiting in CL are potentially awaken (in FIFO order) by threads that have made a lot of progress, or in other words, by threads that have not recently encountered many rollbacks. Therefore, each thread ^computes its "progress" by counting how many consecutive successful operations s_;- have been performed without an interruption by a rollbacks. If s_{ exceeds a upper value s⁺, then T₍ awakes the first thread in CL, if any. The value for 5 is set to 10. Experimentally, this was found to value yield the best results.

Giobal-CM can never create Iivelocks. because it is a blocking mechanism as opposed to random -CM which does not block any thread. Nevertheless, the system might end up to a deadlock, because of the interaction with the Load Balancing's Begging List BL (see the Load Balancer discussion above),

Therefore, at any time, the number of active thread needs to be tracked down, that is, the number of threads that do not busy_wait in either the CL or the Begging List. A thread Is forbidden to enter CL and busy_wait, if it sees that there is only one (i.e., itself ) active thread; instead, it skips CL and attempts to refine the next element in its Poor Element List. Similarly, a thread about to enter the Begging List (because it has no work to do) cheeks whether or not it is the only active thread at this moment, in which case, it awakes a thread from the CL, before it starts idling for extra work. In this simple way, the absence of iivelocks and deadlocks are guaranteed, since threads always block in case of a rollback and there will always be at least one active thread. The disadvantage of this method is that there is global communication and synchronization: the CL, and the number of active threads are global structures/variables that are accessed by all threads.

Local-CM

The local Contention Manager (iocai-CM) distributes the previously global Contention List (CL) across threads. The Contention List CL, of a thread Γ,· contains the ids of threads that encountered a rollback because of T, (i.e, they attempted to acquire a vertex already acquired by 7,· ) and now they busy wait, As above, iff, is doing a lot of progress, i.e., the number of consecutive successful operations exceed s+ , then T. awakes one thread from its local CL; .

Extra care should be taken, however, to guarantee not only the absence of Iivelocks, but also, the absence of deadlocks, It is possible that Tj encounters a rollback because of T₂ (this relationship is symbolized by writing T}→ T₂ and T₂ encounters a rollback because of Ti (i.e., T₂→ T[): both threads write their ids to the other thread's CL. and no one else can wake them up. Clearly, this dependency cycle (T_t ^» T₂ -→ Tj ) leads T[ and T₂ to a deadlock, because under no circumstances will these threads ever be awaken again.

To solve these issues, each thread is now equipped with two extra variables: conflicting Jd and busy_wait. See Figure 24 for a detailed pseudo-code of iocai-CM.

The algorithm in Figure 24 is called by a Tj every time it does not finish the operation successfully (i.e., it encounters a rollback),- Suppose T, attempts to acquire a vertex already locked by Tj (T_j → Ί) ). In this ease, T, does not complete the operation, but rather, it rollbacks by disregarding the so far changes, unlocking all the associated vertices, and finally executing the Ro!Jback__Oecured function, with conflict gjd equal to j. In other words, the conflicting id variables represent dependencies among threads: F_{ → 7} <≠>Tj ,conflicting_id ~j.

For example, if Γ,· encounters a rollback because of 7) and 7} encounters a rollback because of T_k , then the dependency path from Γ, is Γ_£· → 7} → T_k , which corresponds to the following values: T_/.conflicting Jd =j, 7},confll3cting_id = k, T_k .conflicting id = - ! (where -1 denotes the absence of dependency).

Lines 4-1.4 of Roilhack_Oceun;ed decide whether or not T_/ .should block (via busy- waiting), T_t is not. allowed to block if T_conflicting id has already decided to block (Lines 6-10), Threads communicate their decision to block by setting their busy_wait flags to true. If Tconf etmgjd .busy_wait has already been set to true, it is imperative that T_f- is not allowed to block, because it might be the case that the dependency of T forms a cycle. By not letting ,· to block, the dependency cycle "breaks". Otherwise, Ti writes its id to CL_confiiciin id (Lines 15-1 7) and loops around its busy_wait flag (Line 1 8),

The algorithm in Figure is called by a Γ, every time it completes an operation, i.e., every time Ti does not encounter a rollback. If Γ, has done a lot of progress (Lines 2-5 of

Rollback Not Occurred), then it awakes a thread 7) from its Contention List CL,- by setting Tj 's busy__wait flag to false. Therefore, 7} escapes from the loop of Line 1 in Rollback Occurred and is free to attempt the next operation.

Figure 25 illustrates possible execution scenarios for iocai-CM during six Time Steps. Below, is described in detail what might happen in each step:

• Time Step 1: All fthe present threads are making progress without any roll-backs.

Time Step 2: Ti and T4 attempted to acquire a vertex already owned by T₂ , Both T; and T₄ call the code of Figure 24. Their conflicting__id variables represent, those exact dependencies (Line 3 of Rollback Occurred).

* Time Step 3: 7_/ and T set their busy_wait flag to true (Line 12 of Rollback Occurred), they write their ids to CL (Lines 15-17), and they block via a busy__wait (Line 18).

* Time Step 4: T₂ has done lots of progress and executes the Lines 6-9 of

Rollback_Not_Occurred, awaking in this way 2

* Time Step 5: A dependency cycle is formed: T₂ ^→T₃ →T₄ ^→7j . Lines 4-14 of Rollback Occurred will determine which threads block and which ones do not. Note that the rautex locking of Lines 4-5 cannot be executed at the same time by these 3 threads, Only one thread can enter its critical section (Lines 6-14) at a time.

Time Step 6: Here it is shown that TV executed its critical section first, 7½ executed its critical section second, and T₃ was last. Therefore, TV and TV block, since the condition in Line 6 was false: their conflicting threads at that time had not set their busy_wait to true, The last thread T_s realized that its conflicting thread TV has already decided to block, and therefore, T_$ returns at Line 10, without blocking.

No te that in Time Step 6, T₂ blocks without awaking the threads in its CL, and that is why both CL₂ and CL₃ are not empty. It might be tempting to instruct a thread Ί) to awake all the threads in CL, , when 7} is about to block. This could clearly expedite things. Nevertheless, such an approach could easily cause a livelock as shown in Figure 14,

Local-CM is substantially more complex than global-CM, and the deadlock- free/livelock-free guarantees are not very intuitive. The rest of this Subsection is devoted to prove that locai-CM indeed can never introduce deadlocks or iiveiocks,

As shown in Figure 14, a thread is about to busy-wait on another thread's Contention List (CL) should not awake the threads already in its own CL. Otherwise, a livelock might happen, as illustrated in this figure, Time Step 8 leads the system to the same situation of Time Step 1 : this can be taking place for an undefined period of time with none of the threads making progress.

The following two Remarks follow directly from the definition of deadlock and livelock.

Remark I, If a deadlock arises, then there has to be a dependency cycle where all the participant threads block. Only then these blocked threads will never be awakened again. Remark 2, if a iivelock arises, then there has to be a dependency cycle where ail the participant threads are not blocked. Since all the participant threads break the cycle without making any progress, this "cycle breaking" might be happening indefinitely without completing any operations. In the only case where the rest threads of the system are blocked waiting on these participant threads' Contention Lists (or all the system's threads participate in such a cycle), then system- wide progress is indefinitely postponed.

The next Lemmas prove that in a dependency cycle, at least one thread will block and at least one thread will not block, This is enough to prove absence of deadlocks and livelocks.

Lemma 1 (Absence of deadlocks). In a dependency cycle at least one thread will not block.

Proof. For the sake..of contradiction, assume that the. threads, _{ί τ} T_t , . . . , _j participate lEa cycle, that is ff — ^» T_t >T₍ ^→T , such that all threads block. This means that ail threads evaluated Line 6 of Figure 24(c) to false. Therefore, since T_t 's conflicting id is F_£„ , right before decides to block (Line 12), T 's busy_wait flag was false, The same argument applies for ail the pairs of consecutive threads: {T_t , F_j }, {F_f , _t }, . . . , _i , T_t }. But F^_, could not have evaluated Line 6 to false, because, by the present assumption, f had already decided to block and _{

.busy_wait had been already set to true when T₍ acquired T_t 's mutex. A contradiction; ^ returns from Roilback_ Occurred without blocking.

Lemma 2 (Absence of livelocks). In a dependency cycle at least one thread will block.

Proof. For the sake of contradiction, assume that the threads F_f , F_iz , . . . . F₍ participate in a cycle, that is, F; →T_t → - · - →T_ltlt ^→T , such that ail threads do not block. This means that ali threads evaluated Line 6 of Figure 24(c) to true. Consider for example T , When T_it acquired T_i% 's mutex, it evaluated Line 6 to true. That means that 7₍ had already acquired and released its mutex having executed Line 12: a contradiction because T_t blocks.

Comparison For this case study, each CM on the CT abdominal atlas of IRC AD Laparoscopic Center was evaluated using 128 and 256 BSaeklight cores (see Figure 16, Table 2 for its specification). The final mesh consists of about 150 X 106 tetrahedra, The single-threaded execution time on Blaeklight was 1 ,080 seconds. See Figure 15, Table 1.

There are three direct sources of wasted cycles in the present algorithm, and all of them are shown in Table I of Figure 15,

» contention overhead time: it is the total time that threads spent busy-waiting on a Contention

List (or busy-waiting for a random number of seconds as is the case of Random-CM) and accessing the Contention List (in case of Global-CM),

® load balance overhead time: it is the total time that threads spent busy- waiting on the

Begging List waiting for more work to arrive (see discussion elsewhere herein) and accessing the Begging List, and

* rollback overhead time: it is the total time that threads had spent for the partial completion of an operation right before they decided that they had to discard the changes and roll back.

Observe that Aggressive-CM was stuck in a livelock on both 128 and 256 cores. It is known for sure that these were livelocks because it was found out that no tetrahedron was refined, i.e., no thread actually made any progress, in the time period of an hour.

Random-CM terminated successfully on 128 cores, but it was very slow compared to Global- CM and Local-CM. Indeed, Random-CM exhibits a large number of rollbacks that directly increases both the contention overhead and the rollback overhead. Also, since threads' progress is much slower, threads wai for extra work for much longer, a fact that also increases the load balance overhead considerably. As explained above, Random-CM does not eliminate livelocks, and this is manifested on the 256 core experiment, where a livelock did occur.

On both 128 and 256 cores, Local-CM performed better. Indeed, observe that the total overhead time is approximately twice as small as Global-CM' s overhead time, This is mainly due to the little contention overhead achieved by Local-CM, Since Global-CM maintains a global Contention List, a thread 7) waits for more time before it gets awaken from another thread for two reasons: (a) because there are more threads in front of 7} that need to be awaken first, and (b) because the

Contention List and the number of active threads are accessed by all threads which causes longer communication latencies. Although Local-CM is the fastest scheme, observe that it introduces higher number of rollbacks on 256 cores than G!obai-CM, This also justifies the increased rollback overhead (see Figure 15, Table 1 (b)). In other words, fewer rollbacks do not always imply faster executions, a fact that renders the optimization the application a challenging task. This result can be explained by the following observation: the number of rollbacks (and subsequently, the rollback overhead) and the contention overhead constitute a tradeoff. The more a thread waits in a Contention List, the more its contention overhead is, but the fewer the rollbacks it encounters are, since it does not attempt to perform any operation, Conversely, the less a thread waits in a. Contention List, the less its contention overhead is, but since it is given more chances to apply an operation, it might encounter more rollbacks. Nevertheless, Figure 15, Table 1 suggests that Local-CM does a very good job balancing this tradeoff on runtime.

Although there are other elaborate and hybrid contention techniques, none of them guarantees the absence of Hveiocks. Therefore, Local-CM was chosen because of its efficiency and correctness.

Performance

in this Section is described a load balancing optimization, and presented the strong and weak scaling performance on Blacklight. See Figure 1 6, Table 2 for its specifications.

Hierarchical Work Stealing (HWS)

in order to further decrease the communication overhead associated with remote memory accesses, a Hierarchical Work Stealing scheme (HWS) was implemented by taking advantage of the cc-NUMA architecture.

The Begging List was re- rganized into three levels; BLI , BL2, and BL3. Threads of a single socket that run out of work place themselves into the first le vel begging list BL1 which is shared among threads of a single socket. If the thread realizes that all the other socket threads wait on BL1 , it skips BL1, and places itself to BL2, which is shared among threads of a single blade. Similarly, if the thread realizes that BL2 already accommodates a thread from the other socket in its blade, it asks work by placing itself into the last level begging list BL3. When a thread completes an operation and is about to send extra work to an idle thread, it gives priority to BLI threads first, then to BI.,2, and lastly to BL3 threads. In other words, BLI is shared among the threads of a single socket and is able to accommodate up to number of threads per socket ^~ 1 idle threads (in Blacklight, that is 7 threads), BL2 is shared among the sockets of a single blade and is able to accommodate up to number of sockets per blade - 1 idle threads (in Biaeklight, that, is I thread). Lastly, BL3 is shared among all the allocated blades and can accommodate at most one thread per blade, in this way, an idle thread tends to take work first from threads inside its socket, if there is none, Γ, takes work from a thread of the other socket inside its blade, if any. Finally, if all the other threads inside Ij's blade are idling for extra work, T, places its id to BL3, asking for work from a thread of another blade.

Figure 17 shows the strong scaling experiment demonstrating both the Random Work Stealing (R.WS) load balance and the Hierarchical Work Stealing (HWS). The input image used is the CT abdominal atlas obtained from IRCAD Laparoscopic Center. The final mesh generated consists of 124 X 10⁶ elements. On a single Biaeklight core, the execution time was 1100 seconds.

Observe that the speed-up of R.WS deteriorates by a lot for more than 64 cores (see the RWS line in Figure 17(a)). In contrast, HWS manages to achieve a (slight) improvement even on 176 cores. This could be attributed to the fac that the number of inter-blade (i.e,, remote) accesses are greatly reduced by HWS (see Figure 1 (b)), since begging threads are more likely to get poor elements created by threads of their own socket, and blade first. C!early, this reduces the communication involved when a thread reads memory residing in a remote memory bank. Indeed, on 176 cores. 98,9% of all the number of times threads asked for work, they received it from a thread of their own blade, yielding a 28.8% reduction in inter-blade accesses, as Figure 17(b) shows.

Figure 17(c) shows the breakdown of the overhead time per thread for HWS across runs. Note that since th is is a strong scaling case study, the ideal behavior is a linear increase of the height of the bars with the respect to the number of threads. Observe, however, that the overhead time per thread is always below the overhead time measured on 16 threads. This means that Local-CM and the Hierarchical Work Stealing method (HWS) are able to serve threads fast and tolerate congestion efficiently on runtime.

Weak Scaling Results

In this section is presented the weak scaling performance of PJ2M on the CT abdominal atlas arid the knee atlas. Other inputs exhibit very similar results on comparable mesh sizes.

The number of tetrahedra created per second across the runs was measured. Specifically, define Elements («) and Time (?/), the number of elements created and the time elapsed, when n threads are empioved. Then, the speedup is defined as ^⁶~·^.·^·-·— ·~·^. With n threads, a perfect speedup would be equal to n.

One can directly control the size of the problem (i.e., the number of generated tetrahedra) via the parameter δ (see discussion elsewhere herein). This parameter sets an upper limit on the volume of the tetrahedra generated, With a simple volume argument, one can show that a decrease of δ by a factor of x results in an r' times increase of the mesh size, approximately.

See Figure 18, Table 3. Each reported Time is computed as the average among three runs. Although the standard deviation for up to 128 cores is practically zero on both inputs, the same does not apply for higher core counts, indeed, the standard deviation on the 144-, 160-, and 176~core executions is about 10, 15, and 29 seconds respectively, for both inputs. This behavior is attributed to the fact thai in those experiments, the network switches responsible for the cache coherency were close to the root of the fat-tree topology and therefore, they were shared among many more users, affecting in this way the timings of the present application considerably. (Note that the increased bandwidth of the upper level switches does not alleviate this problem, since the bottleneck of the present application is latency.) This conjecture agrees with the fact that the the maximum number of hops on the experiments for up to 128 cores was 3, while for 144, 160 and 176 cores, this number became 5.

Nevertheless, observe the excellent speedups for up to 128 threads. On 144 cores, an unprecedented efficiency of more than 82% was achieved, and a rate of more than 14.3 Million Elements per second for both inputs. It is worth mentioning that CGAL, the fastest sequential publicly available Isosurface-based mesh generation too!, on the same CT image input, is 81% slower than the present single-threaded performance, indeed. CGAL took 548,21 seconds to generate a similarly-sized mesh (1 ,00 X 107 tetrahedra) with comparable quality and fidelity to the presents (see Section 7 for a more thorough comparison case study). Thus, compared to CGAL, the speedup achieved on 144 cores is 751.25,

Observe, however, that this performance may deteriorate beyond this core count. It is believed that a main reason of this degradation is not the overhead cycles spent on rollbacks, contention lists, and begging lists (see discussion elsewhere herein), but the congested network responsible for the communication.

As may be seen in Figur 19, the overhead time breakdown with respect to the wall time tor the experiment on 176 cores of Figure 18, Table 3(a). A pair (x, y) tells that up to the x* second of execution, threads have not been doing useful work so far for y seconds ail together.

First, note that the total overhead time per thread increases. Since this is a weak scaling case study, the best that can happen is a constant number of overhead seconds per thread. But this is not happening. The reason is that in the beginning of refinement, the mesh is practically empty: only the six teirahedra needed to fill the virtual box are present (see Figure 22), Therefore, during the early stages of refinement, the problem does not behave as a weak scaling case study, but as a strong sealing one: more threads, but in fact the same size, which renders the present application a very challenging problem, See Figure 19 for an illustration of the 176-core experiment of Figure 18, Table 3(a), X-axis shows the wall-time clock of the execution. The Y-axis shows the total number of seconds that threads have spent on useless computation (i.e., rollback, contention, and load balance overhead, see Section 5.5) so far, cumulatively. The more straight the lines are, the more useful work the threads perform. Rapidly growing lines imply lack of parallelism and intense contention. Observe that in the first 14 seconds of refinement (Phase 1 ), there is high contention and severe load imbalance.

Nevertheless, even in this case, » 73%of the time, all 176 threads were doing useful work, i.e., the threads were working on their full capacity.

However, this overhead time increase cannot explain the performance deterioration. See for example the numbers on 176 threads of Figure 1 8, Table 3(a). 176 threads run for 181.10s each, and, on the average, they do useless work for 10,55s each, in other words, if there were no rollbacks, no contention list overhead, and no load balancing overhead, the execution time would have to be 1 8 ! , 1GS-1Q.55S =170.55s. ] 70,55s, however, is far from the ideal 90,37s (that the first column with 1 thread shows) by i 7G.55s-90.37=80.18s. Therefore, while rollbacks, contention management, and load balancing introduce a merely 10,55s overhead, the real bottleneck is the 80.18s overhead spent on memory (often remote) loads/stores, indeed, since the problem size in- creases linearly with respect to the number of threads, either the communication traffic per network switch increases across runs, or it goes through a higher number of hops (each of which adds a 2,000 cycle latency penalty), or both. It seems that after 144 cores, this pressure on the switches slows performance down. A hybrid approach able to scale for larger network hierarchies is left for future work.

Hyper-threading

Figure 26, Table 4 shows the performance achieved by the hyper-threaded version of the present code. For this ease study, the same input and parameters were used as the ones used in the experiment shown in Figure 18, Table 3(a). The only difference is that now there are twice as many threads as there were in that Table 3(a).

Since the hardware threads share the TLB, the cache hierarchy, and the pipeline, reported are the impact of hyper- threading on TLB misses, Last Level Cache (LLC) misses, and Resource stall cycles. Specifically, reported is the increase of those counters relatively to the non hyper-threaded experiment of Figure 18, Table 3(a). The reported Speedup is also relative to the non hyper-threaded experiment.

The last three rows of Figure 26, Table 4 suggest that the hyper-threaded version utilized the core resources more efficiently. Surprisingly enough, the TLB and LLC misses actually decrease (notice the negative sign in front of the percent- ages) when two hardware threads are launched per core. Also, as expected, the pipeline in the hyper-threaded version is busier executing micro-ops, as the decrease of resource stall cycles suggest.

Although hyper-threading achieves a better utilization of the TLB, LLC, and pipeline, there is a considerable slowdown after 64 cores (i.e., 128 hard- ware threads). Observe that hyper-threading expedited the execution for up to 64 cores, indeed, the hyper-threaded version is 47% faster on 64 cores compared to the non hyper- threaded version. Beyond this point, however, there is a considerable slowdown. This slowdown cannot be explained by merely the increase in the number of overhead seconds.

See for example the overhead sees per thread on 176 cores in Figure 26, Table 4. It is indeed 13 times higher than its non hyper-threaded counterpart; this is, however, expected because the size of the problem is the same but now twice as many hardware threads are used as before. If one subtracts the overhead time of the hyper-threaded version on 176 cores, one gets that for 480,83s - 143.37s = 337.46s, all hardware threads were doing useful work. But this is still way longer than the 181.10s - 10.55s - 170,55s useful seconds of the non hyper-threaded execution (see Figure 18, Table 3(a)).

This behavior is attributed to the increased communication traffic caused not by the increased problem size (as was mostly the case in the non hyper- threaded version), but by the increased number of "senders" and "receivers". That is, even though the problem size is the same, the hyper-threaded version utilizes more threads, This means that at a given moment, there will be more packages (originated by the more than before threads) in the switches waiting to be routed than before. This phenomenon increases the communication latency. It seems that the network cannot handle this pressure for more than 64 cores, or equivalent!}', 128 hardware threads. Note that this agrees with the fact that in the non hyper-threaded version, the slowdown occurred on more than 128 cores, which is again 128 threads (see Figure 18, Table 3(a)).

Single-threaded evaluation

in this section, it is shown that the single-threaded execution time of this method (PI2M), although it introduces extra overhead due to locking, synchronization, contention management bookkeeping (see discussion elsewhere herein), and hierarchical load balance (see discussion elsewhere herein), it is faster than CGAL and TetGen, the state-of-the-art open source mesh generation tools. Moreover, ΡΪ2Μ has comparable quality with CGAL and much better quality than TetGen, PI2M, CGAL, and TetGen are very robust Delaunay methods, since they all use exact predicates. Specifically, PI2M adopts the exact predicates as implemented in CGAL.

It should be mentioned that although CGAL is able to operate directly on segmented multi- tissue images (i.e., it is an Isosurface-based method), TetGen is a PLC-based method (see discussion elsewhere herein). That is, TetGen's inputs are triangulated domains that separate the different tissues. For this reason, pass to TetGen the triangulated isosurfaces as recovered by this method, and then let TetGen to fill the underlying volume,

PI2M, CGAL, and TetGen were run on two different multi-tissue 3D input images obtained from a Hospital Surgical Planning Laboratory. The first is an MR. knee-atlas and the second is a CT head-neck atlas, information about these two inputs is displayed in Figure 27, Table 5. The resulting output meshes generated by the present method PI2M are illustrated in Figure 28.

Figure 29, Table 6 shows timing and quality statistics for P12 , CGAL, and TetGen. CR.TC (see Figure 16, Table 2 for its specifications) was used for this ease study. The timings reported account for everything but for disk 10 operations. The execution time reported for ΡΪ2Μ incorporates the 1 ,9 seconds and 1.2 seconds time interval needed for the computation of the Euclidean distance transform (see discussion elsewhere herein) for the knee atlas and the head-neck atlas, respectively.

The sizing parameters of CGAL and TetGen were set to values that produced meshes of similar size to those here, since generally, meshes with more elements exhibit better quality and fidelity, The achieved quality of these methods were assessed in terms of radius-edge ratio and dihedral angles. Those metrics are of importance, as they are shown to improve the speed and robustness of medical application solvers dealing with isotropic materials. Ideally, the radius-edge ratio should be low, the minimum dihedral angle should be large, and the maximum dihedral angle should be low. The smallest boundary planar angles are also reported. This measures the quality of the mesh boundary, Large smallest boundary planar angles imply better boundary quality.

PI2M, CGAL, and TetGen allow users to specify the target radius-edge ratio. Apart from TetGen, these methods also allow users to specify the target boundary planar angles. The corresponding parameters were set accordingly, so that the maximum radius-edge ratio is 2 (for PI2M, CGAL, and TetGen), and the smallest boundary planar angle is more than 30° (for PI2M and CGAL only, since TetGen does not give this parameter). Fidelity measures how well the mesh boundary represents the isosurfaces. The fidelity achieved by these methods is assessed in terms of the symmetric (double-sided) Hausdorff distance. A low Hausdorff distance implies a good representation. The Hausdorff distance for TetGen is not reported since is the triangular mesh that represents the isosurfaces is given to it as an input.

The speed of the methods above was assessed by comparing the rate of generated tetrahedra per second. Note that since the present method not only inserts but also removes points from the mesh (thus reducing the number of mesh elements), a perhaps fairer way to assess speed is to compare the rate of performed operations per second. Nevertheless, this metric is note reported for two reasons. First, a high rate of operations does not always imply a high rate of generated tetrahedra. The later, however, is the only thing that matters, since comparing the quality/fidelity achieved by meshes of very different mesh sizes makes no sense. Second, the number of removals performed by P32M accounts for only 2% over the total number of operations. Thus, the rate of generated tetrahedra is very close the rate of operations per second; indeed, and it was experimentally found out that those two rates are practically the same.

Observe that the PI2M and CGAL generate meshes of similar dihedral angles, and fidelity, but the present method is much faster, indeed, the rate of the single-threaded PI2M is 68.7% higher than CGAL on the knee atlas and more than 3 times higher on the head-neck atlas. Also note that both P12M and CGAL prove that the smallest boundary planar angles are more than 30= and that radius- edge ratio is less than 2. Due to numerical errors, however, these bounds might be smaller in practice than what theory suggests. Nevertheless, observe that PI2M yields much better boundary planar angles and radius-edge ratio than CGAL on the head-neck atlas,

TetGen is faster than PI2M only on the knee atlas by a couple of seconds. For larger meshes (as is the case with the head-neck atlas), TetGen is slower, Indeed, tor small meshes, the computation of the Euclidean Distance Transform (EDT) accounts for a considerable percentage over the total execution time, a fact that slows down the overall execution time by a lot. For example, the actual meshing time on the knee atlas was just 4.6 sees, ver close to TetGen' s time and rate. Another notable observation is thai this method generates meshes with much better dihedral angles and radius- edge ratio than TetGen. The achieved boundary planar angles are similar simply because the PLC' that is given to TetGen was in fact the triangular boundary mesh of ΡΪ2Μ.

Discussion, Conclusions

Disclosed is an embodiment of ΡΪ2Μ: the first parallel Iniage-to-Mesh (P12M) Conversion Isosurface-based algorithm and its implementation, Starting directly from a multi-label segmented 3D image, it. is abie to recover and mesh both the isosurface do with geometric and topological guarantees (see Theorem 1) and the underlying volume © with quality elements.

This work is different from parallel Triangulator,, since parallel mesh generation and refinement focuses on the quality of elements (tetrahedra and facets) and the conformal representation of the tissues' boundaries/isosurfaces by computing on demand the appropriate points for insertion or deletion. Parallel Trianguiators tessellate only the convex hull of a set of points,

The present tightly-coupled method greatly reduces the number of rollbacks and scales up to a much higher core count, compared to the tightly-coupled method the present group developed in the past. The data decomposition method does not support Delaunay removals, a technique that is shown to be effective in the sequential mesh generation literature. The extension of partially-coupled and decoupled methods to 3D is a very difficult task, since Delaunav-admissible 3D medial decomposition is an unsolved problem. On the contrary, the present method does not rely on any domain decomposition, and could be extended to arbitrary dimensions as well, indeed, it is planned to extend PI2M to 4 dimensions and generate space- time elements (needed for spatio-temporal simulations) in parallel, thus, exploiting parallelism in the fourth dimension.

The present approach is highly optimized through carefully designed contention managers, and load balancers which take advantage of NUMA architectures. Aspects of the Global Contention Manager (Global-CM) and Local Contention Manager (Local-CM) provably eliminate deadlocks and livelocks. They achieve a speedup even on 256 cores, when other traditional contention managers, found in the mesh generation literature, fail to terminate. Local-CM also reduced the number of overhead cycles by a factor of 2 compared to the Global-CM on 256 cores, improving energy- efficiency by avoiding energy waste because of rollbacks. Lastly, the Hierarchical Work Stealing load balancer (HWS) sped up the execution by a factor of 1.45 on 176 cores, as a result of a 22.8% remote accesses reduction.

All in all, PI2M achieves a strong scaling efficiency of more than 82% on 64 cores. It also achieves a weak scaling efficiency of more than 82% on up to 144 cores. The inventors are not aware of any 3D parallel Delaunay mesh refinement algorithm achieving such a performance.

it is worth noting that PI2M exhibits excellent single-threaded performance. Despite the extra overhead associated with synchronization, contention management, and load balancing, PI2M generates meshes 40% faster than CGAL and with similar quality. Moreover, P12M achieves better quality than TetGen, and it is also faster than TetGen for large mesh sizes.

Recall that in the present method, threads spend time idling on the contention and load balancing lists. And this is necessary in the present algorithm for correctness and performance efficiency. This fact offers great opportunities to control the power consumption, or alternatively, to maximize the ^i^^™- ratio. Since idling is not the time critical component, the CPU frequency could be decreased during such an idling.

As explained, for core counts higher than 144, weak scaling performance deteriorates because communication traffic (per switch) is more intense and passes through a larger number of hops. In the future, scalability may be increased by employing a hierarchically layered (distributed and shared memory) implementation design and combine this tightly-coupled method with the decoupled and partially coupled methods.

While the invention has been described and illustrated with reference to certain preferred embodiments herein, other embodiments are possible. Additionally, as such, the foregoing illustrative embodiments, examples, features, advantages, and attendant advantages are not meant to be limiting of the present invention, as the invention may be practiced according to various alternative embodiments, as well as without necessarily providing, for example, one or more of the features, advantages, and attendant advantages that may be provided by the foregoing illustrative

embodiments.

Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.

Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. Software and other modules may be accessible via local memory, via a network, via a browser or other application in an ASP context, or via other means suitable for the purposes described herein. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure, including the Figures, is implied. In many cases the order of process steps may be varied, and various illustrative steps may be combined, altered, or omitted, without changing the purpose, effect or import of the methods described.

Accordingly, while the invention has been described and illustrated in connection with preferred embodiments, many variations and modifications as will be evident to those skilled in this art may be made without departing from the scope of the invention, and the Invention is thus not to be limited to the precise details of methodology or construction set forth above, as such variations and modification are intended to be included within the scope of the invention. Therefore, the scope of the appended claims should not be limited to the description and illustrations of the embodiments contained herein.

Claims

CLAIMS:

WHAT IS CLAIMED IS:

1. A method for generating mesh from an image, the method being implemented by a computer system operatively connected to at least one data storage device for storing image data for images, at least one computer, and at least one computer readable medium storing thereon computer code which when executed by the at least one computer performs the method, the method comprising: receiving a digital image of an object © having a volume and a surface dO; applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral ; processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm (PI2M) for processing the digital image; and wherein the system is configured to process the digital image without surface extraction preprocessing,

2. The method according to claim 1 , wherein the method comprises: recovering boundaries for the digital image; and generating meshes through a sequence of dynamic insertion and deletion of points in a Delaunay fashion until circumcenters of all tetrahedra t lie inside ©.

3. The method according to claim 1 , wherein the image is a three-dimensional medical image,

4. The method of claim 2, wherein the processing of a parallel real-time Image-to-Mesh conversion is by providing a plurality of parallel processing threads T_h each of threads 7} having a poor element list of tetrahedrons t requiring refining.

5. The method of claim 2, wherein the meshes comprise a volume mesh and an isosurfaee mesh of the object 0 derived from the image and not from a polyhedral domain.

6, The method of claim 5, wherein δ is a predetermined sampling density, the meshes include a plurality of tetrahedrons t,f is a facet, and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes by; inserting z if z is at a distance not closer than δ to any other isosurfaee vertex, for intersecting tetrahedron i and wherein z is equal to cfp (c (t));

inserting c(t) if radius r(t) > 2 · δ for intersecting tetrahedron t and wherein z is equal to cfp (c

(0);

inserting c _sulf (/) wherein / is a restricted facet and either a smallest planar angle of is less than 30° or a vertex of is not a feature vertex;

inserting e(t) for interior tetrahedron t if r(r) > a user defined size function sf (c (<')); and deleting free vertices closer than 2 · δ to z whenever a Steiner vertex z is inserted into a mesh.

7. The method of claim 6, wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises further refining the meshes by: deleting, for any illegal facet, all vertices of the illegal facet that do not lie on the surface, and inserting a point p on Vor(f) f) dO;

rejecting, any sliver eircum center that eliminates a facet/that is legal, and inserting a point p on Vor(f) ft dO.

8, The method of claim 6, wherein the processing of a parallel real-time Image-to-Mesh conversion is by providing a plurality of parallel processing threads T_h each of threads T_t having a poor element list of tetrahedrons ι requiring refining.

9. The method of claim 8, further comprising the step of finding and triangulating cavities C of insertions.

10. The method of claim 8 further comprising balancing processing load among threads I- by one or more of random work stealing, hierarchical work stealing and static box partitioning.

1 1. The method of claim 8, further comprising managing contention by pausing at least one of the plurality of processing threads T_h without livelocks in the remaining threads,

12. The method of claim 8, further comprising locally managing contention by designating one of the plurality of processing threads Ί), as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads T, without livelocks.

134. A non-transitory recording medium storing a program that instructs a computer to generate a mesh from an image, the method comprising: receiving a digital image of an object © having a volume and a surface d O; applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral t; processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm (P12M) for processing the digital image; wherein the system is configured to process the digital image without surface extraction preprocessing.

14. A system for performing a process on an image, the process being implemented by a computer system operatively connected to at least one data storage device in which is stored image data, and comprising at least one computer and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs a method, the method comprising: receiving a digital image of an object © having a volume and a surface σ G; applying a virtual box of eight vertices to the image of object © and triangulating the vertices into a plurality of tetrahedral r; processing a parallel real-time Image~to-Mesh conversion using an isosurface based algorithm (ΡΪ2Μ) for processing the digital image; and wherein d e system is configured to process the digital image without surface extraction preprocessing.

I S. The system of claim 14, wherein the method comprises: recovering boundaries for the digital image; and generating meshes through a sequence of dynamic insertion and deletion of points in a Delaunay fashion until eircumcenters of all tetrahedra t lie inside 0.

16, The system of claim 14, wherein the image is a three-dimensional medical image.

17, The system of claim 1 5, wherein the processing of a parallel real-time Image-to-Mesh conversion is by providing a plurality of parallel processing threads T each of threads Ί) having a poor element list of tetrahedrons / requiring refining.

18, The system of claim 15, wherein the meshes comprise a volume mesh and an isosurface mesh of the object © derived from the image and not from a polyhedral domain.

19. The system of claim 18, wherein δ is a predetermined sampling density, the meshes include a plurality of tetrahedrons t, and wherein the dynamic insertion and deletion of points in a Delaurtay fashion comprises refining the meshes by: inserting z if z is at a distance not closer than δ to any other isosurface vertex, for intersecting tetrahedron t and wherein z is equal to cfp (e (t));

inserting c(t) if radius r(t) > 2 · 5 for intersecting tetrahedron t and wherein z is equal to cfp (c

(0);

inserting c _suff {/) wherein / is a restricted facet and either a smallest planar angle of / is less than 30° or a vertex of / is not a feature vertex;

inserting c(f) for interior tetrahedron t having a radius-edge ratio larger than 2;

inserting c(r) for interior tetrahedron t if r(f) > a user defined size function sf (c(?)); and deleting free vertices closer than 2 · δ to z whenever a Steiner vertex z is inserted into a mesh.

20. The system of claim 19, wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises further refining the meshes by: deleting, for any illegal facet, all vertices of the illegal facet that do not lie on the surface, and inserting a point /? on Vor(f) n o'O;

rejecting, any sliver cireumeenter that eliminates a facet /that is legal, and inserting a point p on Vor(f) n 30.

21. The system of claim 1 9, wherein the processing of a parallel reai-time Image-to-Mesh conversion is by providing a plurality of parallel processing threads Tj, each of threads Ί) having a poor element list of tetrahedrons requiring refining,

22. The system of claim 21 , further comprising the step of finding and triangulating cavities C of insertions.

23. The system of claim 21 further comprising balancing processing load among threads 7; by one or more of random work stealing, hierarchical work stealing and static box partitioning.

24, The system of claim 21 , further comprising managing contention by pausing at least one of the plurality of processing threads T_h without Hvelocks in the reinaining threads.

25. The system of claim 21 , further comprising locally managing contention by designating one of the plurality of processing threads T_h as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads 7} without Hvelocks.