WO2014032008A2 - Procédé et système pour générer un maillage à partir d'images - Google Patents
Procédé et système pour générer un maillage à partir d'images Download PDFInfo
- Publication number
- WO2014032008A2 WO2014032008A2 PCT/US2013/056473 US2013056473W WO2014032008A2 WO 2014032008 A2 WO2014032008 A2 WO 2014032008A2 US 2013056473 W US2013056473 W US 2013056473W WO 2014032008 A2 WO2014032008 A2 WO 2014032008A2
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- threads
- mesh
- image
- thread
- vertices
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/20—ICT specially adapted for the handling or processing of medical images for handling medical images, e.g. DICOM, HL7 or PACS
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H30/00—ICT specially adapted for the handling or processing of medical images
- G16H30/40—ICT specially adapted for the handling or processing of medical images for processing medical images, e.g. editing
Definitions
- the present invention relates to solid modeling through computer graphics
- image-to-mesh (1.2 ) conversion enables patient-specific finite Element modeling in image guided diagnosis and therapy.
- the ability to tessellate a medical image of multiple tissues into teirahedra enables Finite Element (FE) Analysis on patient-specific models This has significant implications in many areas, such as imaged-guided therapy, development of advanced patient- specific blood flow mathematical models for the prevention and treatment of stroke, patient-specific interactive surgery simulation for training young clinicians, and study of biomechanical properties of collagen nano-straws of patients with chest wail deformities, to name just a few.
- FE Finite Element
- Delaunay refinement is a popular automatic mesh generation and refinement method which generates and refines meshes by present algorithms and at the same time ensures termination and good grading.
- Mesh generation by Delaunay refinement is a widely used technique for constructing guaranteed quality triangular and tetrahedral meshes. The quality guarantees are usually provided in terms of the bounds on circumradius-to-shortest edge ratio and on the grading of the resulting mesh.
- circumcenters of skinny elements and middle points of boundary faces and edges are used for the positions of inserted points.
- Delaunay refinement is very useful for parallel and/or time critical applications when human intervention and reruns are prohibitively time consuming.
- Delaunay refinement algorithms are based on the method of inserting new points into the mesh to improve the aggregate quality of elements (triangles in two dimensions or tetrahedra in three dimensions).
- Quality is traditionally defined as the ratio of the circumradius of the element to the length of its shortest edge.
- This measure leads to the improvement of the minimum angle in two dimensions which helps to improve the conditioning of the stiffness matrix used by a field solver, in three dimensions this measure does not yield such direct benefits, however, it has been shown that bounded circumradius-to-shortest edge ratio of mesh elements is sufficient to obtain optimal convergence rates for the solution of Poisson equation using the control volume method.
- one of the problems with Delaunay refinement in 3D is that it allows only for a bound on circumradius-to- shortest edge ratio of tetrahedra, which does not help to improve the dihedral angles.
- a bound represents a criterion for evaluating the quality of an element in the mesh.
- slivers can survive.
- post-processing techniques to eliminate slivers as referenced in U.S. Application Publication No. 2012/0154397. While some of them have been shown to produce very good dihedral angles in practice, no implementation is available that can guarantee significant ( and above) dihedral angle bounds.
- an I2M solution can benefit from the fact that images provide more information on their structure than general surfaces.
- Heuristic solutions to the ⁇ 2 ⁇ problem fall into two categories: (I) first coarsen the boundary of the image, and then apply CAD-based algorithms to construct the final mesh, (2) construct the mesh which covers the image, and then warp some of the mesh vertices onto the image surface.
- the first approach tries to address the fidelity before the quality requirements, while the second approach does it in reverse order, Neither of these approaches is able to guarantee the quality of elements in terms of dihedral angles. As a result, the output of one step does not produce an optimal input for the other step.
- the present disclosure is directed to a system and method therefor for generating mesh from an image.
- the present disclosure is directed to a recording medium storing a program that instructs a computer to generate a mesh from an image according to the above claimed methods.
- the method includes processing a parallel real-time image-to- Mesh conversion using an algorithm (PI2M) for processing an image; wherein the system is configured to process the image without surface extraction pre-processing,
- P2M algorithm
- the method includes recovering boundaries for an image and generating meshes through a sequence of dynamic insertion and deletion of points in a Delauna fashion.
- the image can be a three- dimensional medical image.
- the system implementing the method can be configured to process images tor at least one of finite element analysis, simulation, interpolation, computer vision, or robotics
- Embodiments include a method for generating mesh from an image, the method being implemented by a computer system operative! ⁇ ' connected to at least one data storage device for storing image data for images, at least one computer, and at least one computer readable medium storing thereon computer code which when executed by the a least one computer performs the method, in some cases, the method will include (a) receiving a digital image of an object ⁇ having a volume and a surface 9 ⁇ ;(b) applying a virtual box of eight vertices to the image of object ⁇ and triangulating the vertices into a plurality of tetrahedral t; (e) processing a parallel real-time Image-to- Mesh conversion using an isosurface based algorithm (PI2M) for processing the digital image; and (d) wherein the system is configured to process the digital image without surface extraction preprocessing.
- This method may include the steps of recovering boundaries for the digital image and generating meshes through a sequence of dynamic insertion and deletion of points in a DeSaunay fashion until circumceniers of ali teirahedra i lie inside ⁇ .
- Meshing may comprise the generation of a volume mesh and an isosurface mesh of the object ⁇ derived from the image and not from a polyhedral domain.
- the image may be a three-dimensional medical image.
- the method is adaptable to processing of the parallel real-time Image-to- Mesh conversion is by providing a plurality of parallel processing threads 7j, each of threads 7 ⁇ having a poor element list of tetrahedrons t requiring refining.
- the meshes include a plurality of tetrahedrons i, /is a facet, and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes.
- This refining can include the application of refining rules, such as: inserting z ⁇ fz is at a distance not closer than ⁇ to any other isosurface vertex, for intersecting tetrahedron and wherein z is equal to cfp (c (t));
- Additional refining rules may include: (a) deleting, for any illegal facet, all vertices of the illegal facet that do not lie on the surface, and inserting a point p on Vor(f) f) 5 ⁇ ; and (b) rejecting, any sliver circumcenter that eliminates a facet/ that is legal, and inserting a point/; on Vor(f) ⁇ dO.
- the processing of a parallel real-time Image-to-Mesh conversion may be by providing a plurality of parallel processing threads T>, each of threads ⁇ ) having a poor element list of tetrahedrons t requiring refining.
- the method may include the step of finding and triangulating cavities C of insertions.
- the method may include the balancing processing load among threads 7) by one or more of random work stealing, hierarchical work stealing and static box partitioning
- the method may include managing contention in one of a variety of means, including by pausing at least one of the plurality of processing threads without livelocks in the remaining threads, For example, contention may be managed locally by designating one of the plurality of processing threads T>, as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads T. without livelocks.
- the present approach extends to a non-transitory recording medium storing a program that instructs a computer to generate a mesh from an image, the method comprising receiving a digital image of an object ⁇ having a volume and a surface d ⁇ ; applying a virtual box of eight vertices to the image of object ⁇ and triangulating the vertices into a plurality of tetrahedral /; processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm ( ⁇ .2 ⁇ ) for processing the digital image; wherein the system is configured to process the digital image without surface extraction pre-processing.
- Such embodiments may include the variations common to such methods, including the above described optional embodiments.
- the present approach extends to a system for performing a process on an image, the process being implemented by a computer system operatively connected to at least one data storage device in which is stored image data, and comprising at least one computer and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs a method.
- the method may include ail variations disclosed herein for such method.
- the method may include the steps of: (a) receiving a digital image of an object ⁇ having a volume and a surface 3 O; (b) applying a virtual box of eight vertices to the image of object ⁇ and triangulating the vertices into a plurality of tetrahedral t; (c) processing a parallel real-time Image-to-Mesh conversion using an isosurface based algorithm (PI2M) for processing the digital image: and (d) wherein the system is configured to process the digital image without surface extraction preprocessing.
- P2M isosurface based algorithm
- FIG. 1 illustrates aspects of system and storage or media embodiments.
- FIG. 2(a) shows an image of a sample set of a liver's surface according to an embodiment.
- FIG. 2(b) shows the Deiaunay trianguiation of the samples.
- FIG. 2(c) shows the restricted trianguiation D j0 (V) according to an embodiment.
- FIG. 3(a) shows an image of a sample set of a liver's surface according to an embodiment.
- F IG. 3(b) shows the Deiaunay trianguiation of the samples.
- FIG, 3(c) shows the restricted trianguiation D jC , (V) according to an embodiment.
- FIG. 4 shows an implementation design
- FIGS. 5(a) and (b) show two-dimensional illustrations with re-triangulated areas
- FIG. 5 shows a graph representing the number of threads against non-local PEL removals.
- FIG. 6 includes tables of specification and optimization.
- FIG. 7 is a table of mesh statistics.
- FIG. 8 is a chart of PEL removals
- FIG. 9 illustrates a mesh with statistics.
- FIG, 10 is a table of contention manager performance.
- FIG. l 1 shows scaling performance
- FIGS, 12(a) and (b) show exemplary communication patterns for embodiments of the present invention
- FIG. 13 shows data about scaling performance of Hierarchical Work. Stealing.
- FIG. 14 illustrates a thread about to busy-wait.
- FIG. 15 is a comparison among contention managers.
- FIG. 16 is a chart of specifications.
- FIG. 17 shows data about scaling performance of Random Work Stealing.
- FIG. 1 8 shows data about weak scaling performance.
- FIG. 19 is an overhead data chart.
- FIG. 20 is a chart of weak scaling performance.
- FIGS. 21(a) and (b) illustrate aspects of point rejections strategy.
- FIG. 22(a) illustrates a virtual box meshed into 6 tetrahedra.
- FIG, 22(b) illustrates a final mesh being carved by rules.
- FIG. 22(c) illustrates an end set with tetrahedrai whose circumcenter lies inside the object as a geometrically and topological! ⁇ ' correct mesh.
- FIG. 23 shows aspects of an embodiment of a parallel mesh generation algorithm.
- FIGS. 24(a)-(c) show aspects of an embodiment of a parallel mesh generation algorithm.
- FIG. 25 illustrates a contention manager time steps
- FIG. 26 is a table of thread execution data.
- FIG. 27 is a table of input image data.
- FIGS. 28(a)-(b) show output meshes.
- FIG. 29 is a table of single thread performance data
- FIG. 1 depicts an example of one such computer system 100.
- the computer system 100 further includes at least one input device 1 14 such as, e.g., a keyboard, mouse, touch pad or screen, or other selection or pointing device, at least one output device 1 16 such as, e.g., an electronic display device, at least one communications interface 1 18.
- at least one computer readable medium or data storage device 1.20 such as a magnetic disk or an optical disk and memory 122 such as Random- Access Memory (RAM), each coupled to the communications channel 1 12,
- the communications interface 1 18 may be coupled to a network 142.
- the system 100 may include multiple channels or buses 12, various arrangements of storage devices 120 and memory 122, as different units or combined units, one or more computer-readable storage medium (CRSM) readers 136, such as, e.g., a magnetic disk drive, magneto-optical drive, optical disk drive, or flash drive, multiple components of a given type, e.g., processors 1 10, input devices 1 14, communications interfaces 1 18, etc.
- CRSM computer-readable storage medium
- computer system 100 communicates over the network 142 with at least one computer 144, which may comprise one or more host computers and/or server computers and/or one or more other computers, e.g. computer system 100, performing host and/or server functions including web server and/or application server functions.
- a database 146 is accessed by the at least one computer 144.
- the at least one computer 144 may include components as described for computer system 100, and other components as is well known in the computer arts.
- Network 142 may comprise one or more LANS, WANS, intranets, the Internet, and other networks known in the art.
- computer system 100 is configured as a workstation that communicates with the at least one computer 144 over the network 1 2.
- computer system 100 is configured as a client in a client-server system in which the at least one other computer comprises one or more servers. Additional computer systems 100, any of which may be configured as a work station and/or client computer, may communicate with the at least, one computer 144 and/or another computer system 100 over the network 142.
- Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may be found in U.S. Patent Application Serial No. 13/31 1 ,335, filed on December 5, 201 1 , which claims priority to Provisional Application No.
- Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may also be found in U.S. Patent Application Serial No, 13/31 1 ,2.89, filed on December 5, 201 1 , which claims priority to Provisional Application No. 61/419,603 filed on December 3, 2010, the entirety of each of which is incorporated herein by reference.
- a parallel Image-to-Mesh Conversion (12M) algorithm with quality and fidelity guarantees achieved by dynamic point insertions and removals is provided, Starting directly from an image, it is able to recover the surface and mesh the volume with tetrahedra of good shape.
- the tightly-coupled shared-memory parallel speculative execution paradigm employs carefully designed memory and contention managers, load balancing, synchronization and optimization schemes, while it maintains high single-threaded performance: th present single-threaded performance is faster than CGAL, the state of the art sequential ⁇ 2 ⁇ software.
- the meshes according to the present invention come also with theoretical guarantees: the radius-edge ratio is less than 2 and the planar angles of the boundary triangles are more than 30 degrees.
- the effectiveness of the present invention is shown on Blackiight, the large cache-coherent NUMA machine of Pittsburgh Supercomputing Center.
- the present invention observes a more than 90% strong scaling efficiency and a super!inear weak scalirsg efficiency for up to 32 cores (2 nodes connected via one or more network switches). Due to the inherent irregularity of the application of the present invention and the latency involved in inter-node communication, the performance on more nodes deteriorates, but a strong scaling efficiency of more than 50% and a weak scaling efficiency of more than 70% for up to 64 cores (4 nodes) can be obtained.
- Image-to-mesh (12 M) conversion enables patient-specific Finite Element modeling in image guided diagnosis and therapy.
- the ability to tessellate a medical image of multiple tissues into tetrahedra enables Finite Element (FE) Analysis on patient-specific models. This has significant implications in many areas, such as imaged-guided therapy, development of advanced patient-specific blood flow mathematical models for the prevention and treatment of stroke, patient-specific interactive surgery simulation tor training young clinicians, and study of biomechanical properties of collagen nano-straws of patients with chest wall deformities, to name just a few.
- a parallel high quality Delaunay Image-to-Mesh Conversion (P12M) algorithm is disclosed herein which, starting from a segmented (perhaps multi-labeled) image, recovers the isosurface 60 of the object ⁇ and meshes the volume at the same time, offering quality and fidelity guarantees, in the parallel mesh generation literature, either O is given as an initial mesh or 5 ⁇ is already represented as a polyhedral domain.
- the present approach meshes both the volume and the isosurface directly from an image and not from a polyhedral domain. Therefore, the algorithm of the present invention exploits parallelism from the beginning of mesh refinement: starting directly from an image, and recovers the underlying surface and meshes the volume at the same time,
- PI2M recovers the tissues' boundaries and generates quality meshes through a sequence of dynamic insertion and deletion of points in a Delaunay fashion. None of the parallel Delaunay refinement algorithms are known to support point removals. Point removal, however, offers new and rich refinement schemes which can be very effective in practice. For example, a parallel Triangulator is able to support fully dynamic insertions and removals. Therein, however, is dealt only with the conve hull of point sets ignoring quality or surface conformity. High quality and fidelity I2M is a very dynamic process that increases parallel complexity even more.
- the present invention builds on top of the Triangulator and implements a complete parallel ⁇ 2 ⁇ scheme able to recover the surfaces of the tissues and mesh the underlying volume with high quality elements
- PSLG polygonal surface
- One method of the present invention is able to produce millions of high-quality elements within seconds respecting at the same time the exterior and interior boundaries of tissues. And this is possible because the input of the algorithm according to the present invention is not a given polyhedral domain. In contrast, on the fly, the zero-surface and the volume are meshed
- a parallel ⁇ 2 ⁇ tool is not known which, starting directly from a segmented image, is able to perform the operations above at once.
- the functionality of the present invention renders the described method suitable for both model visualization and real-time finite element (FE) analysis required in medical simulators, FE analysis is sensitive to the quality of tetrahedral meshes.
- FE analysis is sensitive to the quality of tetrahedral meshes.
- quality metrics used to improve mesh quality such as aspect ratio, size, scaled Jacobian, dihedral angles and others, as referenced herein.
- the present invention chooses aspect ratio, radius- edge ratio, size, and dihedral angles, because they are shown to improve the speed and robustness of medical application solvers dealing with isotropic materials.
- the present invention follows a tightly-coupled shared memory algorithmic model. Tightly-coupled, in order to develop a parallel framework able to accept various refinement schemes with minimal code change.
- the shared memory programming model fits best the forthcoming many-core chips demonstrating tens of cores.
- the present invention targets distributed shared-memory (DSM) machines equipped with both few and many more cores.
- DSM distributed shared-memory
- Delaunay refinement is a highly irregular and latency-bound application and is by far more dynamic in terms of resource management than dense matrix multiplication, FFT, structured adaptive mesh refinement, or N-body applications.
- the present invention shows the effectiveness of PI2M on Pittsburgh Supercomputing Center's Blaekiight Blacklight is a 2,046 hardware enabled cache- coherent NUMA (cc-NUMA) machine.
- the parallel implementation according to the present invention employs suitable memory management schemes, light-weight contention policies, custom load balancing techniques, and well- designed optimizations, as justified by extensive case studies and hardware event counting.
- the single-threaded version of the algorithm according to one embodiment of the present invention although it introduces extra synchronization and communication overhead (that is needed in order to accommodate more than one thread), it is highly optimized.
- this version of the algorithm of the present invention is consistently more than 30% faster than CGAL, the fastest open source mesh generation software known.
- the method presented here exhibits fast single-threaded performance, supports parallel Delaunay-based insertions and removals, conforms to user-specified quality metrics, recovers and meshes the isosurface with topological and geometric guarantees from the beginning of mesh refinement, and thus exploits more parallelism, allows the integration of any parallel refinement strategy, and fills the volume of the domain O with millions of high-quality tetrahedra within seconds, i.e., in real-time,
- Delaunay meshing is a popular technique for generating tetrahedral meshes, since it is able to mesh various domains such as: polyhedral domains, domains bounded by surfaces, or multi-labeled images, offering at the same time mathematical guarantees on the quality and the fidelity of the final mesh.
- Delaunay refinement techniques have been employed to mesh objects whose surface consists of piecewise linear elements which are explicitly given.
- the present approach deals with objects whose surface is a smooth 2-manifold and is not explicitly given; and it is the present invention's algorithm's responsibility to mesh both the surface and the interior of the object such that the mesh boundary describes the object surface well.
- the present invention makes use of a sample theory. Before introducing the sample theory, some terminology is necessary,
- ⁇ is represented as a cut function /:R 3 ⁇ , such that its surface dO is defined by the set ⁇ /(p) - 0).
- This assumption covers a wide range of inputs, including domains enclosed by curved surfaces as well, although this paper focuses on images.
- the zero-surface ⁇ / (p) ⁇ 0 ⁇ can be easily computed by interpolating the voxel values.
- VC R 3 be a set of vertices and (V) their Delaunay triangulation. Any Delaunay triangulation satisfies the empty hall property: the circumscribing open ball (a.k.a circiimhali) of each tetrahedron in V (V) does not contain any vertex.
- the voronoi point of a tetrahedron t e 2? is defined as the center (a.k.a circwncenter) of fs circiimhali.
- the voronoi edge of a triangle / € V (V ) is the segment containing those points of IV such that (a) they are equidistant from / 's vertices and (b) they are closer to s vertices than to any other vertex in V.
- ⁇ be the multi-label domain to be meshed.
- the surface of ⁇ is denoted with 9 ⁇
- the restriction oft ) (V ) to O (denoted with £ » j 0 ( V)) is defined as the set of tetrahedra in the trianguiation whose circumventer lies inside O.
- Typical values for ⁇ are usually fractions of the local, feature size of 6 . In the present application, ⁇ values equal to multiples of the voxel size is sufficient.
- C (t) denotes the circumventer of tetrahedron's t circumball and ⁇ ' ⁇ t) its radius.
- the function cfp (p) returns the surface point which is the closest to p.
- a present algorithm Given a segmented image, a present algorithm starts with a literally empty mesh. It initially inserts the corners (8 vertices) of a box containing the object, and it triangulates them. The computation of such a box can be done easily by a parallel image traversal Then, it dynamically computes new points to be inserted into or removed from the mesh maintaining the Delaunay property, This process continues, until certain Rules are satisfied. These Rules enforce fidelity and quality.
- the vertices removed or inserted are divided into 3 groups: free vertices, feature vertices, and box vertices.
- Box vertices are those lying exactly on the faces of the bounding box.
- Feature vertices lie on dO.
- One of the goals of the refinement is to insert many feature vertices in order to conform to Theorem 1 , Parameter 6 is specified by the user and controls the density of the sampling. The smaller ⁇ is, the better the boundary of the final mesh approximates ⁇ 3 ⁇ 4 s ).
- Users, who desire to represent certain parts of the domain in more detail, can pass to the present algorithm their own size function sf ( » ).
- Rl Let t be an intersecting tetrahedron and Z be equal to efp (c (t)). If Z is at a distance not closer than ⁇ to any other feature vertex, then z is inserted.
- R2 Let t he an intersecting tetrahedron and Z be equal to efp (C (t)) ⁇ If >"( ./ > 2 4 6, then c (t) is inserted.
- R3 Let / be a restricted facet. If either its smallest planar angle is less than 30° or a vertex of / is not. a feature vertex, then C 5U rf (/)- «s inserted.
- Lei t be an interior tetrahedron, if r(t) > sf (c (t) ⁇ , then c (t) h inserted.
- the dihedral angle improvement is performed via special point rejection strategies, as described below. These strategies remove points close to surface and insert others on the surface and are shown to work efficiently in practice, in all the present experiments, all the dihedral angles were larger than four degrees,
- the design in Figure 4 illustrates the basic building blocks of the present multi-threaded design.
- the Poor Element List (PEL) contains the poor eleiBents classified by the rule that applies to them. Based on the rule that applies to a poor element, a point is computed for insertion or removal that refines t. and is passed to the Triangulator. The new and deleted cells have to be updated and the new poor elements are pushed to a PEL according to the load balancing scheme. In case of a rollback, the contention manager is invoked.
- PEL Poor Element. List
- PEL. Each thread '/'.maintains a poor element (mulit) list (PEL.)
- a tetrahedron t can only reside in at most one PEL.
- a specific point p is computed for insertion or removal and is passed to the Triangulator.
- T. acquires it by locking its vertices, If t is already locked (by the Triangulator, see below), then T moves on to the next bad element in PEL .. If locking was successful, then a point p for insertion or removal is computed, according to the Rule that applies for t.
- Triangulator If the operation is an insertion, then the cavity C (p) needs to be computed and re-triangulated.
- the cavity is defined as the set of those elements whose eircumball contains p, i.e., they violate the Delaunay property.
- Computing the cavity involves two steps: (a) locating the first tetrahedron in C (p), and (b) delete the tetrahedra in C (p) and create new ones that satisfy the Delaunay property. For finding the location of the first element in the cavity, one may make use of the visibility walk. Specifically, starting from the poor element t, orientations tests are performed with respect to J?, until the first element that contains ? is reached.
- the second step is performed via the Bowyer- Watson kernel: point p is connected with the boundary triangles that are left in the cavity. It can be shown that the new elements do not intersect each others' interior and also satisfy the Delaunay property. See Figure 5 for an illustration in two dimensions: (a) the cavity C p) of p consists of elements whose circumcircle contains p It is re-triangulated by connecting j? with the vertices of the cavity; (b) the ball (p) of J? contains the elements incident to p . This area is re-trianguialed in Delaunay fashion by the "hail elements,"
- Removing p involves the computation and re-triangulation oi p ' S ball 2 (p). This is a more challenging operation than insertion, because the re-triangulation of the hole in degenerate cases is not unique which implies the creation of illegal elements, i.e., elements that cannot be connected with the corresponding elements outside the ball.
- any vertex touched during the operation of element location, cavity expansion, or ball filling need to be locked.
- Update Cells When an operation is successfully completed, the new and deleted cells have to be updated. Specifically, deleted elements already in J"s PEL have to be removed from the list and the newly created elements have to be checked in order to determine whether or not a Rule applies for them and pushed back into a Poor Element List, The PEE that the new poor elements are pushed into is determined by the Load Balancer.
- Load Balancer When the algorithm starts, only one thread (the main thread) has elements in its PEL, as a result of the triangulation of the initial box. Clearly, the work needs to be distributed to other threads as well from the very early stage of refinement. Implemented and compared were three load balancing techniques that naturally matched the present implementation design: random work stealing, hierarchical work stealing, and a static box partitioning. Load balancing is described in detail herein.
- CM Contention Manager
- a ⁇ library was used to count TLB+LLC misses and inter-socket QPI remote accesses. Used here are both CRTC (the present cc-NUMA machine) and Blacklight. See Figure 6 for the specifications of these systems.
- the intra-node communication is achieved through Intel's Quick Path interconnect (QPI).
- QPI also enforces cache-coherence (hardware enabled) via MESIF snooping. Accesses through QPI is an order of magnitude higher than accesses to the socket's local memory (as a result of L3 cache misses).
- the real overhead is when a socket has to talk outside its node.
- Blacklight's inter-blade commu ication is achieved via an imaging too, uch as the NUMAiink® 5 fat tree network available from Silicon Graphics International, Inc. Despite the network's sufficient bandwidth, there is a 2,000 cycle latency per hop, which is by far the bottleneck in communication.
- a goal therefore, was to: (a) decrease TLB+LLC misses for improved intra-socket communication, and (b) to decrease the inter-socket accesses as a result of inira-blade or (worse) inter-blade
- inter-blade accesses may be referred to as remote accesses.
- the main optimizations implemented include: (1) vector reuse, (2) custom memory pools, (3) class member packing, (4) local PEL cell removals, (5) replication, and (6) light-threaded memory pool interactions.
- the first two optimizations are extensions of the optimizations suggested by Antonopoulos, et al, so that dynamic removals can benefit as well.
- Figure 6, Table if summarizes the findings.
- Vector reuse During refinement, certain vectors are accessed very often. There are three types of such vectors: (a) the vector that stores the ceils in a cavity to be deleted, (b) the vector that stores the ball ceils in V to be connected back to the shared triangulation, and (c) the vector that
- the first column corresponds to a mesh of 42,049 elements and the second to a much larger mesh of 833,860 elements. Both meshes were created for the knee atlas, Although the ma length values strongly depend on the size of the mesh, the average values are consistent and seemingly independent of the size of the mesh. (Slight change in the average number of cells in a cavity agrees with the findings in as well). Therefore, for each vector, space was initially reserved that was able to hold its average number of elements. Vector reuse boosts the speed by a little (see first row of Figure 6, ⁇ able 11(a)), which contradicts the finding in where a 36% improvement is reported.
- Custom memory manager Rather than allocating and de-allocating vertices or cells one at a time, the present invention maintains a memory pool for ceils and vertices for each thread. When a new element is created, it requests space from its (private) pool. If an element is deleted, its memory returns back to the pool, and therefore the memory footprint is decreased which improves both the single and the multi-threaded application, Case studies show that the number of cells is roughly 3 times more than the number of vertices created during the refinement. The present invention sets the chunk size tor ceils three times larger than that for the vertices. Setting the chunk size for the vertices o yields the best results. See the second rows of Figure 6, Table 11(a). In contrast to vector reuse, the memory pools paid off. This is also confirmed by the misses decrease.
- Class Member Packing The classes that represent the cell and the vertex need to store boolean variables used for checking the status during a traversal, the invalidations, and the rule that applies for them, Instead of declaring them as boolean, they were packed in BitSets. so that every variable occupies now a single bit. This decreases the memory occupied per ceil and it also results in a more compact alignment, The size of the cell and vertex class were now reduced by 20% (88 bytes totally) and 40% (48 bytes totally), respectively, and therefore, caching is likely to be improved. Indeed, the savings in TLB+LLC misses (see third row of Figure 6, Table 11(a)) are more than those of vector reuse. Although the memory manager is on, there was still a small improvement in time,
- T . is not allowed to remove C. from PEL., Rather, it raises c.'s invalidation j ⁇ j
- CGAL is the state-of-the-art mesh generation too! widely accepted by the community, it is a well-tested and robust library, and is believed to be the fastest sequential Delaunay mesh generator which operates directly on images
- CGAL incorporates many optimizations that speeds up the performance such as: Delaunay hierarchies for locating, ray shooting, occasional switch from the BW kernel to edge flipping, object counters, and so forth. Although these choices improve the performance on a single core, their paraileiization is quite difficult and might not. pay off. (For example, CGAL's reference counters render even the read accesses thread unsafe.)
- FIG. 9(a) illustrates the mesh generated by PI2M in accordance with the present invention
- Figure 9(b) illustrates a cross-section thereof.
- CRTC see- Figure 6. Table ⁇ for its speci ications
- CM Contention Manager
- the computation of the progress ratio is performed locally by each thread without synchronization. Although the user is required to specify the values of three parameters (p , pr, pr*), timings were not observed to vary considerably for different triplets. For this study, the triplet is assigned to (0.1 , 0.2, 0.9).
- globai-CM there are global variables that are accessed by all threads.
- the suspend/awake function is implemented as a global condition variable.
- the number of "active" threads should be tracked down in order to prevent the suspension of the last active- thread. This cannot scale well for large number of cores, Therefore, global ⁇ CM can be replaced by locai-CM: a manager with zero communication and synchronization among threads.
- Thread T now counts the number of rollbacks ri it encounters that were not interrupted by any successfully operation. That is, if 7) performs a successful operation ⁇ - is set to 0.
- T sleeps for r,- x f seconds (f is a parameter).
- the main thread is not allowed to sleep at all. This policy is locally performed by each thread and it also guarantees the absence of livelocks. The reason is that if all threads get stuck, then they will eventually sleep for a long time, letting at least one thread (the main thread) to do some useful work, in one embodiment, setting r to the duration of 5,000 clock cycles yielded the best results.
- the final mesh consists of about 80,000 tetrahedra and the number of threads is twelve.
- Even in this case of a very small mesh in the literature, speedups are reported for multi-million element meshes), considerable improvements are obtained,
- Generating small meshes (i.e., highly contented meshes) in real time is an application best suitable to many-core chips; they can drastically benefit from contention managers.
- This technique speeds up also the generation of larger meshes (i.e., the ease of more cores), since the reduction of rollbacks implies more useful FLOPS earlier in the refinement.
- the single threaded execution time was 1.19 seconds, in one embodiment.
- Launching twelve- threads without any contention policy caused a livelock, Globai-CM solved the problem immediately yielding a speedup of 1 ,9.
- the 8-fold speed up of locai-CM reduced not only the number of rollbacks (as compared to globai-CM), but also the number of idle seconds. This improvement may be due to three reasons: (a) there is extra overhead associated to a condition variable call than a simple sleep command, (b) threads are given the chance to sleep less time since the time they go to sleep is a multiple of //sees proportional to their failure rate, and (c) iocal-CM requires no
- Load imbalance is a challenging problem associated with both regular and irregular applications, and may involve work stealing, multi-list, diffusive, gradient, and re-partitioning techniques, it is believed thai these techniques behave very well (some better than others) for unsiructured mesh generation, assuming that the tasks are independent. In other referenced works, however, it is also shown that equi-distributing the tasks among processors would not yield good results for mesh generators characterized by intensive inter-task dependencies.
- the present approach may implement the traditional work stealing (WS) as a base load balancer.
- WS work stealing
- a poor element list (PEL) of a thread 2 ⁇ is empty of elements
- Ti writes its ID to the begging list, a global array that tracks down threads without work.
- a running thread 7 ⁇ every time it completes an operation, it gathers the newly created threads and places the ones that are poor to the PEL of the first thread 7) found in the begging list.
- the classification of whether or not a newly created cell is poor or not. is done by 7 ⁇ .
- Figure 1 1 is a graph showing scaling performance on Blackiight achieved by Work Stealing (WS), scaling performance on Blackiight achieved by Hierarchical Work Stealing (HWS), and scaling performance on Blackiight achieved by HWS with Static image decomposition
- the present approach implements a Hierarchical Work Stealing scheme (BWS) by taking advantage of the architecture of Blackiight,
- BWS Hierarchical Work Stealing scheme
- Bl Level ! begging list per socket
- B2 Leve!2 begging list per blade
- B3 global LeveB begging list
- Bl is accessed only by the S cores (i.e., up to 16 threads if hyper-threading is enabled) of a specific socket.
- the last thread asking for work does not use Bl , but instead, it promotes itself to the second level begging list B2.
- B2 is accessed by only two threads per blade (one thread per socket) at a given time.
- the second thread of the "B2 group” is promoted to the last level begging list B3, in the worst case, B3 will be accessed by a thread of each blade.
- a running thread Ti which completed an operation and gathered the newly poor elements, sends new work to the begging threads giving priority to the lower level begging lists first. Only when Bl and B2 of T j are empty, T j accesses the global begging list See the lines of Figure 1 l(a)-(c). HWS greatly reduces the number of remote accesses and the associated latency, thus achieving a better speed-up than the plain WS.
- H WS can be modified such that a thread asks for work, when the number of its bad elements drops below a certain threshold. This change not only did not improve performance, but it actually increased the execution time by 2 to 10 seconds, potentially for two reasons.
- PELs Poor Element Lists
- Figure 12(a) depicts a communication pattern for Hierarchical Work Stealing (HWS) and figure 12(b) depicts a communication pattern for Hierarchical Work Stealing with Decomposition (HWS+Static).
- Figure 12(a) depicts the cells created by each thread in different shading, including areas where intensive thread communication and synchronization take place. In certain areas, three or even four present threads meet, in order to decrease thi communication, the image is divided into structured blocks. Each block is assigned to a specific thread, The goal is to force the threads to operate only on their dedicated block as much as possible. For this reason, if a newly created element t of thread Tjs classified as poor, 7 ⁇ determines which block contains the point that has to be inserted/removed for t.
- HWS-t-Static are more than HWS's
- Table V depicts the weak scaling performance for HWS, (HWS is fester than HWS+Statie by 1 % up to 50%.)
- the algorithm of the present invention ran on two configurations: one thread per core (non hyper threaded configuration, or n-FIT) and two threads per core (hyper- threaded configuration or HT), Note that the size of the problem increases with respect to the number of threads, Also, the problem size increases slower than it should.
- the algorithm of the present invention is applied on the same problem sizes, but each core accommodates two threads. Therefore, the HT configuration involves half the blades of the corresponding n-FIT configuration, and hence, it is expected to reduce the QPI accesses.
- the 32-thread (1 blade) HT configuration involves 70% more QPI accesses than the corresponding n-HT configuration (2 blades). This is counter-intuitive (since the HT configuration has fewer blades and therefore less communication), but it can be explained by the fact that there are more threads per socket and the possibility to talk to another socket increases, in the case where the QPI accesses do decrease (by 21 % on 256 threads and by 100% on 16), the resource sharing still deteriorates performance.
- the present approach thereby provides PI2M as a novel parallel Image-to-Mesh (I2JV1) Conversion tool Starting directly from a segmented 3D image, the present approach is able to recover and mesh both the isosurface of the domain with geometric and topological guarantees and the underlying volume with quality elements.
- I2JV1 Image-to-Mesh
- This approach shows excellent strong and weak scaling performance given the dynamic nature of its application. Specifically, the present invention observes a 91 % strong sealing efficiency and a superlinear weak scaling efficiency for up to 32 cores (2 nodes connected via network switch). For a higher number of nodes, the overhead of the irregular thread communication dominates deteriorating performance. The approach shows that 28% (16 threads) up to 33% ( 128 threads) of the total number of poor elements appeared in the mesh were not refined by the same threads that created them.
- the present invention provides evidence that: the use of custom memory managers per thread, the decrease of the cell and vertex size (member packing), and the reduced synchronization of the threads' Poor Element Lists (local PEL cell removal) yielded a 46% total reduction of TLB and LLC misses, and as a result a 30% total reduction of the execution time.
- the present invention shows that:
- replicating structures across sockets and minimizing remote memory re-use (lightweight pools) achieved a 99,3% total reduction of Intel QFI accesses, and as a result a 73% total reduction of the execution time.
- the QPI remote accesses are also decreased by up to 60% when it used the
- HWS Hierarchical Work Stealing
- the PI2M according to the present invention exhibits excellent single- threaded performance, Despite the extra overhead associated with synchronization, contention management, and load balancing, the present invention generates meshes 30% faster than CGAL and with similar quality,
- the present invention thus provides a flexible 12M infrastructure able to accept any refinement strategies with minimal code change.
- the present approach first tries to convert illegal facets to legal ones, with legal facets defined to be those restricted facets whose vertices He precisely on ⁇ ,. Conversely, a restricted facet with at least one vertex not lying on 3 ⁇ is called an illegal facet.
- R7 Let i be a sliver and c its circumcenter, if c eliminates a legal facet f, then c is rejected. Instead , a point p on Not if) ⁇ 6 ⁇ is inserted (See Figure 12(b)).
- Tetrahedron t is considered to be a sliver if ⁇ ( ⁇ ) is less than 0.06, The reason this value was chosen is because: (a) it introduces a small size increase (about 15%) over the mesh obtained without the present silver removal heuristic (he,, without rules R6 and R7), and (b) it introduces a negligible time overhead. In the Experimental section it is shown that this 0,06 bound corresponds to meshes consisting of tetrahedra with dihedral angles between 4.6 " and 171 ° ,
- Figure 22(a) shows a virtual box meshed into six tetrahedra
- Figure 22(b) shows during refinement, the final mesh being gradually refine according to the Rules
- Figure 22(e) shows, at the end, the set of tetrahedra! whose circumcenter lies inside ⁇ as the geometrically and topological! ⁇ ' correct mesh M.
- Typical values for ⁇ are usually fractions of the local feature size of dO.
- Other literature identifies well-defined 5 parameters. In this approach, ⁇ values equal to multiples of the voxel size is sufficient.
- an algorithm may first constructs a virtual box which encloses O, The box is then triangulated into 6 tetrahedra, as shown in Figure 22, This is the only sequential part of the method. Next, it dynamically computes new points to he inserted into or removed from the mesh maintaining the Delaunay property, This process continues, until certain fidelity and quality criteria are met. Specifically, the vertices removed or inserted are divided into 3 groups: isosurface vertices, circumcenters, and surface-centers.
- the isosurface vertices will eventually form the sampling of the surface so that Theorem 1 holds together with its theoretical guarantees about the fidelity of the mesh boundary.
- c(r) be the eircumcenter of a tetrahedron t
- the present algorithm inserts the isosurface vertex which is the closest to c(t).
- the Closest isoSurface vertex of a point p is referred to as p E BO.
- the isosurface vertices (like the circumcenters) are computed during the refinement dynamically with the help of a parallel Euclidean Distance Transformation (EDT), Specifically, the EDT returns the surface voxel p which is closest to p.
- EDT Euclidean Distance Transformation
- a surface-voxel is a voxel that lies inside the foreground and has at least one neighbor of different label. Then, one may traverse the ray pp'on small intervals and compute p dO by interpolating the positions of different labels.
- the density of the inserted isosurface vertices is defined by the user by a parameter 0 ' > 0, A low value for 5 implies a denser sampling of the surface, and therefore, according to Theorem L a better approximation of dO.
- the eircumcenter c(t) of a tetrahedron t is inserted when t has low quality (in terms of its radius-edge ratio) or because its circumradius r(t) is larger than a user-defined size function sf ( ⁇ ) ⁇ Circumcenters might also be chosen to be removed, when they lie close to an isosurface vertex, because in this case termination is compromised,
- V (f " ⁇ of / is the segment connecting the circumcenters of the two tetrahedra that contain / .
- the intersection V if ) ft dO is called a surface-center and is denoted by c surf (f).
- surface-centers are computed similarly to the isosurfaces (i.e., by traversing V (f) on small intervals and interpolating positions of different labels) and inserted into the mesh to improve the planar angles of the boundary mesh triangles and to ensure that the vertices of the boundary mesh triangles lie precisely on the isosurface.
- R4 Let t be a tetrahedron whose circumeenter lies inside O, if its radius-edge ratio is larger than 2, then c (t) is inserted.
- Rl and R2 are responsible for creating the appropriate dense sample so that the boundary triangles of the resulting mesh satisfies Theorem 1 and thus the fidelity guarantees.
- R3 and R4 deal with the quality guarantees, while R.5 imposes the size constraints of the users.
- R6 is needed so termination can be guaranteed.
- the radius-edge ratio of ell elements in the mesh is less than 2
- the planar angles of the boundary mesh triangles is less than 30°.
- Each thread Tt may maintain its own Poor Element List (PEL) PEL,
- PEL* contains the tetrahedra that violate the Refinement Rules and need to be refined by thread ⁇ , ⁇ accordingly.
- An operation that refines an element can be either an insertion of a point p or the removal of a vertex p.
- the cavity C (p) needs to be found and re-triangulated according to the well known Bovvyer- Watson kernel Specifically, C (p) consists of the elements whose circumsphere contains p. These elements are deleted (because they violate the Delaunay property) and p is connected to the vertices of the boundary of C (p) , in the case of a removal, the ball B (p) needs to be re-triangulated.
- GCC's atomic built-in functions are used for this goal, since they perform faster than the conventional pthread try__locks. Indeed, replacing p thread locks (the present first implementation) with GCC's atomic built-ins (current implementation) decreased the execution time by 3.6% on 1 core and by 4.2% on 12 cores.
- new cells are created and some cells are invalidated,
- the new cells are those thai re-triangulate the cavity (in case of an insertion) or the bail (in case of a removal) of a point p and the invalidated cells are those that used to form the cavity or the ball of p right before the operation.
- T determines whether a newly created element violates a rule, if it does, then ⁇ ; pushes it back to PELi (or to another thread's PEL, see below) for future refinement, Also, ⁇ , ⁇ removes the invalidated elements from the PEL they have been residing in so far, which might be the PEL of another thread.
- T s removes c from PELj only if ⁇ ) belongs to the same socket with T, . Otherwise, 7 ⁇ raises cell c's invalidation flag, so that 7 ⁇ can remove it when 7 ⁇ examines c.
- Load Balancing is a fundamental aspect of the present implementation, A base (i.e., not optimized) Load Balancer is the classic Random Work Stealing RHW) technique, since it best fits the present implementation design, Asi optimized work stealing balancer has been implemented that takes advantage of the N ' UMA architecture and achieves an excellent performance,
- a running thread 7) every time it completes an operation (i.e., a Delaunay insertion or a Delaunay removal), it gathers the newly created elements and places the ones that are poor to the PEL of the first thread , found in the begging list, The classification of whether or not a newly created cell is poor or not is done by ⁇ . , ⁇ ) also removes T t from the Begging List,
- each thread ⁇ ⁇ maintains a counter that keeps track of all the poor and valid cells that reside in PEL,- . 7'. ⁇ is forbidden to give work to a thread, if the counter is less than a threshold. That threshold is set equal to 5, since it yielded the best results.
- CM Contention Manager
- CM Contention Manager
- the present contention techniques were implemented: the Aggressive Contention Manager (Aggressive-CM), the Random Contention Manager (Random-CM), the Global Contention Manager (Global-CM), and the Local Contention Manager (Local-CM).
- Aggressive-CM the Aggressive Contention Manager
- Random-CM Random Contention Manager
- Global-CM the Global Contention Manager
- Local-CM the Local Contention Manager
- the Aggressive-CM and Random ⁇ CM are non-blocking schemes, As is usually the ease for non-blocking schemes, the absence of Iiveiocks was not proven for these techniques. Nevertheless, they are useful for comparison purposes as Aggressive-CM is the simplest to implement, and
- the Global-CM is a blocking scheme and it is proven that it does not introduce any deadlock, (Blocking schemes are guaranteed not to introduce Iiveiocks),
- Local-CM Semi-blocking, that is, it has both blocking and non-blocking parts. Because of its (partial) non-blocking nature, it was found difficult to prove starvation-freedom, but it could guarantee absence of deadlocks and Iiveiocks, It should be noted, however, that the inventors have never experienced any thread starvation when using Local-CM: all threads in ail case studies made progress concurrently for about the same period of time .
- the Aggressive-CM is a brute-force technique, since there is no special treatment, Threads greedily attempt to apply the operation, and in case of a rollback, they just discard the changes, and move on to the next poor element to refine (if there is any).
- the purpose of this technique is to show that reducing the number of rollbacks is not just a matter of performance, but a matter of correctness. Indeed, experimental evaluation (see discussion elsewhere herein) shows that Aggressive-CM very often suffers from liveiocks. andom-CM
- Each thread counts the number of consecutive rollbacks r f - . If r ( exceeds a specified upper value r + . then F, sleeps for a random time interval t t , If the consecutive rollbacks break because an operation was successfully finished then n is reset to 0.
- the time interval t t is in milliseconds and is a randomly generated number between 1 and r + , The value of r + is set to 5 , Other values yielded similar results. Note that lower values for r + do not necessarily imply faster executions.
- a low r + decreases the number of rollbacks much more, but increases the number of times that a contented thread goes to sleep (for ⁇ ⁇ milliseconds).
- Random-CM cannot guarantee the absence of liveiocks. This randomness can rarely lead to liveiocks, but it should be rejected as it is not a valid solution. It was also experimentally verified that liveiocks are not that rare (see discussion elsewhere herein).
- GIobal-CM GIobai-CM maintains a global Contention List (CL), If a thread '/'; en-counters a rollback, then it writes its id in CL and it busy waits (i.e., it. blocks). Threads waiting in CL are potentially awaken (in FIFO order) by threads that have made a lot of progress, or in other words, by threads that have not recently encountered many rollbacks. Therefore, each thread ⁇ computes its "progress" by counting how many consecutive successful operations s ; - have been performed without an interruption by a rollbacks. If s ⁇ exceeds a upper value s + , then T ( awakes the first thread in CL, if any. The value for 5 is set to 10. Experimentally, this was found to value yield the best results.
- Giobal-CM can never create Iivelocks. because it is a blocking mechanism as opposed to random -CM which does not block any thread. Nevertheless, the system might end up to a deadlock, because of the interaction with the Load Balancing's Begging List BL (see the Load Balancer discussion above),
- the number of active thread needs to be tracked down, that is, the number of threads that do not busy_wait in either the CL or the Begging List.
- a thread Is forbidden to enter CL and busy_wait, if it sees that there is only one (i.e., itself ) active thread; instead, it skips CL and attempts to refine the next element in its Poor Element List.
- a thread about to enter the Begging List cheeks whether or not it is the only active thread at this moment, in which case, it awakes a thread from the CL, before it starts idling for extra work.
- the local Contention Manager distributes the previously global Contention List (CL) across threads.
- the Contention List CL contains the ids of threads that encountered a rollback because of T, (i.e, they attempted to acquire a vertex already acquired by 7, ⁇ ) and now they busy wait, As above, iff, is doing a lot of progress, i.e., the number of consecutive successful operations exceed s+ , then T. awakes one thread from its local CL; .
- Tj The algorithm in Figure 24 is called by a Tj every time it does not finish the operation successfully (i.e., it encounters a rollback),-
- T attempts to acquire a vertex already locked by Tj (T j ⁇ ⁇ ) ).
- T does not complete the operation, but rather, it rollbacks by disregarding the so far changes, unlocking all the associated vertices, and finally executing the Ro!Jback__Oecured function, with conflict gjd equal to j.
- the conflicting id variables represent dependencies among threads: F ⁇ ⁇ 7 ⁇ ⁇ >Tj ,conflicting_id ⁇ j.
- Rollback Not Occurred then it awakes a thread 7) from its Contention List CL,- by setting Tj 's busy__wait flag to false. Therefore, 7 ⁇ escapes from the loop of Line 1 in Rollback Occurred and is free to attempt the next operation.
- Figure 25 illustrates possible execution scenarios for iocai-CM during six Time Steps. Below, is described in detail what might happen in each step:
- Time Step 2 Ti and T4 attempted to acquire a vertex already owned by T 2 , Both T; and T 4 call the code of Figure 24.
- Their conflicting__id variables represent, those exact dependencies (Line 3 of Rollback Occurred).
- Time Step 6 Here it is shown that TV executed its critical section first, 71 ⁇ 2 executed its critical section second, and T 3 was last. Therefore, TV and TV block, since the condition in Line 6 was false: their conflicting threads at that time had not set their busy_wait to true, The last thread T s realized that its conflicting thread TV has already decided to block, and therefore, T $ returns at Line 10, without blocking.
- Time Step 8 leads the system to the same situation of Time Step 1 : this can be taking place for an undefined period of time with none of the threads making progress.
- Remark I If a deadlock arises, then there has to be a dependency cycle where all the participant threads block. Only then these blocked threads will never be awakened again.
- Remark 2 if a iivelock arises, then there has to be a dependency cycle where ail the participant threads are not blocked. Since all the participant threads break the cycle without making any progress, this "cycle breaking" might be happening indefinitely without completing any operations. In the only case where the rest threads of the system are blocked waiting on these participant threads' Contention Lists (or all the system's threads participate in such a cycle), then system- wide progress is indefinitely postponed.
- the next Lemmas prove that in a dependency cycle, at least one thread will block and at least one thread will not block, This is enough to prove absence of deadlocks and livelocks.
- Lemma 1 (Absence of deadlocks). In a dependency cycle at least one thread will not block.
- Lemma 2 (Absence of livelocks). In a dependency cycle at least one thread will block.
- each CM on the CT abdominal atlas of IRC AD Laparoscopic Center was evaluated using 128 and 256 BSaeklight cores (see Figure 16, Table 2 for its specification).
- the final mesh consists of about 150 X 106 tetrahedra,
- the single-threaded execution time on Blaeklight was 1 ,080 seconds. See Figure 15, Table 1.
- contention overhead time it is the total time that threads spent busy-waiting on a Contention
- ® load balance overhead time it is the total time that threads spent busy- waiting on the
- Begging List waiting for more work to arrive (see discussion elsewhere herein) and accessing the Begging List
- rollback overhead time it is the total time that threads had spent for the partial completion of an operation right before they decided that they had to discard the changes and roll back.
- Random-CM terminated successfully on 128 cores, but it was very slow compared to Global- CM and Local-CM. Indeed, Random-CM exhibits a large number of rollbacks that directly increases both the contention overhead and the rollback overhead. Also, since threads' progress is much slower, threads wai for extra work for much longer, a fact that also increases the load balance overhead considerably. As explained above, Random-CM does not eliminate livelocks, and this is manifested on the 256 core experiment, where a livelock did occur.
- Local-CM performed better. Indeed, observe that the total overhead time is approximately twice as small as Global-CM' s overhead time, This is mainly due to the little contention overhead achieved by Local-CM, Since Global-CM maintains a global Contention List, a thread 7) waits for more time before it gets awaken from another thread for two reasons: (a) because there are more threads in front of 7 ⁇ that need to be awaken first, and (b) because the
- HWS Hierarchical Work Stealing
- HWS Hierarchical Work Stealing scheme
- the Begging List was re- rganized into three levels; BLI , BL2, and BL3. Threads of a single socket that run out of work place themselves into the first le vel begging list BL1 which is shared among threads of a single socket. If the thread realizes that all the other socket threads wait on BL1 , it skips BL1, and places itself to BL2, which is shared among threads of a single blade. Similarly, if the thread realizes that BL2 already accommodates a thread from the other socket in its blade, it asks work by placing itself into the last level begging list BL3.
- BLI is shared among the threads of a single socket and is able to accommodate up to number of threads per socket ⁇ 1 idle threads (in Blacklight, that is 7 threads)
- BL2 is shared among the sockets of a single blade and is able to accommodate up to number of sockets per blade - 1 idle threads (in Biaeklight, that, is I thread).
- BL3 is shared among all the allocated blades and can accommodate at most one thread per blade, in this way, an idle thread tends to take work first from threads inside its socket, if there is none, ⁇ , takes work from a thread of the other socket inside its blade, if any. Finally, if all the other threads inside Ij's blade are idling for extra work, T, places its id to BL3, asking for work from a thread of another blade.
- Figure 17 shows the strong scaling experiment demonstrating both the Random Work Stealing (R.WS) load balance and the Hierarchical Work Stealing (HWS).
- the input image used is the CT abdominal atlas obtained from IRCAD Laparoscopic Center.
- the final mesh generated consists of 124 X 10 6 elements.
- the execution time was 1100 seconds.
- Figure 17(c) shows the breakdown of the overhead time per thread for HWS across runs. Note that since th is is a strong scaling case study, the ideal behavior is a linear increase of the height of the bars with the respect to the number of threads. Observe, however, that the overhead time per thread is always below the overhead time measured on 16 threads. This means that Local-CM and the Hierarchical Work Stealing method (HWS) are able to serve threads fast and tolerate congestion efficiently on runtime.
- HWS Hierarchical Work Stealing method
- the number of tetrahedra created per second across the runs was measured. Specifically, define Elements ( «) and Time (?/), the number of elements created and the time elapsed, when n threads are empioved. Then, the speedup is defined as ⁇ 6 ⁇ . ⁇ - ⁇ — ⁇ . With n threads, a perfect speedup would be equal to n.
- This parameter sets an upper limit on the volume of the tetrahedra generated, With a simple volume argument, one can show that a decrease of ⁇ by a factor of x results in an r' times increase of the mesh size, approximately.
- the Y-axis shows the total number of seconds that threads have spent on useless computation (i.e., rollback, contention, and load balance overhead, see Section 5.5) so far, cumulatively.
- Table 4 shows the performance achieved by the hyper-threaded version of the present code.
- Table 3(a) the same input and parameters were used as the ones used in the experiment shown in Figure 18, Table 3(a). The only difference is that now there are twice as many threads as there were in that Table 3(a).
- hyper-threading achieves a better utilization of the TLB, LLC, and pipeline, there is a considerable slowdown after 64 cores (i.e., 128 hard- ware threads). Observe that hyper-threading expedited the execution for up to 64 cores, indeed, the hyper-threaded version is 47% faster on 64 cores compared to the non hyper- threaded version. Beyond this point, however, there is a considerable slowdown. This slowdown cannot be explained by merely the increase in the number of overhead seconds.
- TetGen is a PLC-based method (see discussion elsewhere herein). That is, TetGen's inputs are triangulated domains that separate the different tissues. For this reason, pass to TetGen the triangulated isosurfaces as recovered by this method, and then let TetGen to fill the underlying volume,
- PI2M, CGAL, and TetGen were run on two different multi-tissue 3D input images obtained from a Hospital Surgical Planning Laboratory. The first is an MR. knee-atlas and the second is a CT head-neck atlas, information about these two inputs is displayed in Figure 27, Table 5. The resulting output meshes generated by the present method PI2M are illustrated in Figure 28.
- Table 6 shows timing and quality statistics for P12 , CGAL, and TetGen. CR.TC (see Figure 16, Table 2 for its specifications) was used for this ease study. The timings reported account for everything but for disk 10 operations. The execution time reported for ⁇ 2 ⁇ incorporates the 1 ,9 seconds and 1.2 seconds time interval needed for the computation of the Euclidean distance transform (see discussion elsewhere herein) for the knee atlas and the head-neck atlas, respectively.
- the sizing parameters of CGAL and TetGen were set to values that produced meshes of similar size to those here, since generally, meshes with more elements exhibit better quality and fidelity,
- the achieved quality of these methods were assessed in terms of radius-edge ratio and dihedral angles. Those metrics are of importance, as they are shown to improve the speed and robustness of medical application solvers dealing with isotropic materials.
- the radius-edge ratio should be low, the minimum dihedral angle should be large, and the maximum dihedral angle should be low.
- the smallest boundary planar angles are also reported. This measures the quality of the mesh boundary, Large smallest boundary planar angles imply better boundary quality.
- PI2M, CGAL, and TetGen allow users to specify the target radius-edge ratio. Apart from TetGen, these methods also allow users to specify the target boundary planar angles. The corresponding parameters were set accordingly, so that the maximum radius-edge ratio is 2 (for PI2M, CGAL, and TetGen), and the smallest boundary planar angle is more than 30° (for PI2M and CGAL only, since TetGen does not give this parameter).
- Fidelity measures how well the mesh boundary represents the isosurfaces. The fidelity achieved by these methods is assessed in terms of the symmetric (double-sided) Hausdorff distance. A low Hausdorff distance implies a good representation. The Hausdorff distance for TetGen is not reported since is the triangular mesh that represents the isosurfaces is given to it as an input.
- TetGen is faster than PI2M only on the knee atlas by a couple of seconds.
- TetGen is slower, Indeed, tor small meshes, the computation of the Euclidean Distance Transform (EDT) accounts for a considerable percentage over the total execution time, a fact that slows down the overall execution time by a lot.
- EDT Euclidean Distance Transform
- the actual meshing time on the knee atlas was just 4.6 sees, ver close to TetGen' s time and rate.
- Another notable observation is thai this method generates meshes with much better dihedral angles and radius- edge ratio than TetGen. The achieved boundary planar angles are similar simply because the PLC' that is given to TetGen was in fact the triangular boundary mesh of ⁇ 2 ⁇ .
- ⁇ 2 ⁇ the first parallel Iniage-to-Mesh (P12M) Conversion Isosurface-based algorithm and its implementation, Starting directly from a multi-label segmented 3D image, it. is abie to recover and mesh both the isosurface do with geometric and topological guarantees (see Theorem 1) and the underlying volume ⁇ with quality elements.
- P12M Iniage-to-Mesh
- Parallel Triangulator tessellate only the convex hull of a set of points
- the present tightly-coupled method greatly reduces the number of rollbacks and scales up to a much higher core count, compared to the tightly-coupled method the present group developed in the past.
- the data decomposition method does not support Delaunay removals, a technique that is shown to be effective in the sequential mesh generation literature.
- the extension of partially-coupled and decoupled methods to 3D is a very difficult task, since Delaunav-admissible 3D medial decomposition is an unsolved problem.
- the present method does not rely on any domain decomposition, and could be extended to arbitrary dimensions as well, indeed, it is planned to extend PI2M to 4 dimensions and generate space- time elements (needed for spatio-temporal simulations) in parallel, thus, exploiting parallelism in the fourth dimension.
- the present approach is highly optimized through carefully designed contention managers, and load balancers which take advantage of NUMA architectures.
- Aspects of the Global Contention Manager (Global-CM) and Local Contention Manager (Local-CM) provably eliminate deadlocks and livelocks. They achieve a speedup even on 256 cores, when other traditional contention managers, found in the mesh generation literature, fail to terminate.
- Local-CM also reduced the number of overhead cycles by a factor of 2 compared to the Global-CM on 256 cores, improving energy- efficiency by avoiding energy waste because of rollbacks.
- the Hierarchical Work Stealing load balancer (HWS) sped up the execution by a factor of 1.45 on 176 cores, as a result of a 22.8% remote accesses reduction.
- PI2M achieves a strong scaling efficiency of more than 82% on 64 cores. It also achieves a weak scaling efficiency of more than 82% on up to 144 cores.
- the inventors are not aware of any 3D parallel Delaunay mesh refinement algorithm achieving such a performance.
- PI2M exhibits excellent single-threaded performance. Despite the extra overhead associated with synchronization, contention management, and load balancing, PI2M generates meshes 40% faster than CGAL and with similar quality. Moreover, P12M achieves better quality than TetGen, and it is also faster than TetGen for large mesh sizes.
- Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.
- Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. Software and other modules may be accessible via local memory, via a network, via a browser or other application in an ASP context, or via other means suitable for the purposes described herein.
- Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein.
- User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure, including the Figures, is implied. In many cases the order of process steps may be varied, and various illustrative steps may be combined, altered, or omitted, without changing the purpose, effect or import of the methods described.
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Radiology & Medical Imaging (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Theoretical Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Human Resources & Organizations (AREA)
- Strategic Management (AREA)
- Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
- General Health & Medical Sciences (AREA)
- Marketing (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- Operations Research (AREA)
- General Business, Economics & Management (AREA)
- Economics (AREA)
- Data Mining & Analysis (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- Apparatus For Radiation Diagnosis (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261692538P | 2012-08-23 | 2012-08-23 | |
| US61/692,538 | 2012-08-23 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2014032008A2 true WO2014032008A2 (fr) | 2014-02-27 |
| WO2014032008A3 WO2014032008A3 (fr) | 2014-06-05 |
Family
ID=50150504
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2013/056473 Ceased WO2014032008A2 (fr) | 2012-08-23 | 2013-08-23 | Procédé et système pour générer un maillage à partir d'images |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2014032008A2 (fr) |
Cited By (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106548512A (zh) * | 2015-09-22 | 2017-03-29 | 中国石油化工股份有限公司 | 网格模型数据的生成方法 |
| CN108986034A (zh) * | 2018-07-02 | 2018-12-11 | 武汉珞珈德毅科技股份有限公司 | 一种栅格数据坐标转换方法、系统、终端设备及存储介质 |
| CN109636912A (zh) * | 2018-11-27 | 2019-04-16 | 中国地质大学(武汉) | 应用于三维声呐图像重构的四面体剖分有限元插值方法 |
| CN109636913A (zh) * | 2018-12-04 | 2019-04-16 | 山东理工大学 | 基于Delaunay剖分的三角网格增量拓扑拼接方法 |
| CN111047106A (zh) * | 2019-12-23 | 2020-04-21 | 南智(重庆)能源技术有限公司 | 井口阀门寿命预测方法 |
| CN113168920A (zh) * | 2018-10-03 | 2021-07-23 | 制定实验室公司 | 用于处理电子图像以确定用于乳房手术的经修改的电子图像的系统和方法 |
| CN117390639A (zh) * | 2023-10-12 | 2024-01-12 | 巨霖科技(上海)有限公司 | 一种非结构化网格加密方法和系统 |
| CN117688425A (zh) * | 2023-12-07 | 2024-03-12 | 重庆大学 | 面向Non-IID图数据的多任务图分类模型构建方法及系统 |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8302098B2 (en) * | 2007-12-06 | 2012-10-30 | Oracle America, Inc. | Hardware utilization-aware thread management in multithreaded computer systems |
| US20120154397A1 (en) * | 2010-12-03 | 2012-06-21 | Old Dominion University Research Foundation | Method and system for generating mesh from images |
| US20120200566A1 (en) * | 2010-12-03 | 2012-08-09 | Old Dominion University Research Foundation | System and method for mesh refinement |
-
2013
- 2013-08-23 WO PCT/US2013/056473 patent/WO2014032008A2/fr not_active Ceased
Cited By (10)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106548512A (zh) * | 2015-09-22 | 2017-03-29 | 中国石油化工股份有限公司 | 网格模型数据的生成方法 |
| CN108986034A (zh) * | 2018-07-02 | 2018-12-11 | 武汉珞珈德毅科技股份有限公司 | 一种栅格数据坐标转换方法、系统、终端设备及存储介质 |
| CN113168920A (zh) * | 2018-10-03 | 2021-07-23 | 制定实验室公司 | 用于处理电子图像以确定用于乳房手术的经修改的电子图像的系统和方法 |
| CN109636912A (zh) * | 2018-11-27 | 2019-04-16 | 中国地质大学(武汉) | 应用于三维声呐图像重构的四面体剖分有限元插值方法 |
| CN109636913A (zh) * | 2018-12-04 | 2019-04-16 | 山东理工大学 | 基于Delaunay剖分的三角网格增量拓扑拼接方法 |
| CN111047106A (zh) * | 2019-12-23 | 2020-04-21 | 南智(重庆)能源技术有限公司 | 井口阀门寿命预测方法 |
| CN111047106B (zh) * | 2019-12-23 | 2023-04-14 | 南智(重庆)能源技术有限公司 | 井口阀门寿命预测方法 |
| CN117390639A (zh) * | 2023-10-12 | 2024-01-12 | 巨霖科技(上海)有限公司 | 一种非结构化网格加密方法和系统 |
| CN117390639B (zh) * | 2023-10-12 | 2024-03-29 | 巨霖科技(上海)有限公司 | 一种非结构化网格加密方法和系统 |
| CN117688425A (zh) * | 2023-12-07 | 2024-03-12 | 重庆大学 | 面向Non-IID图数据的多任务图分类模型构建方法及系统 |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2014032008A3 (fr) | 2014-06-05 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Foteinos et al. | High quality real-time image-to-mesh conversion for finite element simulations | |
| WO2014032008A2 (fr) | Procédé et système pour générer un maillage à partir d'images | |
| Gueunet et al. | Task-based augmented contour trees with fibonacci heaps | |
| Nave et al. | Guaranteed: quality parallel delaunay refinement for restricted polyhedral domains | |
| Kim et al. | A parallel sparse direct solver via hierarchical DAG scheduling | |
| Gupta et al. | Lock-free pending event set management in time warp | |
| Hong et al. | GPU in-memory processing using spark for iterative computation | |
| Alauzet et al. | On the use of space filling curves for parallel anisotropic mesh adaptation | |
| Antonopoulos et al. | Multigrain parallel delaunay mesh generation: challenges and opportunities for multithreaded architectures | |
| El Gharbi et al. | Two-level substructuring and parallel mesh generation for domain decomposition methods | |
| Barnat et al. | Scalable shared memory LTL model checking | |
| Jiménez et al. | Three‐dimensional thinning algorithms on graphics processing units and multicore CPUs | |
| Chen et al. | Task scheduling for multi-core and parallel architectures | |
| Feng et al. | Two-level locality-aware parallel Delaunay image-to-mesh conversion | |
| Feng et al. | A hybrid parallel Delaunay image-to-mesh conversion algorithm scalable on distributed-memory clusters | |
| Chrisochoides et al. | Towards exascale parallel delaunay mesh generation | |
| Chakroun | Parallel heterogeneous Branch and Bound algorithms for multi-core and multi-GPU environments | |
| WO2014032011A2 (fr) | Procédé et système pour générer un maillage en quatre dimensions à partir d'images | |
| Foteinos | Real-time high-quality image to mesh conversion for finite element simulations | |
| Breitbart | Dataflow-like synchronization in a PGAS programming model | |
| CN105138289A (zh) | 计算组件的存储管理方法和装置 | |
| Mahmoud | Unstructured Geometric Data Processing on the GPU Data Structures & Programming Models | |
| Madhukar | A scalable distributed framework for parallel—adaptive spacetime discontinuous Galerkin solvers with application to multiscale earthquake simulation | |
| Ulrich | Parallel iso-surface extraction and simplification | |
| Byrd | Parallel markov chain monte carlo |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13830635 Country of ref document: EP Kind code of ref document: A2 |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 13830635 Country of ref document: EP Kind code of ref document: A2 |