
WO2017210690A1 - Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans - Google Patents

Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans Download PDF

Info

Publication number
WO2017210690A1
Authority
WO
WIPO (PCT)
Prior art keywords
segmentation
pancreas
hnn
organ
maps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2017/035974
Other languages
French (fr)
Inventor
Le LU
Holger Roth
Isabella-Emmanuella NOGUES
Ronald Summers
Xiaosong Wang
Adam P. HARRISON
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of WO2017210690A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/143 Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10072 Tomographic images
    • G06T 2207/10081 Computed x-ray tomography [CT]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10072 Tomographic images
    • G06T 2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20072 Graph-based image processing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30056 Liver; Hepatic
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30004 Biomedical image processing
    • G06T 2207/30092 Stomach; Gastric

Definitions

  • Given an image X, we obtain both interior (HNN-I) and boundary (HNN-B) predictions from the models' side-output layers and the weighted-fusion layer; HNN-I/B(X) denotes the interior/boundary prediction maps estimated by the corresponding network.
  • Segmentation performance can be enhanced if irrelevant regions of the CT volume are pruned out.
  • Conventional organ localization methods using random forest regression, which we explain in Sec. II-B1, may not guarantee that the regressed organ bounding box contains the targeted organ with extremely high sensitivity in pixel-level coverage.
  • In Sec. II-B2 we outline a superpixel-based approach, based on hand-crafted and CNN features, that is able to provide improved performance. While this is effective, the complexity involved motivates our development of a simpler and more accessible multi-view HNN fusion based procedure. This is explained in Sec. II-B3.
  • The output of the localization method will later feed into a more detailed and accurate segmentation method combining multiple mid-level cues from HNNs, as illustrated in Fig. 1.
  • Regression Forest: In object localization by regression, the general idea is to predict an offset vector v(x) for a given image patch I(x) centered about a voxel x. The predicted object position is then given as x + v(x). This is repeated for many examples of image patches, and the predictions are then aggregated to produce a final predicted position. Aggregation can be done with non-maximum suppression on prediction voting maps, mean aggregation, cluster medoid aggregation, or the use of local appearance with discriminative models to accept or reject predictions.
  • The pancreas can be localized by regression because its location in the body correlates with other anatomical structures. The objective is to predict a bounding box defined by the center of the pancreas and by the lower and upper corners of the pancreas bounding box. For a given image patch I(x), the pancreas regression forest predicts these quantities, producing pancreas bounding box candidates.
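  • As a rough illustration of this offset-regression scheme (not the patent's implementation), the sketch below trains a multi-output random forest on precomputed patch descriptors and aggregates the votes with a median, one of the aggregation options listed above; `patch_features`, `patch_centers`, and `organ_centers` are hypothetical inputs.

```python
# Hypothetical sketch of organ localization by offset regression; the patch
# feature extraction is assumed to happen elsewhere.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_localizer(patch_features, patch_centers, organ_centers):
    """Learn to predict the offset v(x) from a patch center x to the organ center."""
    offsets = organ_centers - patch_centers        # v(x), one (dx, dy, dz) row per patch
    rf = RandomForestRegressor(n_estimators=100)
    rf.fit(patch_features, offsets)                # multi-output regression
    return rf

def localize(rf, patch_features, patch_centers):
    """Cast one vote x + v(x) per test patch and aggregate with the median."""
    votes = patch_centers + rf.predict(patch_features)
    return np.median(votes, axis=0)                # robust aggregate of candidate centers
```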
  • Random Forest on Superpixels: As a form of initialization, we alternatively employ a method based on random forest (RF) classification, using both hand-crafted and deep CNN derived image features, to compute candidate bounding box regions. We only operate the RF labeling at a low probability threshold of >0.5, which is sufficient to reject the vast amount of non-pancreas tissue in the CT images. This initial candidate generation is sufficient to extract bounding box regions that nearly surround the pancreas completely in all patient cases, with ~97% recall.
  • The constant balancing weight β used during training of HNN-I is critical in this step: the vast majority of CT slices show no pancreas at all, and they are deliberately included for effective training of HNN-I models in order to suppress pancreas probability values from appearing in the background. Furthermore, we perform a largest connected-component analysis to remove outlier "blobs" of high probability. To get rid of small incorrect connections between high-probability blobs, we first perform an erosion step with a radius of 1 voxel, then select the largest connected component, and subsequently dilate the region again (Fig. 3). HNN-I models are trained in axial, coronal, and sagittal planes in order to make use of the multi-view representation of 3D image context.
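  • A minimal sketch of this candidate-region cleanup, assuming the fused per-voxel pancreas probabilities are available as a 3D NumPy array; the SciPy morphology calls and the helper name are illustrative choices rather than the patent's code.

```python
import numpy as np
from scipy import ndimage

def candidate_bounding_box(prob, threshold=0.5):
    """Threshold, clean up, and box the fused 3D pancreas probability map."""
    mask = prob > threshold                        # keep blobs of high probability
    mask = ndimage.binary_erosion(mask)            # 1-voxel erosion breaks thin bridges
    labels, n = ndimage.label(mask)                # 3D connected components
    if n == 0:
        return None
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                   # ignore the background component
    largest = labels == np.argmax(sizes)           # largest connected component
    largest = ndimage.binary_dilation(largest)     # dilate the region again
    coords = np.argwhere(largest)
    return coords.min(axis=0), coords.max(axis=0)  # lower/upper corners of the box
```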
  • Multiscale combinatorial grouping (MCG) is one of the state-of-the-art methods for generating segmentation object proposals in computer vision. We utilize this approach to generate organ-specific superpixels based on the learned boundary prediction maps from HNN-B. Superpixels are extracted via a continuous oriented watershed transform at three different scales (Y_side^B2, Y_side^B3, and Y_fuse^B) supervisedly learned by HNN-B. This allows the computation of a hierarchy of superpixel partitions at each scale, and merges superpixels across scales, thereby efficiently exploring their combinatorial space. This, in turn, allows MCG to group the merged superpixels toward object proposals. We find that the first two levels of MCG object proposals are sufficient to achieve ~88% DSC (see Table IV and Fig. 5), with the optimally computed superpixel labels using their spatial overlapping ratios against the segmentation ground truth map. All merged superpixels S from the first two levels are used for the subsequent spatial aggregation step, sketched below. Note that HNN-B can only be trained using axial slices, where the manual annotation was performed; pancreas boundary maps in coronal and sagittal views can display strong artifacts.
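  • The sketch below illustrates one plausible form of this superpixel-level spatial aggregation: per-superpixel statistics pooled from the HNN-I and HNN-B maps feed a random forest classifier. The exact feature set, classifier settings, and input names (`sp`, `p_i`, `p_b`) are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def superpixel_features(sp, p_i, p_b):
    """Pool interior (p_i) and boundary (p_b) map statistics over each superpixel."""
    feats = []
    for s in np.unique(sp):
        m = sp == s                                  # voxels of one superpixel
        feats.append([p_i[m].mean(), p_i[m].max(),   # interior cue statistics
                      p_b[m].mean(), p_b[m].max()])  # boundary cue statistics
    return np.asarray(feats)

def train_aggregator(feats, labels):
    """Labels derive from each superpixel's overlap with the ground-truth mask."""
    clf = RandomForestClassifier(n_estimators=200)
    clf.fit(feats, labels)
    return clf                                       # predicts pancreas vs. non-pancreas
```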
  • Table I shows the test performance of pancreas localization and bounding box prediction using regression forests in DSC and average Euclidean distance against the gold standard bounding boxes.
  • Regression forest based localization generates 16 out of 82 bounding boxes that lie below 60% in pixel-to-pixel recall against the ground-truth pancreas masks. Nevertheless, we obtain nearly 100% recall for all scans (except for two cases, both >94.54%) through the multi-view max-pooled HNN-Is.
  • An example of a detected pancreas can be seen in Fig. 6.
  • HNN Spatial Aggregation for Pancreas Segmentation: The interior HNN models trained on the axial (AX), coronal (CO), or sagittal (SA) CT images in Sec. II-B3 can be straightforwardly used to generate pancreas segmentation masks. We evaluate the following pooling variants: AX, CO, SA (any single-view HNN-I probability map used alone); mean(AX,CO), mean(AX,SA), mean(CO,SA), and mean(AX,CO,SA) (element-wise mean of two or three view HNN-I probability maps); max(AX,CO,SA) (element-wise maximum of the three view HNN-I probability maps); and finally meanmax(AX,CO,SA) (element-wise mean of the maximal two scores from the three view HNN-I probability maps).
  • meanmax(AX,CO,SA) produces the best performance in mean DSC; it may behave as a robust fusion function by rejecting the smallest probability value and averaging the remaining two HNN-I scores per pixel location. These pooling functions admit a direct element-wise implementation, sketched below.
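  • A short sketch of these pooling functions, assuming the axial, coronal, and sagittal HNN-I probability volumes (`ax`, `co`, `sa`) have already been resampled onto a common 3D grid; the function names are ours.

```python
import numpy as np

def mean_pool(*views):
    return np.mean(np.stack(views), axis=0)   # element-wise mean of 2 or 3 views

def max_pool(*views):
    return np.max(np.stack(views), axis=0)    # element-wise maximum of the views

def meanmax_pool(ax, co, sa):
    """Drop the smallest of the three per-voxel scores, average the other two."""
    stacked = np.sort(np.stack([ax, co, sa]), axis=0)  # ascending along view axis
    return stacked[1:].mean(axis=0)                    # mean of the two largest
```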
  • Table IV shows the improvement from the meanmax-pooled HNN-Is (i.e., HNNmeanmax) to the HNN-RF based spatial aggregation, using DSC and average minimum surface-to-surface distance (AVGDIST).
  • The average DSC is increased from 81.14% to 81.27%. However, this improvement is not statistically significant (p > 0.05, Wilcoxon signed-rank test).
  • the new candidate region bounding box generation method (Sec. II-B3) works comparably to the hybrid technique (Sec. II-B2) based on our empirical evaluation.
  • The proposed pancreas localization via multi-view max-pooled HNNs greatly simplifies our overall pancreas segmentation system, which may also help its generality and reproducibility.
  • The HNNmeanmax variant produces competitive segmentation accuracy but merely involves evaluating two sets of multi-view HNN-Is at two spatial scales: whole CT slices and truncated bounding boxes. There is no need to compute any handcrafted image features or train other external machine-learning classifiers.
  • The conventional organ localization framework using regression forest does not serve well the purpose of candidate region generation for segmentation, where extremely high pixel-to-pixel recall is required, since it is mainly designed for organ detection.
  • In Table V, the quantitative pancreas segmentation performance of the two method variants, HNNmeanmax and HNN-RF spatial aggregation, is evaluated using four metrics: DSC (%), Jaccard index (%), Hausdorff distance (HDRFDST [mm]), and AVGDIST [mm]. Note that there is no statistical significance when comparing the two variants on three of the measures (DSC, Jaccard, and AVGDIST); the exception is HDRFDST, with p < 0.001 under the Wilcoxon signed-rank test. Since the Hausdorff distance represents the maximum deviation between two point sets or surfaces, this observation indicates that HNN-RF may be more robust than HNNmeanmax in the worst-case scenario. The overlap metrics are sketched below.
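  • For reference, a small sketch of the evaluation metrics above on binary NumPy masks; this is a generic illustration of DSC, Jaccard, and the symmetric Hausdorff distance, not the patent's evaluation code.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(a, b):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())   # Dice similarity coefficient

def jaccard(a, b):
    inter = np.logical_and(a, b).sum()
    return inter / np.logical_or(a, b).sum()   # Jaccard index (IoU)

def hausdorff(a, b):
    pa, pb = np.argwhere(a), np.argwhere(b)    # foreground voxel coordinates
    return max(directed_hausdorff(pa, pb)[0],
               directed_hausdorff(pb, pa)[0])  # symmetric Hausdorff distance
```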
  • Deep network representations with direct 3D input suffer from the curse of dimensionality and are more prone to overfitting; volumetric object detection might also require more training data and might suffer from scalability issues. Proper hyper-parameter tuning of the CNN architecture and enough training data (including data augmentation) might help eliminate these problems. In the meantime, spatial aggregation of interior holistically-nested networks (HNN-I) in multiple 2D views can be a very efficient alternative.
  • HNN-RF incorporates the organ boundary responses from the HNN-B model and significantly improves the worst case pancreas segmentation accuracy in Hausdorff distance (p ⁇ 0.001).
  • The highest reported DSC of 81.27±6.27% is achieved, at a computational cost of 2-3 minutes, not hours.
  • Our deep learning based organ segmentation approach could be generalizable to other segmentation problems with large variations and pathologies, e.g., pathological organs and tumors, especially given that the pancreas is one of the most difficult organs to segment.
  • Segmentation of lymph nodes (LNs) is also an important challenge in medical image analysis.
  • LN segmentation and volume measurement play a crucial role in important medical imaging based diagnosis tasks, such as quantitatively evaluating disease progression and the effectiveness of a given treatment or therapy.
  • Enlarged LNs greater than 10 mm on a CT slice signal the onset or progression of a malignant disease or an infection.
  • LN segmentation is highly complex, tedious, and time consuming. For example, weak intensity contrast renders the boundaries of distinct agglomerated LNs ambiguous, as shown in FIG. 10.
  • The methods disclosed herein provide a fully-automated method for thoracoabdominal (TA) lymph node cluster (LNC) segmentation. More importantly, the segmentation task is formulated as a flexible, bottom-up image binary classification problem that can be effectively solved using deep CNNs and graph-based structured optimization and inference. The disclosed methods can handle all variations in LNC size and spatial configuration. Furthermore, they are well-suited for measuring agglomerated LNs, whose ambiguous boundaries compromise the accuracy of diameter measurement.
  • the segmentation framework for lymph nodes can be similar to what is described elsewhere herein for the pancreas and other organs. More information regarding applications to lymph node segmentation can be found in U.S. Provisional Patent Application No. 62/345,606, filed June 3, 2016, which is incorporated by reference herein.
  • a computer or other processing system comprising a processor and memory, such as a personal computer, a workstation, a mobile computing device, or a networked computer, can be used to perform the methods disclosed herein, including any combination of CT or MR imaging acquisition, imaging processing, imaging data analysis, data storage, and output/display of results (e.g., segmentation maps, etc.).
  • the computer or processing system may include a hard disk, a removable storage medium such as a floppy disk or CD-ROM, and/or other memory such as random access memory (RAM).
  • Computer-executable instructions for causing a computing system to execute the disclosed methods can be provided on any form of tangible and/or non-transitory data storage media, and/or delivered to the computing system via a local area network, the Internet, or other network. Any associated computing process or method step can be performed with distributed processing. For example, extracting information from the imaging data and producing segmentation maps can be performed at different locations and/or using different computing systems.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Radiology & Medical Imaging (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

Disclosed are systems and methods for localization and segmentation of organs (especially abnormally shaped, deformable, and/or smaller organs, such as the pancreas and lymph nodes) based on data from 3D medical imaging (e.g., CT and MRI scans) using holistically-nested convolutional neural networks ("HNNs"). Using as an example CT scan data and the pancreas, the methods can include localizing an organ from an entire 3D CT scan, providing a reliable bounding box for the more refined segmentation step. The methods can further comprise introducing a fully deep-learning approach, based on an efficient application of HNNs on the three orthogonal views. The resulting HNN per-pixel probability maps can then be fused using pooling to reliably produce a 3D bounding box of the pancreas that maximizes the recall. An introduced localizer compares favorably to both a conventional non-deep-learning method and a hybrid approach based on spatial aggregation of superpixels using random forest classification. The segmentation phase can operate within the computed bounding box and can integrate semantic mid-level cues of deeply-learned organ interior and boundary maps, obtained by two additional and separate realizations of HNNs. By integrating these two mid-level cues, the disclosed methods are capable of generating boundary-preserving pixel-wise class label maps that result in exceptional final organ segmentations.

Description

SPATIAL AGGREGATION OF HOLISTICALLY-NESTED CONVOLUTIONAL NEURAL NETWORKS FOR AUTOMATED ORGAN LOCALIZATION AND SEGMENTATION IN 3D MEDICAL SCANS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application No. 62/345,606, filed June 3, 2016, and U.S. Provisional Patent Application No. 62/450,681, filed January 26, 2017, both of which are incorporated by reference herein in their entireties.
FIELD
This application is related to methods and systems for organ localization and segmentation using medical imaging data.
BACKGROUND
Segmentation of the pancreas, lymph nodes, and other organs in computed tomography (CT) challenges current computer-aided diagnosis (CAD) systems. While automatic segmentation of numerous other organs in CT scans, such as the liver, heart, or kidneys, achieves good performance with Dice similarity coefficients (DSCs) of >90%, segmentation of other organs is more difficult; for example, the pancreas' variable shape, size, and location in the abdomen limit segmentation accuracy, with <73% DSC reported in the literature. Previous pancreas segmentation works have all been based on performing volumetric multiple atlas registration and executing robust label fusion methods to optimize the per-pixel organ labeling process. This type of organ segmentation strategy is widely used for many organ segmentation problems, such as the brain, heart, lung, and pancreas. These methods can be referred to as a top-down model fitting approach, or more specifically, MALF (Multi-Atlas Registration & Label Fusion). Another group of top-down frameworks leverages statistical model detection, e.g., generalized Hough transform or marginal space learning for organ localization, and deformable statistical shape models for object segmentation. However, due to the intrinsic 3D shape variability of the pancreas, lymph nodes, and other organs, statistical shape modeling has not been applied for segmentation of these organs.
SUMMARY
Accurate and automatic organ segmentation from 3D radiological scans is an important yet challenging problem for medical image analysis. Specifically, as a small, soft, and flexible abdominal organ, the pancreas demonstrates very high inter-patient anatomical variability in both its shape and volume. This inhibits traditional automated segmentation methods from achieving high accuracies, especially compared to the performance obtained for other organs, such as the liver, heart, or kidneys. To fill this gap, we present an automated segmentation system for 3D computed tomography (CT) volumes that is based on a two-stage cascaded approach comprising pancreas localization and pancreas segmentation. For the first step, we localize the pancreas from the entire 3D CT scan, providing a reliable bounding box for the more refined segmentation step. We introduce a fully deep-learning approach, based on an efficient application of holistically-nested convolutional networks (HNNs) on the three orthogonal axial, sagittal, and coronal views. The resulting HNN per-pixel probability maps are then fused using pooling to reliably produce a 3D bounding box of the pancreas that maximizes the recall. We show that our introduced localizer compares favorably to both a conventional non-deep-learning method and a recent hybrid approach based on spatial aggregation of superpixels using random forest classification. The second, segmentation, phase operates within the computed bounding box and integrates semantic mid-level cues of deeply-learned organ interior and boundary maps, obtained by two additional and separate realizations of HNNs. By integrating these two mid-level cues, our method is capable of generating boundary-preserving pixel-wise class label maps that result in exceptional final pancreas segmentations. Quantitative evaluation is performed on a publicly available dataset of 82 patient CT scans using 4-fold cross-validation (CV). We achieve a (mean ± std. dev.) Dice similarity coefficient (DSC) of 81.27±6.27% in validation, which significantly outperforms both a previous state-of-the-art method and a preliminary version of this work, which report DSCs of 71.80±10.70% and 78.01±8.20%, respectively, using the same dataset.
The foregoing and other objects, features, and advantages of the disclosed technology will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart of the proposed two-stage pancreas localization and segmentation framework. Sec. II-B2 and Sec. II-B3 are the alternative means of bottom-up organ localization. The remaining modules are for pancreas segmentation.
FIG. 2 illustrates the HNN-I/B network architecture for both interior (left images) and boundary (right images) detection pathways. We highlight the error back-propagation paths to illustrate the deep supervision performed at each side-output layer after the corresponding convolutional layer. As the side-outputs become smaller, the receptive field sizes get larger. This allows HNN to combine multi-scale and multi-level outputs in a learned weighted fusion layer. The ground truth images are inverted for aided visualization.
FIG. 3 illustrates a candidate bounding box region generation pipeline (left to right). Gold standard pancreas in red.
FIG. 4 illustrates a candidate bounding box region generation. Gold standard pancreas in red, blobs of >0.5 probability in green, the selected largest 3D connected component in purple, the resulting candidate bounding box in yellow.
FIG. 5 is a multiscale combinatorial grouping (MCG) on three different scales of learned boundary prediction maps from HNN-B (Y_side^B2, Y_side^B3, and Y_fuse^B), using the original CT image on the far left as input (with ground truth delineation of pancreas in red). MCG computes superpixels at each scale and produces a set of merged superpixel-based object proposals. We only visualize the boundary probabilities whose values are greater than 0.10.
FIG. 6 (a)-(f) illustrate an example for comparison of regression forest (RF, a-c) and HNN-I (d-f) for pancreas localization. Green and red boxes are ground truth and detected bounding boxes respectively. The green dot denotes the ground truth center. This case demonstrates a case in the 90th percentile in RF localization distance and serves as a representative of poorly performing localization. In contrast, HNN-I includes all of the pancreas with nearly 100% recall in this case.
FIG. 7 (a) and (b) are histogram plots (Y-axis) of regression forest based bounding boxes (a) and HNN-I's generated bounding boxes (b) in recalls (X-axis) covering the ground-truth pancreas masks in 3D. Note that regression forest produces 16 out of 82 bounding boxes that lie below 60% in pixel-to-pixel recall, while HNN-I produces 100% recalls, except for two cases >94.54%.
FIG. 8A-8C illustrate average DSC performance as a function of pancreas probability using HNNmeanmax (8A) and spatial aggregation via RF (8B) for comparison. Note that the DSC performance remains much more stable after RF aggregation with respect to the probability threshold. The percentage of total cases that lie above a certain DSC with RF are shown in 8C: 80% of the cases have a DSC of 78.05%, and 90% of the cases have a DSC of 74.22% and higher.
FIG. 9 (a)-(d) shows examples of our HNN-RF pancreas segmentation results (green) comparing with the ground-truth annotation (red). The best performing case (a), two cases with DSC scores close to the data set mean (b,c) and the worst case are shown (d).
FIG. 10 shows examples of thoracoabdominal lymph node clusters in CT images with ground truth (red) boundaries.
FIG. 11 illustrates disclosed frameworks integrating trained holistically-nested neural networks that capture the interior appearance and boundary cues of the organ to segment, via structured optimization (a) or superpixel based spatial aggregation (b). In (a), three different grid-structured representation and optimization methods are used and evaluated, namely dense CRF, graph cuts (GC), and boundary neural fields (BNF). For dense CRF, the pairwise energy terms are not learned but directly computed from the CT intensity contrast and pixel distance measurements. In both GC and BNF, the learned outputs of HNN-B are incorporated into the pairwise interactions between pixels within a large spatial neighborhood, e.g., 20x20. In (b), the spatial aggregation is performed on the enforced boundary-preserving superpixels computed using multiscale boundary maps from HNN-B.
FIG. 12 illustrates, at left, a cropped CT image with a lymph node located in the center, together with its associated HNN-A (middle) and HNN-B before non-maximum suppression (right) maps.
FIG. 13 shows examples of LN CT image segmentation. Top: original CT images with ground truth (red) and BNF segmented (green) boundaries. Bottom: HNN-A LN probability maps. Left and Middle depict successful segmentation results. Right represents an unsuccessful case.
FIG. 14 shows three graphs comparing ground truth and predicted LN volumes, for (a) HNN-A, (b) boundary neural fields, and (c) graph cuts.
FIG. 15 shows examples of lymph node thoracoabdominal CT image segmentation. Top: original CT images with ground truth (red) and BNF segmented (green) boundaries. Bottom: HNN-A lymph node probability maps (red: probability 1; blue: probability 0).
DETAILED DESCRIPTION
I. Introduction
Pancreas segmentation in computed tomography (CT) challenges current computer-aided diagnosis (CAD) systems. While automatic segmentation of numerous other organs in CT scans, such as the liver, heart, or kidneys, achieves good performance with Dice similarity coefficients (DSCs) of >90%, the pancreas' variable shape, size, and location in the abdomen limit segmentation accuracy, with <73% DSC reported in the literature. Previous pancreas segmentation works are all based on performing volumetric multiple atlas registration and executing robust label fusion methods to optimize the per-pixel organ labeling process. This type of organ segmentation strategy is widely used for many organ segmentation problems, such as the brain, heart, lung, and pancreas. These methods can be referred to as a top-down model fitting approach, or more specifically, MALF (Multi-Atlas Registration & Label Fusion). Another group of top-down frameworks leverages statistical model detection, e.g., generalized Hough transform or marginal space learning for organ localization, and deformable statistical shape models for object segmentation. However, due to the huge intrinsic 3D shape variability of the pancreas, statistical shape modeling has not been applied for pancreas segmentation.
Recently, a new bottom-up pancreas segmentation representation has been proposed, which uses dense binary image patch labeling confidence maps that are aggregated to classify image regions, or superpixels, into pancreas and non-pancreas label assignments. This method's motivation is to improve segmentation accuracy of highly deformable organs, such as the pancreas, by leveraging mid-level visual representations of image segments. This work was advanced further by Roth et al., who proposed a probabilistic bottom-up approach using a set of multi-scale and multi-level deep convolutional neural networks (CNNs) to capture the complexity of pancreas appearance in CT images. The resulting system improved the performance, with a reported DSC of 71.8±10.7% against 68.8±25.6%. Compared to the MALF based pancreas segmentation works that are evaluated using a "leave-one-patient-out" (LOO) protocol, the bottom-up approaches using superpixel representation have reported comparable or higher DSC accuracy measurements under more challenging 6-fold or 4-fold cross-validation. LOO can be considered an extreme case of M-fold cross-validation with M = N, when N patient datasets are available for experiments. When M decreases and becomes significantly smaller than N, M-fold CV becomes more challenging, since there is less data for training and there are more patient cases for testing. Comparing the two bottom-up approaches, the usage of deep CNN models has noticeably improved the performance stability, as evidenced by a significantly smaller standard deviation than all other top-down or bottom-up works.
Deep CNNs have successfully been applied to many high-level tasks in medical imaging, such as recognition and object detection. The main advantage of CNNs comes from the fact that end-to-end learning of salient feature representations for the task at hand is more effective than handcrafted features with heuristically tuned parameters. Similarly, CNNs demonstrate promising performance for pixel-level labeling problems, e.g., semantic segmentation in recent computer vision and medical imaging analysis work, e.g., fully convolutional neural networks (FCN), DeepLab, and U-Net. These approaches have all garnered significant improvements in performance over previous methods by applying state-of-the-art CNN-based image classifiers and representation to the semantic segmentation problem in both domains.
Semantic organ segmentation involves assigning a label to each pixel in the image. On one hand, features for classification of single pixels (or patches) play a major role, but on the other hand, factors such as edges, i.e., organ boundaries, appearance consistency, and spatial consistency, could greatly impact the overall system performance. Furthermore, there are indications of semantic vision tasks requiring hierarchical levels of visual perception and abstraction. As such, generating rich feature hierarchies for both the interior and the boundary of the organ could provide important "mid-level visual cues" for semantic segmentation. Subsequent spatial aggregation of these mid- level cues then has the prospect of improving semantic segmentation methods by enhancing the accuracy and consistency of pixel-level labeling.
We have also demonstrated that a two-stage bottom-up localization and segmentation approach can improve upon the state of the art. Here, the major extension is that we describe an improved pancreas localization method, replacing the initial superpixel-based one with a new, general deep learning based approach. This methodological component is designed to optimize or maximize the pancreas spatial recall criterion while reducing the non-pancreas volume as much as possible. Specifically, we generate the per-pixel pancreas class probability maps (or "heat maps") through an efficient combination of holistically-nested convolutional networks (HNNs) in the three orthogonal axial, sagittal, and coronal CT views. We fuse the three HNN outputs to produce a 3D bounding box covering the underlying, yet latent in testing, pancreas volume by nearly 100%. In addition, we show that exactly the same HNN model architecture can be effective for the subsequent pancreas segmentation stage by integrating both deeply learned boundary and appearance cues. This also results in a simpler overall pancreas localization and segmentation system using HNNs only, rather than the previous hybrid setup involving non-deep- and deep-learning method components. Lastly, our current method reports an overall improved DSC performance compared to other methods: e.g., a DSC of 81.14±7.3% versus 78.0±8.2% and 71.8±10.7%.
The disclosed two-stage process essentially performs 3D spatial aggregation and assembling on the HNN-produced per-pixel pancreas probability maps that run on 2D axial, coronal, and sagittal CT planes. This process operates exhaustively for pancreas localization and selectively for pancreas segmentation. Therefore, this work inherits a hierarchical and compositional visual representation of computing 3D object information aggregated from 2D image slices or parts.
Alternatively, there are recent studies on directly using 3D convolutional neural networks for liver and brain segmentation and for volumetric vascular boundary detection. Due to CNN memory restrictions, these 3D CNN approaches adopt padded sliding windows or volumes to process the original CT scans, such as 96x96x48 segments, 160x160x72 subvolumes, and 80x80x80 windows, which may cause segmentation discontinuities or inconsistencies at overlapped window boundaries. We argue that learning shareable lower-dimensional 2D CNN models may be more generalizable and handle the "curse-of-dimensionality" issue better than their fully 3D counterparts, especially when used to parse complex 3D anatomical structures, e.g., lymph node clusters and the pancreas. Analogous examples of comparing compositional multi-view 2D CNNs versus direct 3D deep models can be found in other computer vision problems: 1) video based action recognition, where a two-stream 2D CNN model, capturing the image intensity and motion cues, significantly improves upon the 3D CNN method; 2) the advantageous performance of multi-view CNNs over volumetric CNNs in 3D shape recognition.
II. Methods
In this work, we present a two-phased approach for automated pancreas localization and segmentation. The pancreas localization step aims to robustly compute a bounding box which, at the desirable setting, should cover the entire pancreas while pruning the large majority of the volumetric space from any input CT scan, without any manual pre-processing. The second stage of pancreas segmentation incorporates deeply learned organ interior and boundary mid-level cues with subsequent spatial aggregation, focusing only on the properly zoomed or cascaded pancreas location and spatial extents that are generated after the first phase. In Sec. II-A we introduce the HNN model that proves effective for both stages. Afterwards, we focus on localization in Sec. II-B, which discusses and contrasts a conventional approach to localization with newer CNN-based ones: a hybrid and a fully deep-learning approach. We show how the latter approach, which relies on HNNs, provides a simple, yet state-of-the-art, localization method. Importantly, it relies on the same HNN architecture as the later segmentation step. With localization discussed, we explain our segmentation approach in Sec. II-C, which relies on combining semantic mid-level cues produced from HNNs. Our approach to organ segmentation is based on simple, reproducible, yet effective, machine-learning principles. In particular, we demonstrate that the most effective configuration of our system is simply composed of cascading and aggregating outputs from six HNNs trained at three orthogonal views and two spatial scales. No multi-atlas registration or multi-label fusion techniques are employed. Fig. 1 provides a flowchart depicting the makeup of our system.
A. Learning Mid-level Cues via Holistically-Nested Networks for Localization and Segmentation
We use the HNN architecture to learn the pancreas' interior and boundary image-labeling maps, for both localization and segmentation. Object-level interior and boundary information is referred to as mid-level visual cues. Note that this type of CNN architecture was first proposed under the name "holistically-nested edge detection" (HED) as a deep learning based general image edge detection method, and it has been used successfully for extracting "edge-like" structures such as blood vessels in 2D retina images. We argue and validate, however, that it can serve as a suitable deep representation to learn general raw pixel-in, label-out mapping functions, i.e., to perform semantic segmentation. We use these principles to segment the interior of organs. HNN can address two important issues: (1) training and prediction on the whole image, end-to-end, i.e., holistically, using a per-pixel labeling cost; and (2) incorporating multi-scale and multi-level learning of deep image features via auxiliary cost functions at each convolutional layer. HNN computes image-to-image or pixel-to-pixel prediction maps from any input raw image to its annotated labeling map, building on fully convolutional neural networks (FCNs) and deeply-supervised nets. The per-pixel labeling cost function makes it feasible for HNN/FCN to be effectively trained using only several hundred annotated image pairs. This enables the automatic learning of rich hierarchical feature representations and contexts that are critical to resolving spatial ambiguity in the segmentation of organs. The network structure is initialized from an ImageNet pre-trained VGGNet model. Fine-tuning CNNs pre-trained on general image classification tasks is helpful for low-level tasks, e.g., edge detection. Furthermore, we can utilize pre-trained edge-detection networks (trained on BSDS500, for example) to segment organ-specific boundaries.
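For concreteness, the following sketch shows how an HNN/HED-style network with deeply-supervised side outputs can be assembled from an ImageNet pre-trained VGG16 backbone, here in PyTorch. The stage boundaries, single-channel 1x1 side-output convolutions, and the uniformly initialized weighted-fusion layer follow the published HED design; all class and variable names are illustrative rather than the exact disclosed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class HNN(nn.Module):
    """HED-style net: VGG16 stages + 1x1-conv side outputs + learned fusion."""

    def __init__(self):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # Five convolutional stages of VGG16, split at the pooling layers,
        # giving output strides of 1, 2, 4, 8 and 16 respectively.
        self.stages = nn.ModuleList([
            features[:4], features[4:9], features[9:16],
            features[16:23], features[23:30],
        ])
        # One 1x1 convolution per stage producing a single-channel side output.
        self.side = nn.ModuleList([
            nn.Conv2d(c, 1, kernel_size=1) for c in (64, 128, 256, 512, 512)
        ])
        # Weighted-fusion layer over the M = 5 upsampled side outputs.
        self.fuse = nn.Conv2d(5, 1, kernel_size=1, bias=False)
        nn.init.constant_(self.fuse.weight, 1.0 / 5)

    def forward(self, x):
        h, w = x.shape[2:]
        side_outputs = []
        for stage, side in zip(self.stages, self.side):
            x = stage(x)
            # Upsample each side-output activation map back to input size.
            s = F.interpolate(side(x), size=(h, w),
                              mode="bilinear", align_corners=False)
            side_outputs.append(s)
        fused = self.fuse(torch.cat(side_outputs, dim=1))
        return fused, side_outputs  # logits; apply sigmoid for probabilities
```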
Network formulation: Our training data is

$$S = \{(X_n, Y_n^I, Y_n^B),\ n = 1, \ldots, N\},$$

where $X_n$ denotes the cropped axial CT images, rescaled to within $[0, \ldots, 255]$ using a soft-tissue window, and $Y_n^I$ and $Y_n^B$ denote the binary ground truths of the interior and boundary map of the pancreas, respectively, for any corresponding $X_n$. Each image is considered holistically and independently. The network is able to learn features from these images alone, from which interior and boundary prediction maps can be produced; we denote the corresponding models as HNN-I and HNN-B, respectively. HNN can efficiently generate multi-level image features due to its deep architecture. Furthermore, multiple stages with different convolutional strides can capture the inherent scales of organ edge/interior labeling maps. However, due to the difficulty of learning such deep neural networks with multiple stages from scratch, we use the pre-trained network, fine-tuned to our specific training data sets with a relatively small learning rate of $10^{-6}$. We use the HNN network architecture with 5 stages, with strides of 1, 2, 4, 8 and 16, respectively, and with different receptive field sizes. In addition to standard CNN layers, a HNN network has $M$ side-output layers as shown in Fig. 2. These side-output layers are also realized as classifiers, whose corresponding weights are $\mathbf{w} = (\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(M)})$. For simplicity, all standard network layer parameters are denoted as $\mathbf{W}$. Hence, the following objective function can be defined:

$$\mathcal{L}_{\mathrm{side}}(\mathbf{W}, \mathbf{w}) = \sum_{m=1}^{M} \alpha_m\, \ell_{\mathrm{side}}^{(m)}(\mathbf{W}, \mathbf{w}^{(m)}) \qquad (1)$$
Here, $\ell_{\mathrm{side}}$ denotes an image-level loss function for side-outputs, computed over all pixels in a training image pair $X$ and $Y$. Because of the heavy bias towards non-labeled pixels in the ground truth data, we apply a strategy to automatically balance the loss between positive and negative classes via a per-pixel class-balancing weight $\beta$. This offsets the imbalances between edge/interior ($y = 1$) and non-edge/exterior ($y = 0$) samples. Specifically, a class-balanced cross-entropy loss function can be used in Eq. (1) above, with $j$ iterating over the spatial dimensions of the image:

$$\ell_{\mathrm{side}}^{(m)}(\mathbf{W}, \mathbf{w}^{(m)}) = -\beta \sum_{j \in Y_+} \log \Pr(y_j = 1 \mid X; \mathbf{W}, \mathbf{w}^{(m)}) - (1 - \beta) \sum_{j \in Y_-} \log \Pr(y_j = 0 \mid X; \mathbf{W}, \mathbf{w}^{(m)}) \qquad (2)$$

Here, $\beta$ is simply $|Y_-|/|Y|$ and $1 - \beta = |Y_+|/|Y|$, where $Y_-$ and $Y_+$ denote the ground truth sets of negatives and positives, respectively. In contrast to examples where $\beta$ is computed for each training image independently, we use a constant balancing weight computed on the entire training set. This is because some training slices might have no positives at all and would otherwise be ignored in the loss function. The class probability $\Pr(y_j = 1 \mid X; \mathbf{W}, \mathbf{w}^{(m)}) = \sigma(a_j^{(m)}) \in [0, 1]$ is computed on the activation value at each pixel $j$ using the sigmoid function $\sigma(\cdot)$. Now, organ edge/interior map predictions $\hat{Y}_{\mathrm{side}}^{(m)} = \sigma(\hat{A}_{\mathrm{side}}^{(m)})$ can be obtained at each side-output layer, where $\hat{A}_{\mathrm{side}}^{(m)} \equiv \{a_j^{(m)},\ j = 1, \ldots, |Y|\}$ are the activations of the side-output of layer $m$. Finally, a "weighted-fusion" layer is added to the network that can be simultaneously learned during training. The loss function at the fusion layer $\mathcal{L}_{\mathrm{fuse}}$ is defined as

$$\mathcal{L}_{\mathrm{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{h}) = \mathrm{Dist}\left(Y, \hat{Y}_{\mathrm{fuse}}\right), \quad \hat{Y}_{\mathrm{fuse}} \equiv \sigma\!\left(\sum_{m=1}^{M} h_m \hat{A}_{\mathrm{side}}^{(m)}\right) \qquad (3)$$

with $\mathbf{h} = (h_1, \ldots, h_M)$ being the fusion weight. $\mathrm{Dist}(\cdot, \cdot)$ is a distance measure between the fused predictions and the ground truth label map. We use cross-entropy loss for this purpose. Hence, the following objective function can be minimized via standard stochastic gradient descent and back-propagation:

$$(\mathbf{W}, \mathbf{w}, \mathbf{h})^\star = \operatorname{argmin}\left(\mathcal{L}_{\mathrm{side}}(\mathbf{W}, \mathbf{w}) + \mathcal{L}_{\mathrm{fuse}}(\mathbf{W}, \mathbf{w}, \mathbf{h})\right) \qquad (4)$$
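A minimal sketch of Eqs. (2)-(4), with the constant balancing weight β computed once over the entire training set as described above (function names are illustrative; the α_m weights default to 1):

```python
import torch

def dataset_beta(ground_truth_masks):
    """Constant balancing weight: beta = |Y-| / |Y| over the whole training set."""
    total = sum(m.numel() for m in ground_truth_masks)
    positives = sum(int(m.sum()) for m in ground_truth_masks)
    return (total - positives) / total

def class_balanced_bce(logits, target, beta):
    """Eq. (2): -beta * sum_{Y+} log p  -  (1 - beta) * sum_{Y-} log(1 - p)."""
    target = target.float()
    p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    pos = -(beta * target * torch.log(p)).sum()
    neg = -((1 - beta) * (1 - target) * torch.log(1 - p)).sum()
    return pos + neg

def hnn_objective(fused, side_outputs, target, beta, alphas=None):
    """Eq. (4): side-output losses (Eq. (2)) plus fused-prediction loss (Eq. (3))."""
    alphas = alphas or [1.0] * len(side_outputs)
    loss = class_balanced_bce(fused, target, beta)
    for a, s in zip(alphas, side_outputs):
        loss = loss + a * class_balanced_bce(s, target, beta)
    return loss
```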
Testing phase: Given image $X$, we obtain both interior (HNN-I) and boundary (HNN-B) predictions from the models' side-output layers and the weighted-fusion layer:

$$\left(\hat{Y}_{\mathrm{fuse}}, \hat{Y}_{\mathrm{side}}^{(1)}, \ldots, \hat{Y}_{\mathrm{side}}^{(M)}\right) = \text{HNN-I/B}(X) \qquad (5)$$

where HNN-I/B$(\cdot)$ denotes the interior or boundary prediction maps estimated by the respective network.
B. Pancreas Localization
Segmentation performance can be enhanced if irrelevant regions of the CT volume are pruned out. Conventional organ localization methods using random forest regression, which we explain in Sec. II-B1, may not guarantee that the regressed organ bounding box contains the targeted organ with extremely high sensitivity at the pixel-level coverage. In Sec. II-B2 we outline a superpixel-based approach, built on hand-crafted and CNN features, that provides improved performance. While this is effective, its complexity motivates our development of a simpler and more accessible multi-view HNN fusion based procedure, explained in Sec. II-B3. The output of the localization method later feeds into a more detailed and accurate segmentation method combining multiple mid-level cues from HNNs, as illustrated in Fig. 1.
1) Regression Forest: In object localization by regression, the general idea is to predict an offset vector

$$\mathbf{v}(\mathbf{x}) = \hat{\mathbf{x}} - \mathbf{x}$$

for a given image patch $I(\mathbf{x})$ centered about a voxel $\mathbf{x}$. The predicted object position is then given as

$$\hat{\mathbf{x}} = \mathbf{x} + \mathbf{v}(\mathbf{x}).$$

This is repeated for many examples of image patches and then aggregated to produce a final predicted position. Aggregation can be done with non-maximum suppression on prediction voting maps, mean aggregation, cluster medoid aggregation, and the use of local appearance with discriminative models to accept or reject predictions. The pancreas can be localized by regression due to its location in the body in correlation with other anatomical structures. The objective is to predict bounding boxes

$$\mathbf{b} = (\mathbf{x}_c, \mathbf{x}_l, \mathbf{x}_u),$$

where $\mathbf{x}_c$ is the center of the pancreas and $\mathbf{x}_l$, $\mathbf{x}_u$ are the lower and upper corners of the pancreas bounding box, respectively. The addition of the extra three parameters follows from the observation that the center of the bounding box is not necessarily the center of the localized object. The pancreas Regression Forest predicts

$$\mathbf{v}(\mathbf{x}) = (\mathbf{x}_c - \mathbf{x},\ \mathbf{x}_l - \mathbf{x},\ \mathbf{x}_u - \mathbf{x})$$

for a given image patch $I(\mathbf{x})$, which produces pancreas bounding box candidates of the form

$$\hat{\mathbf{b}} = (\mathbf{x} + \mathbf{v}_c,\ \mathbf{x} + \mathbf{v}_l,\ \mathbf{x} + \mathbf{v}_u).$$

We additionally use a discriminative model to accept or reject predictions $\hat{\mathbf{b}}$.
Finally, accepted predictions are aggregated using non-maximum suppression over probability scores, and the bounding boxes are then ranked by the count of accepted predictions falling within each box. The box with the highest count of predictions is kept as the final prediction.

2) Random Forest on Superpixels: As a form of initialization, we alternatively employ a method based on random forest (RF) classification, using both hand-crafted and deep CNN-derived image features, to compute candidate bounding box regions. We only operate the RF labeling at a low probability threshold of >0.5, which is sufficient to reject the vast majority of non-pancreas tissue in the CT images. This initial candidate generation is sufficient to extract bounding box regions that nearly surround the pancreases completely in all patient cases, with ~97% recall. All candidate regions are computed during the testing phase of cross-validation (CV). As we will see next, candidate generation can be done even more efficiently by using the same HNN architectures, which are based on convolutional neural networks. Technical details of HNNs are described in Sec. II-A.
3) Multi-view Aggregated HNNs: As an alternative to the candidate region generation process described in Sec. II-B2, which uses hybrid deep and non-deep learning techniques, we employ HNN-I (interior; see Sec. II-A) as a building block for pancreas localization, motivated by the effectiveness of HNN in capturing the complex pancreas appearance in CT images. This enables us to drastically discard large negative volumes of the CT scan while operating HNN-I at a conservative probability threshold of >=0.5 that retains high sensitivity/recall (>99%). The constant balancing weight β used during HNN-I training is critical in this step, since the majority of CT slices contain no pancreas at all; these slices are nevertheless included in training so that the HNN-I models learn to suppress pancreas probability values in the background. Furthermore, we perform a largest connected-component analysis to remove outlier "blobs" of high probabilities. To remove small incorrect connections between high-probability blobs, we first perform an erosion step with a radius of 1 voxel, then select the largest connected component, and subsequently dilate the region again (Fig. 3). HNN-I models are trained on the axial, coronal, and sagittal planes in order to make use of the multi-view representation of 3D image context. Empirically, we found a max-pooling operation across the three models to give the highest sensitivity/recall while still being sufficient to reject the vast amount of non-pancreas tissue in the CT images (see Table II). One illustrative example is demonstrated in Fig. 4. This initial candidate generation is sufficient to extract bounding box regions that completely surround the pancreases with nearly 100% recall. All candidate regions are computed during the testing phase of cross-validation (CV) with the same split. Note that this candidate region proposal can be important for further processing: it removes "easy" non-pancreas tissue from further analysis and allows HNN-I and HNN-B to focus on the more difficult distinction of the pancreas versus its immediately surrounding tissue. The fact that we can use exactly the same HNN model architecture for both stages is noteworthy.
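The multi-view max-pooling and connected-component cleanup above amount to a few array operations once the three per-view probability volumes are resampled onto a common voxel grid. A minimal sketch under that assumption (the one-voxel structuring element mirrors the erosion/dilation described above):

```python
import numpy as np
from scipy import ndimage as ndi

def pancreas_candidate_mask(prob_ax, prob_co, prob_sa, threshold=0.5):
    """Multi-view max-pooling plus largest-component cleanup (Sec. II-B3).

    All inputs are per-voxel HNN-I probabilities already resampled onto the
    same (Z, Y, X) grid as the CT volume.
    """
    # Element-wise max across axial/coronal/sagittal models keeps recall high.
    fused = np.maximum(np.maximum(prob_ax, prob_co), prob_sa)
    mask = fused >= threshold

    # Erode by one voxel to break thin spurious connections between blobs.
    structure = ndi.generate_binary_structure(3, 1)
    eroded = ndi.binary_erosion(mask, structure=structure)

    labels, n = ndi.label(eroded)
    if n == 0:
        return eroded
    sizes = ndi.sum(eroded, labels, index=range(1, n + 1))
    largest = labels == 1 + int(np.argmax(sizes))

    # Dilate the surviving component back toward its original extent.
    return ndi.binary_dilation(largest, structure=structure)
```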
C. Pancreas Segmentation
With the pancreas localized, the next step is to produce a reliable segmentation. Our segmentation pipeline consists of three steps. We first use HNN probability maps to generate mid-level boundary and interior cues. These cues are then used to produce superpixels, which are in turn aggregated into a final segmentation using RF classification.
1) Combining Mid-level Cues via HNNs: We now show that organ segmentation can benefit from multiple mid-level cues, such as organ interior and boundary predictions. We investigate deep-learning based approaches to independently learn the pancreas' interior and boundary mid-level cues. Combining both cues via learned spatial aggregation can elevate the overall performance of this semantic segmentation system. Organ boundaries are a major mid-level cue for defining and delineating the anatomy of interest, and they can prove essential for accurate semantic segmentation of an organ.
2) Learning Organ-specific Segmentation Proposals: Multiscale combinatorial grouping (MCG) is one of the state-of-the-art methods for generating segmentation object proposals in computer vision. We utilize this approach to generate organ-specific superpixels based on the learned boundary prediction maps of HNN-B. Superpixels are extracted via a continuous oriented watershed transform at three different scales supervisedly learned by HNN-B. This allows the computation of a hierarchy of superpixel partitions at each scale, and merges superpixels across scales, thereby efficiently exploring their combinatorial space. This, in turn, allows MCG to group the merged superpixels toward object proposals. We find that the first two levels of MCG object proposals are sufficient to achieve ~88% DSC (see Table IV and Fig. 5), with the superpixel labels optimally computed using their spatial overlap ratios against the segmentation ground truth map. All merged superpixels S from the first two levels are used for the subsequent spatial aggregation step. Note that HNN-B can only be trained using axial slices, where the manual annotation was performed; pancreas boundary maps in coronal and sagittal views can display strong artifacts.
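Full MCG is a substantial pipeline; the following much-simplified sketch shows only the core idea of deriving boundary-preserving superpixel partitions from an HNN-B probability map via a watershed transform at several scales. The seed threshold and smoothing scales are illustrative assumptions, not values from this disclosure:

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def boundary_superpixels(hnn_b, sigmas=(1.0, 2.0, 4.0), seed_thresh=0.1):
    """Superpixel partitions at several scales from a 2D boundary map.

    hnn_b: 2D array of HNN-B boundary probabilities in [0, 1].
    Returns one label image per scale (coarser for larger sigma).
    """
    partitions = []
    for sigma in sigmas:
        surface = ndi.gaussian_filter(hnn_b, sigma)
        # Seeds: connected regions of low boundary probability.
        markers, _ = ndi.label(surface < seed_thresh)
        # Flood the boundary "topography" outward from the seeds.
        partitions.append(watershed(surface, markers))
    return partitions
```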
3) Spatial Aggregation with Random Forest: We use the superpixel set $S$ generated previously to extract features for spatial aggregation via random forest classification. Within any superpixel $s \in S$, we compute simple statistics, including the 1st-4th order moments and 8 percentiles, on the CT intensities and on a per-pixel element-wise pooling of the multi-view HNN-I maps and the HNN-B map. Additionally, we compute the mean x, y, and z coordinates normalized by the range of the 3D candidate region (Sec. II-B3). This results in 39 features describing each superpixel, which are used to train an RF classifier on the positive and negative training superpixels at each round of 4-fold CV. Empirically, we find 50 trees to be sufficient to model our feature set. A final 3D pancreas segmentation is simply obtained by stacking each slice prediction back into the original CT volume space. No further post-processing is employed. This complete pancreas segmentation model is denoted as HNN-RF.
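A hedged sketch of the per-superpixel feature extraction and RF classification described above; the exact composition of the 39 features is approximated here (12 statistics for each of three channels plus 3 normalized coordinates), the percentile levels are assumed, and scikit-learn stands in for whatever RF implementation was used:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier

PERCENTILES = (10, 20, 30, 40, 60, 70, 80, 90)   # 8 percentiles (assumed)

def channel_stats(values):
    """1st-4th order moments plus 8 percentiles of one value channel."""
    return [values.mean(), values.var(), skew(values), kurtosis(values),
            *np.percentile(values, PERCENTILES)]

def superpixel_features(labels, ct, pooled_hnn_i, hnn_b, bbox_shape):
    """12 stats x 3 channels + 3 normalized mean coordinates = 39 features."""
    ids, feats = np.unique(labels), []
    for sp in ids:
        m = labels == sp
        coords = np.argwhere(m).mean(axis=0) / np.asarray(bbox_shape)
        feats.append(channel_stats(ct[m]) +
                     channel_stats(pooled_hnn_i[m]) +
                     channel_stats(hnn_b[m]) +
                     coords.tolist())
    return ids, np.asarray(feats)

# 50 trees suffice for this feature set per the text; training labels come
# from each superpixel's overlap with the ground-truth pancreas mask.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
```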
III. Experimental Results
A. Data
Manual tracings of the pancreas for 82 contrast-enhanced abdominal CT volumes are provided by a publicly available dataset, for ease of comparison. Our experiments are conducted on random splits of ~60 patients for training and ~20 for unseen testing, in 4-fold cross-validation throughout this section, unless otherwise mentioned.
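For illustration, the 4-fold patient-level split can be reproduced with a standard cross-validation utility (assuming a simple random partition; any stratification in the original splits is not stated):

```python
import numpy as np
from sklearn.model_selection import KFold

patient_ids = np.arange(82)
kfold = KFold(n_splits=4, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(patient_ids)):
    # Roughly 61-62 patients for training and 20-21 for unseen testing per fold.
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}")
```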
B. Evaluation
We perform extensive quantitative evaluation on different configurations of our method and compare to the previous state-of-the-art work with in-depth analysis.
1) Localization: From our empirical study, the candidate region bounding box generation based on multi-view max-pooled HNN-Is (Sec. II-B3) and the previous hybrid method (Sec. II-B2) work comparably in terms of producing spatially-truncated 3D regions that maximally cover the pancreas at the pixel level while rejecting as much background space as possible. An average reduction in absolute volume of 90.36% (range [80.45%-96.26%]) between the CT scan and the candidate bounding box is achieved during this step, while keeping a mean recall of 99.93% (range [94.54%-100.00%]). Table I shows the test performance of pancreas localization and bounding box prediction using regression forests, in DSC and average Euclidean distance against the gold standard bounding boxes. As illustrated in Fig. 7, regression forest based localization generates 16 out of 82 bounding boxes that lie below 60% in pixel-to-pixel recall against the ground-truth pancreas masks. Nevertheless, we obtain nearly 100% recall for all scans (except for two cases, each >94.54%) through the multi-view max-pooled HNN-Is. An example of a detected pancreas can be seen in Fig. 6.
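Both localization metrics quoted above reduce to simple voxel counts over the ground-truth mask and the candidate box; a minimal sketch with illustrative names:

```python
import numpy as np

def localization_metrics(gt_mask, bbox):
    """Pixel-level recall of a candidate box and the volume reduction it buys.

    gt_mask: boolean (Z, Y, X) ground-truth pancreas mask.
    bbox:    tuple of three slices defining the candidate region.
    """
    box_mask = np.zeros(gt_mask.shape, dtype=bool)
    box_mask[bbox] = True
    recall = (gt_mask & box_mask).sum() / gt_mask.sum()
    volume_reduction = 1.0 - box_mask.sum() / box_mask.size
    return recall, volume_reduction
```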
TABLE I: Test performance of pancreas localization and bounding box prediction using regression forests in Dice and average Euclidean distance against the gold standard bounding boxes, in 4-fold cross validation.
TABLE II: Four-fold cross-validation: DSC [%] pancreas segmentation performance of various spatial aggregation functions on AX, CO, and SA viewed HNN-I probability maps in the candidate region generation stage (best results in bold).
TABLE III: Four-fold cross-validation: DSC [%] pancreas segmentation performance of various spatial aggregation functions on AX, CO, and SA viewed HNN-I probability maps in the second cascaded stage (the best results in bold).
2) HNN Spatial Aggregation for Pancreas Segmentation: The interior HNN models trained on the axial (AX), coronal (CO) or sagittal (SA) CT images in Sec. II-B3 can be straightforwardly used to generate pancreas segmentation masks. We exploit different spatial aggregation or pooling functions on the AX, CO, and SA viewed HNN-I probability maps, denoted as AX, CO, SA (any single-view HNN-I probability map used directly); mean(AX,CO), mean(AX,SA), mean(CO,SA) and mean(AX,CO,SA) (element-wise mean of two or three view HNN-I probability maps); max(AX,CO,SA) (element-wise maximum of the three view HNN-I probability maps); and finally meanmax(AX,CO,SA) (element-wise mean of the maximal two scores from the three view HNN-I probability maps). After optimal thresholding, calibrated using the training folds on these pooled HNN-I maps, the resulting binary segmentation masks are further refined by a 3D connected component process and simple morphological operations (as in Sec. II-B3). Table II demonstrates the DSC pancreas segmentation accuracy obtained by investigating different spatial aggregation functions. We observe that the element-wise multi-view (mean or max) pooling operations on HNN-I probability maps generally outperform their single-view counterparts. max(AX,CO,SA) performs slightly better than mean(AX,CO,SA). The meanmax(AX,CO,SA) configuration produces the best mean DSC; it may behave as a robust fusion function by rejecting the smallest probability value and averaging the remaining two HNN-I scores per pixel location. After the pancreas localization stage, we train a new set of multi-view HNN-Is at the spatially truncated scales and extents. This serves as a desirable "Zoom Better to See Clearer" effect for deep neural network segmentation models [46], where the cascaded HNN-Is focus only on discriminating or parsing the remaining organ candidate regions. Similarly, DSC [%] pancreas segmentation accuracy results of various spatial aggregation or pooling functions on AX, CO, and SA viewed HNN-I probability maps (trained in the second cascaded stage) are shown in Table III. We find consistent empirical observations when comparing multi-view HNN pooling operations. The meanmax(AX,CO,SA) operation again reports the best mean DSC performance, at 81.14%, which is increased considerably from 76.79% in Table II. We denote this system configuration as HNNmeanmax. This result validates our two-staged pancreas segmentation framework of candidate region generation for organ localization followed by "zoomed" deep HNN models to refine segmentation. Table IV shows the improvement from the meanmax-pooled HNN-Is (i.e., HNNmeanmax) to the HNN-RF based spatial aggregation, using DSC and average minimum surface-to-surface distance (AVGDIST). The average DSC is increased from 81.14% to 81.27%; however, this improvement is not statistically significant (p > 0.05, Wilcoxon signed rank test). In contrast, using dense CRF (DCRF) optimization (with HNN-I as the unary term and the pairwise term depending on the CT values) as a means of introducing spatial consistency does not improve upon HNN-I noticeably. Compared to the performance of other state-of-the-art methods, at mean DSC scores of 71.4% and 78.01% respectively, both variants HNNmeanmax and HNN-RF demonstrate superior quantitative segmentation accuracy in the DSC and AVGDIST metrics. We have the following two observations.
1) The main performance gain (similar to HNNAX in Table III) comes from the multi-view aggregated HNN pancreas segmentation probability maps (e.g., HNNmeanmax), which also serve as inputs to HNN-RF.
2) The new candidate region bounding box generation method (Sec. II-B3) works comparably to the hybrid technique (Sec. II-B2) in our empirical evaluation. However, the proposed pancreas localization via multi-view max-pooled HNNs greatly simplifies our overall pancreas segmentation system, which may also improve its generality and reproducibility. The HNNmeanmax variant produces competitive segmentation accuracy but merely involves evaluating two sets of multi-view HNN-Is at two spatial scales: whole CT slices and truncated bounding boxes. There is no need to compute any handcrafted image features or to train other external machine learning classifiers. As shown in Fig. 7, the conventional organ localization framework using regression forests does not serve well the purpose of candidate region generation for segmentation, where extremely high pixel-to-pixel recall is required, since it is mainly designed for organ detection. In Table V, the quantitative pancreas segmentation performance of the two method variants, HNNmeanmax and HNN-RF spatial aggregation, is evaluated using four metrics: DSC (%), Jaccard index (%), Hausdorff distance (HDRFDST [mm]), and AVGDIST [mm]. Note that there is no statistical significance between the two variants in three measures (DSC, Jaccard, and AVGDIST), but there is for HDRFDST, with p < 0.001 under the Wilcoxon signed rank test. Since the Hausdorff distance represents the maximum deviation between two point sets or surfaces, this observation indicates that HNN-RF may be more robust than HNNmeanmax in the worst-case scenario.
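The meanmax aggregation singled out in both observations is a one-line operation over the stacked per-view probabilities; a minimal sketch:

```python
import numpy as np

def meanmax(prob_ax, prob_co, prob_sa):
    """Element-wise mean of the two largest of the three per-voxel scores."""
    stacked = np.stack([prob_ax, prob_co, prob_sa], axis=0)
    # Sorting ascending along the view axis and dropping the minimum keeps
    # the two largest scores per voxel; averaging them rejects the outlier.
    return np.sort(stacked, axis=0)[1:].mean(axis=0)
```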
Pancreas segmentation results on illustrative patient cases are shown in Fig. 9. Furthermore, we applied our trained HNN-I model to a different CT dataset of 30 patients and achieved a mean DSC of 62.26% without any re-training on the new data cases; if we instead average the outputs of our 4 HNN-I models from cross-validation, we achieve 65.66% DSC. This demonstrates that HNN-I may be highly generalizable in cross-dataset evaluation. Performance on that dataset will likely improve with further fine-tuning. Last, we collected an additional dataset of 19 unseen CT scans using the same patient data protocol. Here, HNNmeanmax achieves a mean DSC of 81.2%.
IV. Discussion and Conclusion
To the best of our knowledge, our result comprises the highest reported average DSC in testing folds under the 4-fold CV evaluation metric. Strict comparison to most other methods is not directly possible due to the different datasets utilized. Our holistic segmentation approach with multi-view pooling and spatial aggregation advances the current state-of-the-art quantitative performance to an average DSC of 81.27% in testing. Previous notable results for CT images range from ~68% to ~78%, all under the "leave-one-patient-out" (LOO) cross-validation scheme. In particular, DSC drops from 68% (150 patients) to 58% (50 patients). Our methods also perform with better statistical stability, i.e., standard deviations of 7.30% or 6.27% in DSC versus 18.6% and 15.3%. The minimal DSC values are 44.69% with HNNmeanmax and 50.69% with HNN-RF, whereas others report patient cases with DSC <10%. Recent work that explores the direct application of 3D convolutional filters with fully convolutional architectures also shows promise. 2D or 3D implementations may each be more suited for certain tasks. Deep network representations with direct 3D input suffer from the curse-of-dimensionality and are more prone to overfitting. Volumetric object detection might require more training data and might suffer from scalability issues. However, proper hyper-parameter tuning of the CNN architecture and enough training data (including data augmentation) might help eliminate these problems. In the meantime, spatial aggregation in multiple 2D views (as disclosed herein) can be a very efficient (and computationally less expensive) way of diminishing the curse-of-dimensionality. Furthermore, using 2D views has the advantage that networks trained on much larger databases of natural images (e.g., ImageNet, BSDS500) can be used for fine-tuning to the medical domain. Transfer learning is a viable approach when the medical imaging data set size is limited. 3D CNN approaches can adopt padded spatially-local sliding volumes to parse any CT scan, e.g., 96x96x48, 160x160x72 or 80x80x80, which may cause segmentation discontinuities or inconsistencies at overlapped window boundaries. An ensemble of several neural networks trained with random configuration variations has been found advantageous compared to a single CNN model in object recognition. Our pancreas segmentation method can indeed be considered an ensemble of multiple correlated HNN models with good complementary information gain, since they are trained from orthogonal axial, coronal and sagittal CT views.
In conclusion, we present a holistic deep CNN approach for pancreas localization and segmentation in abdominal CT scans, exploiting multi-view spatial pooling and combining interior and boundary mid-level cues. The robust fusion of HNNmeanmax, aggregating interior holistically-nested networks (HNN-I) alone, already achieves good performance at a DSC of 81.14%±7.30% in 4-fold CV. The other method variant, HNN-RF, incorporates the organ boundary responses from the HNN-B model and significantly improves the worst-case pancreas segmentation accuracy in Hausdorff distance (p<0.001). The highest reported DSC of 81.27%±6.27% is achieved, at a computational cost of 2-3 minutes, not hours. Our deep learning based organ segmentation approach could be generalizable to other segmentation problems with large variations and pathologies, e.g., pathological organs and tumors.
TABLE IV: Four-fold cross-validation: The DSC [%] and average surface-to-surface minimum distance (AVGDIST [mm]) performance of HNNmeanmax, HNN-RF spatial aggregation, and the optimally achievable superpixel assignments (italic). Best performing method in bold.
TABLE V: Four-fold cross-validation: The quantitative pancreas segmentation performance of our two method variants, HNNmeanmax and HNN-RF spatial aggregation, in four metrics: DSC (%), Jaccard index (%), Hausdorff distance (HDRFDST [mm]), and AVGDIST [mm]. Best performing methods are shown in bold. Note that there is no statistical significance between the two variants in three measures (DSC, Jaccard, and AVGDIST), except for HDRFDST with p < 0.001 (Wilcoxon signed rank test). This indicates that HNN-RF may be more robust than HNNmeanmax in the worst-case scenario.

Exemplary benefits and advantages of the disclosed technology can include:
1) A unified deep convolutional neural network framework for fully-automated localization and segmentation of highly-deformable or variable organs given an input CT volume.
2) The use of simple, effective, and novel multi-view 2D holistically-nested neural networks to aggregate a reliable 3D organ segmentation confidence map.
3) Aggregation of random forest and image region segmentation information, and integration of their decisions.
4) Significantly better quantitative performance for the pancreas (one of the most difficult organs to segment) from CT imaging, compared to all known state-of-the-art approaches, with strong clinical indications and impacts.
V. Applications for Lymph Nodes and Other Organs
Lymph node segmentation is also an important challenge in medical image analysis. The presence of enlarged lymph nodes (LNs) signals the onset or progression of a malignant disease or infection. In the thoracoabdominal (TA) body region, neighboring enlarged LNs often spatially collapse into swollen lymph node clusters (LNCs) (up to 9 LNs in our data sets). Accurate segmentation of TA LNCs is complicated by the noticeably poor intensity and texture contrast among neighboring LNs and surrounding tissues, and has not been addressed adequately before.
LN segmentation and volume measurement play a crucial role in important medical imaging based diagnosis tasks, such as quantitatively evaluating disease progression and the effectiveness of a given treatment or therapy. Enlarged LNs of greater than 10 mm on a CT slice signal the onset or progression of a malignant disease or an infection. Often performed manually, LN segmentation is highly complex, tedious and time consuming. For example, weak intensity contrast renders the boundaries of distinct agglomerated LNs ambiguous, as shown in FIG. 10.
The methods disclosed herein provide a fully-automated approach for TA LNC segmentation. More importantly, the segmentation task is formulated as a flexible, bottom-up image binary classification problem that can be effectively solved using deep CNNs and graph-based structured optimization and inference. The disclosed methods can handle all variations in LNC size and spatial configuration. Furthermore, they are well-suited for measuring agglomerated LNs, whose ambiguous boundaries compromise the accuracy of diameter measurement.
The segmentation framework for lymph nodes can be similar to what is described elsewhere herein for the pancreas and other organs. More information regarding applications to lymph node segmentation can be found in U.S. Provisional Patent Application No. 62/345,606, filed June 3, 2016, which is incorporated by reference herein.
FIG. 11 illustrates the disclosed frameworks integrating trained holistically-nested neural networks that capture the interior appearance and boundary cues of the organ to be segmented, via structured optimization (a) or superpixel-based spatial aggregation (b). In (a), three different grid-structured representation and optimization methods are used and evaluated, namely dense CRF, graph cuts (GC), and boundary neural fields (BNF). For dense CRF, the pairwise energy terms are not learned but directly computed from the CT intensity contrast and pixel distance measurements. In both GC and BNF, the learned outputs of HNN-B are incorporated into the pairwise interactions between pixels within a large spatial neighborhood, e.g., 20x20. In (b), the spatial aggregation is performed on the enforced boundary-preserving superpixels computed using multiscale boundary maps from HNN-B.
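As a hedged illustration of how such structured optimization is typically posed (the specific potentials below are conventional choices and are not quoted from this disclosure), the binary label map $\mathbf{y}$ can be obtained by minimizing an energy of the form

$$E(\mathbf{y}) = \sum_{i} \theta_i(y_i) + \lambda \sum_{(i,j) \in \mathcal{N}} \theta_{ij}(y_i, y_j),$$

where the unary term can be taken from the HNN interior/appearance probability, e.g., $\theta_i(y_i) = -\log \Pr(y_i \mid X)$, and where GC and BNF can discourage label changes except across strong learned boundaries, e.g., $\theta_{ij}(y_i, y_j) = \exp(-\gamma\, b_{ij})\, \mathbb{1}[y_i \neq y_j]$, with $b_{ij}$ the HNN-B boundary response between pixels $i$ and $j$ in the (e.g., 20x20) neighborhood $\mathcal{N}$.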
FIG. 12 illustrates, at left, a cropped CT image with a lymph node located in the center, together with its associated HNN-A map (middle) and its HNN-B map before non-maximum suppression (right).
FIG. 13 shows examples of LN CT image segmentation. Top: original CT images with ground truth (red) and BNF-segmented (green) boundaries. Bottom: HNN-A LN probability maps. The left and middle examples depict successful segmentation results; the right example represents an unsuccessful case.
FIG. 14 shows three graphs comparing ground truth and predicted LN volumes, for (a) HNN-A, (b) boundary neural fields, and (c) graph cuts. FIG. 15 shows examples of lymph node thoracoabdominal CT image segmentation. Top: original CT images with ground truth (red) and BNF-segmented (green) boundaries. Bottom: HNN-A lymph node probability maps (red: probability 1; blue: probability 0).

A computer or other processing system comprising a processor and memory, such as a personal computer, a workstation, a mobile computing device, or a networked computer, can be used to perform the methods disclosed herein, including any combination of CT or MR imaging acquisition, image processing, imaging data analysis, data storage, and output/display of results (e.g., segmentation maps). The computer or processing system may include a hard disk, a removable storage medium such as a floppy disk or CD-ROM, and/or other memory such as random access memory (RAM). Computer-executable instructions for causing a computing system to execute the disclosed methods can be provided on any form of tangible and/or non-transitory data storage media, and/or delivered to the computing system via a local area network, the Internet, or other network. Any associated computing process or method step can be performed with distributed processing. For example, extracting information from the imaging data and producing segmentation maps can be performed at different locations and/or using different computing systems.
For purposes of this description, certain aspects, advantages, and novel features of the embodiments of this disclosure are described herein. The disclosed methods, apparatuses, and systems should not be construed as limiting in any way. Instead, the present disclosure is directed toward all novel and nonobvious features and aspects of the various disclosed embodiments, alone and in various combinations and sub-combinations with one another. The methods, apparatuses, and systems are not limited to any specific aspect or feature or combination thereof, nor do the disclosed embodiments require that any one or more specific advantages be present or problems be solved.
Characteristics and features described in conjunction with a particular aspect, embodiment, or example of the disclosed technology are to be understood to be applicable to any other aspect, embodiment or example described herein unless incompatible therewith. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive. The invention is not restricted to the details of any embodiments disclosed in this application. The invention extends to any novel one, or any novel combination, of the features disclosed in this application, or to any novel one, or any novel combination, of the steps of any method or process so disclosed.
Although the operations of some of the disclosed methods are described in a particular, sequential order for convenient presentation, it should be understood that this manner of description encompasses rearrangement, unless a particular ordering is required by specific language. For example, operations described sequentially may in some cases be rearranged or performed concurrently. Moreover, for the sake of simplicity, the figures of this application may not show the various ways in which the disclosed methods can be used in conjunction with other methods.
In view of the many possible embodiments to which the principles of the disclosed technology may be applied, it should be recognized that the illustrated embodiments are only examples and should not be taken as limiting the scope of the disclosure. Rather, the scope of the disclosure is at least as broad as the following claims. We therefore claim all that comes within the scope of the following claims.

CLAIMS:
1. A method for localization and segmentation of organs based on data from 3D medical imaging, the method comprising:
receiving 3D imaging data for a patient including a target organ, the 3D imaging data including three orthogonal axial, sagittal, and coronal views;
localizing the target organ from 3D imaging data;
applying holistically-nested convolutional neural networks ("HNNs") on the three orthogonal axial, sagittal, and coronal views to produce per-pixel probability maps;
fusing the probability maps using pooling to produce a 3D bounding box of the target organ;
integrating semantic mid-level cues of deeply-learned organ interior and boundary maps, obtained by two additional and separate realizations of HNNs; and
based on the integration of the mid-level cues, generating boundary-preserving pixel-wise class label maps that result in final segmentations of the target organ.
2. The method of claim 1, wherein the target organ is a pancreas.
3. The method of claim 1, wherein the target organ is a lymph node.
4. The method of claim 1, wherein the target organ comprises a cluster of lymph nodes.
5. The method of any one of claims 1-4, wherein the 3D imaging data is derived from one or more computerized tomography scans.
6. The method of any one of claims 1-5, wherein the method comprises utilizing multiscale combinatorial grouping (MCG) to generate target organ- specific superpixels based on learned boundary maps.
7. The method of any one of claims 1-6, wherein superpixels are extracted via continuous oriented watershed transform at three different scales supervisedly learned by HNN boundary.
8. The method of claim 7, further comprising computation of a hierarchy of superpixel partitions at each scale, and merger of superpixels across scales.
9. The method of claim 8, further comprising grouping merged superpixels into superpixel sets and using the superpixel sets for a subsequent spatial aggregation via random forest classification.
10. The method of claim 9, further comprising using superpixel sets to generate features describing each superpixel and using said features to train a random forest classifier on positive or negative training superpixels.
11. The method of claim 10, further comprising obtaining a final 3D organ segmentation by stacking slice predictions back into an original CT volume space.
12. A computing system comprising a processor and memory, the system operable to implement the method of any one of claims 1-11.
13. A system comprising:
a 3D imaging system operable to receive 3D imaging data for a patient including a target organ, the 3D imaging data including three orthogonal axial, sagittal, and coronal views; and
a computing system comprising a processor, memory, and software, the computing system operable to:
localize the target organ from 3D imaging data;
apply holistically-nested convolutional neural networks ("HNNs") on the three orthogonal axial, sagittal, and coronal views to produce per-pixel probability maps;
fuse the probability maps using pooling to produce a 3D bounding box of the target organ;
integrate semantic mid-level cues of deeply-learned organ interior and boundary maps, obtained by two additional and separate realizations of HNNs; and
based on the integration of the mid-level cues, generate boundary-preserving pixel-wise class label maps that result in final segmentations of the target organ.
14. The system of claim 13, wherein the target organ is a pancreas or a lymph node.
15. The system of claim 13 or claim 14, wherein the 3D imaging system comprises a computerized tomography system and the 3D imaging data is derived from one or more computerized tomography scans.
16. The system of any one of claims 13-15, wherein the computing system is operable to utilize multiscale combinatorial grouping to generate target organ-specific superpixels based on learned boundary maps.
17. The system of any one of claims 13-16, wherein the computing system is operable to extract superpixels via continuous oriented watershed transform at three different scales supervisedly learned by HNN boundary.
18. The system of claim 17, wherein the computing system is operable to compute a hierarchy of superpixel partitions at each scale, and merge superpixels across scales.
19. The system of claim 18, wherein the computing system is operable to group merged superpixels into superpixel sets and using the superpixel sets for a subsequent spatial aggregation via random forest classification.
20. The system of claim 19, wherein the computing system is operable to use superpixel sets to generate features describing each superpixel using said features to train a random forest classifier on training positive or negative superpixels.
21. The system of claim 20, wherein the computing system is operable to obtain a final 3D organ segmentation by stacking slice predictions back into an original CT volume space.
22. One or more non-transitory computer readable media storing computer-executable instructions, which when executed by a computer cause the computer to perform the method of any one of claims 1-11.
PCT/US2017/035974 2016-06-03 2017-06-05 Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans Ceased WO2017210690A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201662345606P 2016-06-03 2016-06-03
US62/345,606 2016-06-03
US201762450681P 2017-01-26 2017-01-26
US62/450,681 2017-01-26

Publications (1)

Publication Number Publication Date
WO2017210690A1 true WO2017210690A1 (en) 2017-12-07

Family

ID=60478011

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/035974 Ceased WO2017210690A1 (en) 2016-06-03 2017-06-05 Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans

Country Status (1)

Country Link
WO (1) WO2017210690A1 (en)

Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399422A (en) * 2018-02-01 2018-08-14 华南理工大学 A kind of image channel fusion method based on WGAN models
CN108549871A (en) * 2018-04-17 2018-09-18 北京华捷艾米科技有限公司 A kind of hand Segmentation method based on region growing and machine learning
CN108805036A (en) * 2018-05-22 2018-11-13 电子科技大学 A kind of new non-supervisory video semanteme extracting method
CN109416743A (en) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 A kind of Three dimensional convolution device artificially acted for identification
CN109461161A (en) * 2018-10-22 2019-03-12 北京连心医疗科技有限公司 A method of human organ in medical image is split based on neural network
CN109559295A (en) * 2018-06-04 2019-04-02 新影智能科技(昆山)有限公司 Image analysis system, method, computer readable storage medium and electric terminal
CN109598727A (en) * 2018-11-28 2019-04-09 北京工业大学 A kind of CT image pulmonary parenchyma three-dimensional semantic segmentation method based on deep neural network
CN109636808A (en) * 2018-11-27 2019-04-16 杭州健培科技有限公司 A kind of lobe of the lung dividing method based on full convolutional neural networks
CN109741347A (en) * 2018-12-30 2019-05-10 北京工业大学 An Image Segmentation Method Based on Iterative Learning of Convolutional Neural Networks
CN109948707A (en) * 2019-03-20 2019-06-28 腾讯科技(深圳)有限公司 Model training method, device, terminal and storage medium
CN109961059A (en) * 2019-04-10 2019-07-02 杭州智团信息技术有限公司 Detect the method and system in kidney tissue of interest region
CN110009599A (en) * 2019-02-01 2019-07-12 腾讯科技(深圳)有限公司 Liver masses detection method, device, equipment and storage medium
WO2019136922A1 (en) * 2018-01-12 2019-07-18 平安科技(深圳)有限公司 Pulmonary nodule detection method, application server, and computer-readable storage medium
CN110096961A (en) * 2019-04-04 2019-08-06 北京工业大学 A kind of indoor scene semanteme marking method of super-pixel rank
CN110246566A (en) * 2019-04-24 2019-09-17 中南大学湘雅二医院 Method, system and storage medium are determined based on the conduct disorder of convolutional neural networks
CN110287777A (en) * 2019-05-16 2019-09-27 西北大学 A Body Segmentation Algorithm for Golden Monkey in Natural Scenes
WO2019200753A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Lesion detection method, device, computer apparatus and storage medium
US10475182B1 (en) 2018-11-14 2019-11-12 Qure.Ai Technologies Private Limited Application of deep learning for medical imaging evaluation
CN110599498A (en) * 2018-10-19 2019-12-20 北京连心医疗科技有限公司 Method for segmenting human body organ in medical image based on neural network
CN110638477A (en) * 2018-06-26 2020-01-03 佳能医疗系统株式会社 Medical image diagnosis device and alignment method
CN110766691A (en) * 2019-12-06 2020-02-07 北京安德医智科技有限公司 Method and device for cardiac magnetic resonance image analysis and cardiomyopathy prediction
WO2020038974A1 (en) * 2018-08-21 2020-02-27 Koninklijke Philips N.V. Salient visual explanations of feature assessments by machine learning models
CN110889853A (en) * 2018-09-07 2020-03-17 天津大学 A Residual-Attention Deep Neural Network Based Tumor Segmentation Method
CN110889852A (en) * 2018-09-07 2020-03-17 天津大学 Liver segmentation method based on residual-attention deep neural network
WO2020099940A1 (en) * 2018-11-14 2020-05-22 Qure.Ai Technologies Private Limited Application of deep learning for medical imaging evaluation
TWI697010B (en) * 2018-12-28 2020-06-21 國立成功大學 Method of obtaining medical sagittal image, method of training neural network and computing device
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN111667459A (en) * 2020-04-30 2020-09-15 杭州深睿博联科技有限公司 Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
CN111899273A (en) * 2020-06-10 2020-11-06 上海联影智能医疗科技有限公司 Image segmentation method, computer device and storage medium
CN112258499A (en) * 2020-11-10 2021-01-22 北京深睿博联科技有限责任公司 Lymph node partition method, apparatus, device and computer readable storage medium
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
CN112598634A (en) * 2020-12-18 2021-04-02 燕山大学 CT image organ positioning method based on 3D CNN and iterative search
CN112634211A (en) * 2020-12-14 2021-04-09 上海健康医学院 MRI (magnetic resonance imaging) image segmentation method, device and equipment based on multiple neural networks
CN112700451A (en) * 2019-10-23 2021-04-23 通用电气精准医疗有限责任公司 Method, system and computer readable medium for automatic segmentation of 3D medical images
CN112800915A (en) * 2021-01-20 2021-05-14 北京百度网讯科技有限公司 Building change detection method, building change detection device, electronic device, and storage medium
WO2021093435A1 (en) * 2019-11-12 2021-05-20 腾讯科技(深圳)有限公司 Semantic segmentation network structure generation method and apparatus, device, and storage medium
CN113012178A (en) * 2021-05-07 2021-06-22 西安智诊智能科技有限公司 Kidney tumor image segmentation method
CN113160229A (en) * 2021-03-15 2021-07-23 西北大学 Pancreas segmentation method and device based on hierarchical supervision cascade pyramid network
US11127137B2 (en) * 2017-04-12 2021-09-21 Kheiron Medical Technologies Ltd Malignancy assessment for tumors
EP3889888A1 (en) * 2019-10-23 2021-10-06 GE Precision Healthcare LLC Method, system and computer readable medium for automatic segmentation of a 3d medical image
WO2021198117A1 (en) * 2020-03-30 2021-10-07 Varian Medical Systems International Ag Automatically-planned radiation-based treatment
CN113506310A (en) * 2021-07-16 2021-10-15 首都医科大学附属北京天坛医院 Medical image processing method and device, electronic equipment and storage medium
CN113610739A (en) * 2021-08-10 2021-11-05 平安国际智慧城市科技股份有限公司 Image data enhancement method, device, equipment and storage medium
US20210365717A1 (en) * 2019-04-22 2021-11-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
CN113706469A (en) * 2021-07-29 2021-11-26 天津中科智能识别产业技术研究院有限公司 Iris automatic segmentation method and system based on multi-model voting mechanism
US11308623B2 (en) 2019-07-09 2022-04-19 The Johns Hopkins University System and method for multi-scale coarse-to-fine segmentation of images to detect pancreatic ductal adenocarcinoma
CN114387282A (en) * 2021-12-08 2022-04-22 罗雄彪 Accurate automatic segmentation method and system for medical image organs
US11406844B2 (en) 2020-03-30 2022-08-09 Varian Medical Systems International Ag Method and apparatus to derive and utilize virtual volumetric structures for predicting potential collisions when administering therapeutic radiation
US11416772B2 (en) 2019-12-02 2022-08-16 International Business Machines Corporation Integrated bottom-up segmentation for semi-supervised image segmentation
US11430176B2 (en) 2020-05-20 2022-08-30 International Business Machines Corporation Generating volume predictions of three-dimensional volumes using slice features
US11461998B2 (en) 2019-09-25 2022-10-04 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation
US11478210B2 (en) 2020-03-30 2022-10-25 Varian Medical Systems International Ag Automatically-registered patient fixation device images
US11488306B2 (en) 2018-06-14 2022-11-01 Kheiron Medical Technologies Ltd Immediate workup
US20220414453A1 (en) * 2021-06-28 2022-12-29 X Development Llc Data augmentation using brain emulation neural networks
US11593943B2 (en) 2017-04-11 2023-02-28 Kheiron Medical Technologies Ltd RECIST assessment of tumour progression
CN116188474A (en) * 2023-05-05 2023-05-30 四川省肿瘤医院 Three-level Lymphatic Structure Recognition Method and System Based on Image Semantic Segmentation
CN116645336A (en) * 2023-05-10 2023-08-25 烟台大学 MRI brain image gland pituitary segmentation method
CN117078760A (en) * 2023-09-18 2023-11-17 北方民族大学 Valve body center positioning method based on image processing
CN118397283A (en) * 2024-07-01 2024-07-26 山东大学 Gastric atrophy area segmentation system, electronic equipment and readable storage medium
CN118537351A (en) * 2024-05-22 2024-08-23 山东大学 Steel structure corrosion identification method and system based on active migration learning
US12141694B2 (en) 2019-04-23 2024-11-12 The Johns Hopkins University Abdominal multi-organ segmentation with organ-attention networks
WO2025034588A1 (en) * 2023-08-10 2025-02-13 Intuitive Surgical Operations, Inc. Computer assisted stomach volume reduction procedures
US12327192B2 (en) 2019-11-11 2025-06-10 The Johns Hopkins University Early detection of pancreatic neoplasms using cascaded machine learning models
CN120411060A (en) * 2025-05-06 2025-08-01 徐州绪权印刷有限公司 A printing defect detection method and system based on computer vision

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120207359A1 (en) * 2011-02-11 2012-08-16 Microsoft Corporation Image Registration

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120207359A1 (en) * 2011-02-11 2012-08-16 Microsoft Corporation Image Registration

Cited By (99)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593943B2 (en) 2017-04-11 2023-02-28 Kheiron Medical Technologies Ltd RECIST assessment of tumour progression
US11423541B2 (en) 2017-04-12 2022-08-23 Kheiron Medical Technologies Ltd Assessment of density in mammography
US11423540B2 (en) 2017-04-12 2022-08-23 Kheiron Medical Technologies Ltd Segmentation of anatomical regions and lesions
US11127137B2 (en) * 2017-04-12 2021-09-21 Kheiron Medical Technologies Ltd Malignancy assessment for tumors
WO2019136922A1 (en) * 2018-01-12 2019-07-18 平安科技(深圳)有限公司 Pulmonary nodule detection method, application server, and computer-readable storage medium
CN109416743A (en) * 2018-01-15 2019-03-01 深圳鲲云信息科技有限公司 A kind of Three dimensional convolution device artificially acted for identification
CN109416743B (en) * 2018-01-15 2022-05-24 深圳鲲云信息科技有限公司 Three-dimensional convolution device for identifying human actions
CN108399422A (en) * 2018-02-01 2018-08-14 华南理工大学 A kind of image channel fusion method based on WGAN models
CN108549871A (en) * 2018-04-17 2018-09-18 北京华捷艾米科技有限公司 A kind of hand Segmentation method based on region growing and machine learning
WO2019200753A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Lesion detection method, device, computer apparatus and storage medium
CN108805036A (en) * 2018-05-22 2018-11-13 电子科技大学 A kind of new non-supervisory video semanteme extracting method
CN109559295A (en) * 2018-06-04 2019-04-02 新影智能科技(昆山)有限公司 Image analysis system, method, computer readable storage medium and electric terminal
US11488306B2 (en) 2018-06-14 2022-11-01 Kheiron Medical Technologies Ltd Immediate workup
CN110638477A (en) * 2018-06-26 2020-01-03 佳能医疗系统株式会社 Medical image diagnosis device and alignment method
CN110638477B (en) * 2018-06-26 2023-08-11 佳能医疗系统株式会社 Medical image diagnostic device and alignment method
WO2020038974A1 (en) * 2018-08-21 2020-02-27 Koninklijke Philips N.V. Salient visual explanations of feature assessments by machine learning models
US12062429B2 (en) 2018-08-21 2024-08-13 Koninklijke Philips N.V. Salient visual explanations of feature assessments by machine learning models
CN110889852B (en) * 2018-09-07 2022-05-06 天津大学 Liver segmentation method based on residual error-attention deep neural network
CN110889853A (en) * 2018-09-07 2020-03-17 天津大学 A Residual-Attention Deep Neural Network Based Tumor Segmentation Method
CN110889852A (en) * 2018-09-07 2020-03-17 天津大学 Liver segmentation method based on residual-attention deep neural network
CN110599498A (en) * 2018-10-19 2019-12-20 北京连心医疗科技有限公司 Method for segmenting human body organ in medical image based on neural network
CN110599498B (en) * 2018-10-19 2023-05-05 北京连心医疗科技有限公司 Method for dividing human body organ in medical image based on neural network
CN109461161A (en) * 2018-10-22 2019-03-12 北京连心医疗科技有限公司 A method of human organ in medical image is split based on neural network
JP7222882B2 (en) 2018-11-14 2023-02-15 キュア.エーアイ テクノロジーズ プライベート リミテッド Application of deep learning for medical image evaluation
WO2020099940A1 (en) * 2018-11-14 2020-05-22 Qure.Ai Technologies Private Limited Application of deep learning for medical imaging evaluation
US10504227B1 (en) 2018-11-14 2019-12-10 Qure.Ai Technologies Private Limited Application of deep learning for medical imaging evaluation
JP2021509977A (en) * 2018-11-14 2021-04-08 キュア.エーアイ テクノロジーズ プライベート リミテッド Application of deep learning for medical image evaluation
US10475182B1 (en) 2018-11-14 2019-11-12 Qure.Ai Technologies Private Limited Application of deep learning for medical imaging evaluation
CN109636808B (en) * 2018-11-27 2022-08-12 杭州健培科技有限公司 Lung lobe segmentation method based on full convolution neural network
CN109636808A (en) * 2018-11-27 2019-04-16 杭州健培科技有限公司 A kind of lobe of the lung dividing method based on full convolutional neural networks
CN109598727A (en) * 2018-11-28 2019-04-09 北京工业大学 A kind of CT image pulmonary parenchyma three-dimensional semantic segmentation method based on deep neural network
CN109598727B (en) * 2018-11-28 2021-09-14 北京工业大学 CT image lung parenchyma three-dimensional semantic segmentation method based on deep neural network
US10963757B2 (en) 2018-12-14 2021-03-30 Industrial Technology Research Institute Neural network model fusion method and electronic device using the same
TWI697010B (en) * 2018-12-28 2020-06-21 國立成功大學 Method of obtaining medical sagittal image, method of training neural network and computing device
CN109741347A (en) * 2018-12-30 2019-05-10 北京工业大学 An Image Segmentation Method Based on Iterative Learning of Convolutional Neural Networks
CN109741347B (en) * 2018-12-30 2021-03-16 北京工业大学 An Image Segmentation Method Based on Iterative Learning of Convolutional Neural Networks
WO2020156303A1 (en) * 2019-01-30 2020-08-06 广州市百果园信息技术有限公司 Method and apparatus for training semantic segmentation network, image processing method and apparatus based on semantic segmentation network, and device and storage medium
CN111507343B (en) * 2019-01-30 2021-05-18 广州市百果园信息技术有限公司 Training of semantic segmentation network and image processing method and device thereof
CN111507343A (en) * 2019-01-30 2020-08-07 广州市百果园信息技术有限公司 Training of semantic segmentation network and image processing method and device thereof
CN110009599A (en) * 2019-02-01 2019-07-12 腾讯科技(深圳)有限公司 Liver masses detection method, device, equipment and storage medium
CN109948707B (en) * 2019-03-20 2023-04-18 腾讯科技(深圳)有限公司 Model training method, device, terminal and storage medium
CN109948707A (en) * 2019-03-20 2019-06-28 腾讯科技(深圳)有限公司 Model training method, device, terminal and storage medium
CN110096961A (en) * 2019-04-04 2019-08-06 北京工业大学 A kind of indoor scene semanteme marking method of super-pixel rank
CN110096961B (en) * 2019-04-04 2021-03-02 北京工业大学 A Superpixel-Level Semantic Annotation Method for Indoor Scenes
CN109961059A (en) * 2019-04-10 2019-07-02 杭州智团信息技术有限公司 Detect the method and system in kidney tissue of interest region
US20210365717A1 (en) * 2019-04-22 2021-11-25 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
US11887311B2 (en) * 2019-04-22 2024-01-30 Tencent Technology (Shenzhen) Company Limited Method and apparatus for segmenting a medical image, and storage medium
US12141694B2 (en) 2019-04-23 2024-11-12 The Johns Hopkins University Abdominal multi-organ segmentation with organ-attention networks
CN110246566A (en) * 2019-04-24 2019-09-17 中南大学湘雅二医院 Method, system and storage medium are determined based on the conduct disorder of convolutional neural networks
CN110287777A (en) * 2019-05-16 2019-09-27 西北大学 A Body Segmentation Algorithm for Golden Monkey in Natural Scenes
US12125211B2 (en) 2019-07-09 2024-10-22 The Johns Hopkins University System and method for multi-scale coarse-to-fine segmentation of images to detect pancreatic ductal adenocarcinoma
US11308623B2 (en) 2019-07-09 2022-04-19 The Johns Hopkins University System and method for multi-scale coarse-to-fine segmentation of images to detect pancreatic ductal adenocarcinoma
US11461998B2 (en) 2019-09-25 2022-10-04 Samsung Electronics Co., Ltd. System and method for boundary aware semantic segmentation
EP3889888A1 (en) * 2019-10-23 2021-10-06 GE Precision Healthcare LLC Method, system and computer readable medium for automatic segmentation of a 3d medical image
CN112700451A (en) * 2019-10-23 2021-04-23 通用电气精准医疗有限责任公司 Method, system and computer readable medium for automatic segmentation of 3D medical images
EP3893200A1 (en) * 2019-10-23 2021-10-13 GE Precision Healthcare LLC Method, system and computer readable medium for automatic segmentation of a 3d medical image
US20210125707A1 (en) * 2019-10-23 2021-04-29 GE Precision Healthcare LLC Method, system and computer readable medium for automatic segmentation of a 3d medical image
US11581087B2 (en) 2019-10-23 2023-02-14 GE Precision Healthcare LLC Method, system and computer readable medium for automatic segmentation of a 3D medical image
US12327192B2 (en) 2019-11-11 2025-06-10 The Johns Hopkins University Early detection of pancreatic neoplasms using cascaded machine learning models
US12130887B2 (en) 2019-11-12 2024-10-29 Tencent Technology (Shenzhen) Company Limited Semantic segmentation network structure generation method and apparatus, device, and storage medium
WO2021093435A1 (en) * 2019-11-12 2021-05-20 腾讯科技(深圳)有限公司 Semantic segmentation network structure generation method and apparatus, device, and storage medium
US11416772B2 (en) 2019-12-02 2022-08-16 International Business Machines Corporation Integrated bottom-up segmentation for semi-supervised image segmentation
CN110766691A (en) * 2019-12-06 2020-02-07 北京安德医智科技有限公司 Method and device for cardiac magnetic resonance image analysis and cardiomyopathy prediction
US11406844B2 (en) 2020-03-30 2022-08-09 Varian Medical Systems International Ag Method and apparatus to derive and utilize virtual volumetric structures for predicting potential collisions when administering therapeutic radiation
WO2021198117A1 (en) * 2020-03-30 2021-10-07 Varian Medical Systems International Ag Automatically-planned radiation-based treatment
US11478660B2 (en) 2020-03-30 2022-10-25 Varian Medical Systems International Ag Automatically-planned radiation-based treatment
US11478210B2 (en) 2020-03-30 2022-10-25 Varian Medical Systems International Ag Automatically-registered patient fixation device images
US11786204B2 (en) 2023-10-17 Siemens Healthineers International AG Automatically-registered patient fixation device images
CN111667459A (en) * 2020-04-30 2020-09-15 杭州深睿博联科技有限公司 Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
CN111667459B (en) * 2020-04-30 2023-08-29 杭州深睿博联科技有限公司 Medical sign detection method, system, terminal and storage medium based on 3D variable convolution and time sequence feature fusion
US11430176B2 (en) 2020-05-20 2022-08-30 International Business Machines Corporation Generating volume predictions of three-dimensional volumes using slice features
CN111899273A (en) * 2020-06-10 2020-11-06 上海联影智能医疗科技有限公司 Image segmentation method, computer device and storage medium
CN112270644B (en) * 2020-10-20 2024-05-28 饶金宝 Face super-resolution method based on spatial feature transformation and cross-scale feature integration
CN112270644A (en) * 2020-10-20 2021-01-26 西安工程大学 Face super-resolution method based on spatial feature transformation and cross-scale feature integration
CN112258499B (en) * 2020-11-10 2023-09-26 北京深睿博联科技有限责任公司 Lymph node partition method, apparatus, device, and computer-readable storage medium
CN112258499A (en) * 2020-11-10 2021-01-22 北京深睿博联科技有限责任公司 Lymph node partition method, apparatus, device and computer readable storage medium
WO2022127500A1 (en) * 2020-12-14 2022-06-23 上海健康医学院 Multiple neural networks-based mri image segmentation method and apparatus, and device
CN112634211A (en) * 2020-12-14 2021-04-09 上海健康医学院 MRI (magnetic resonance imaging) image segmentation method, device and equipment based on multiple neural networks
CN112598634A (en) * 2020-12-18 2021-04-02 燕山大学 CT image organ positioning method based on 3D CNN and iterative search
CN112598634B (en) * 2020-12-18 2022-11-25 燕山大学 CT image organ positioning method based on 3D CNN and iterative search
CN112800915B (en) * 2021-01-20 2023-06-27 北京百度网讯科技有限公司 Building change detection method, device, electronic equipment and storage medium
CN112800915A (en) * 2021-01-20 2021-05-14 北京百度网讯科技有限公司 Building change detection method, building change detection device, electronic device, and storage medium
CN113160229A (en) * 2021-03-15 2021-07-23 西北大学 Pancreas segmentation method and device based on hierarchical supervision cascade pyramid network
CN113012178A (en) * 2021-05-07 2021-06-22 西安智诊智能科技有限公司 Kidney tumor image segmentation method
US20220414453A1 (en) * 2021-06-28 2022-12-29 X Development Llc Data augmentation using brain emulation neural networks
CN113506310B (en) * 2021-07-16 2022-03-01 首都医科大学附属北京天坛医院 Medical image processing method, device, electronic device and storage medium
CN113506310A (en) * 2021-07-16 2021-10-15 首都医科大学附属北京天坛医院 Medical image processing method and device, electronic equipment and storage medium
CN113706469B (en) * 2021-07-29 2024-04-05 天津中科智能识别产业技术研究院有限公司 Automatic iris segmentation method and system based on a multi-model voting mechanism
CN113706469A (en) * 2021-07-29 2021-11-26 天津中科智能识别产业技术研究院有限公司 Automatic iris segmentation method and system based on a multi-model voting mechanism
CN113610739A (en) * 2021-08-10 2021-11-05 平安国际智慧城市科技股份有限公司 Image data augmentation method, device, equipment and storage medium
CN114387282A (en) * 2021-12-08 2022-04-22 罗雄彪 Accurate automatic segmentation method and system for medical image organs
CN116188474A (en) * 2023-05-05 2023-05-30 四川省肿瘤医院 Tertiary Lymphoid Structure Recognition Method and System Based on Image Semantic Segmentation
CN116645336B (en) * 2023-05-10 2024-05-07 烟台大学 MRI brain image pituitary gland segmentation method
CN116645336A (en) * 2023-05-10 2023-08-25 烟台大学 MRI brain image pituitary gland segmentation method
WO2025034588A1 (en) * 2023-08-10 2025-02-13 Intuitive Surgical Operations, Inc. Computer assisted stomach volume reduction procedures
CN117078760A (en) * 2023-09-18 2023-11-17 北方民族大学 Valve body center positioning method based on image processing
CN118537351A (en) * 2024-05-22 2024-08-23 山东大学 Steel structure corrosion identification method and system based on active transfer learning
CN118397283A (en) * 2024-07-01 2024-07-26 山东大学 Gastric atrophy area segmentation system, electronic equipment and readable storage medium
CN120411060A (en) * 2025-05-06 2025-08-01 徐州绪权印刷有限公司 A printing defect detection method and system based on computer vision

Similar Documents

Publication | Publication Date | Title
WO2017210690A1 (en) Spatial aggregation of holistically-nested convolutional neural networks for automated organ localization and segmentation in 3d medical scans
Wang et al. Medical image segmentation using deep learning: A survey
Tang et al. A two-stage approach for automatic liver segmentation with Faster R-CNN and DeepLab
Gecer et al. Detection and classification of cancer in whole slide breast histopathology images using deep convolutional networks
US12361543B2 (en) Automated detection of tumors based on image processing
William et al. A review of image analysis and machine learning techniques for automated cervical cancer screening from pap-smear images
Wan et al. Accurate segmentation of overlapping cells in cervical cytology with deep convolutional neural networks
Kar et al. A review on progress in semantic image segmentation and its application to medical images
US11972571B2 (en) Method for image segmentation, method for training image segmentation model
Xing et al. Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: a comprehensive review
Peng et al. A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging
Alzahrani et al. Biomedical image segmentation: a survey
CN106780518B (en) MR image three-dimensional interactive segmentation method based on an active contour model with random walk and graph cut
RU2654199C1 (en) Segmentation of human tissues in a computer image
CN106340021B (en) Blood vessel extraction method
WO2015130231A1 (en) Segmentation of cardiac magnetic resonance (cmr) images using a memory persistence approach
Dong et al. A left ventricular segmentation method on 3D echocardiography using deep learning and snake
Ammari et al. A review of approaches investigated for right ventricular segmentation using short‐axis cardiac MRI
Atehortúa et al. Automatic segmentation of right ventricle in cardiac cine MR images using a saliency analysis
Cerrolaza et al. Fetal skull segmentation in 3D ultrasound via structured geodesic random forest
Krasnobaev et al. An overview of techniques for cardiac left ventricle segmentation on short-axis MRI
Chatterjee et al. A survey on techniques used in medical imaging processing
Banerjee et al. A CADe system for gliomas in brain MRI using convolutional neural networks
Albukhnefis et al. Image Segmentation Techniques: An In-Depth Review and Analysis
Gao et al. Hybrid decision forests for prostate segmentation in multi-channel MR images

Legal Events

Code: 121
Description: EP: The EPO has been informed by WIPO that EP was designated in this application
Ref document number: 17807667
Country of ref document: EP
Kind code of ref document: A1

Code: NENP
Description: Non-entry into the national phase
Ref country code: DE

Code: 122
Description: EP: PCT application non-entry in the European phase
Ref document number: 17807667
Country of ref document: EP
Kind code of ref document: A1