WO2025166017A1

WO2025166017A1 - Interpretable sequential multiple-instance learning for medical imaging

Info

Publication number: WO2025166017A1
Application number: PCT/US2025/013804
Authority: WO
Inventors: Michael Lingzhi LI; Hsin-Hsiao Scott WANG; Xiaolong LUO
Original assignee: Boston Childrens Hospital; Harvard University
Current assignee: Boston Childrens Hospital; Harvard University
Priority date: 2024-01-31
Filing date: 2025-01-30
Publication date: 2025-08-07
Anticipated expiration: 2026-07-31

Abstract

A significant challenge in medical diagnostics is the often- sequential nature of various diagnostic modalities, for example, x-rays and ultrasounds. Systems and methods for using a machine learning model to analyze a sequence of scans to account for sequential information are provided herein. A system may receive a sequence of scans. The system may determine a subsequence of scans including a subset of scans from the sequence of scans. The system may determine an incremental prediction, using a machine learning model, based on the subsequence of scans. When the system determines all scans of the sequence are included in the subsequence, the system may determine an overall prediction based on one or more of the incremental predictions.

Description

INTERPRETABLE SEQUENTIAL MULTIPLE-INSTANCE LEARNING FOR MEDICAL IMAGING CROSS-REFERENCE TO RELATED APPLICATIONS This Application claims the benefit under 35 U.S.C. § 119(e) of U.S. Application Serial No.63/627,119, filed January 31, 2024, and entitled "INTERPRETABLE SEQUENTIAL MULTIPLE-INSTANCE LEARNING FOR MEDICAL IMAGING" which is hereby incorporated by reference herein in its entirety. This application also claims the benefit under 35 U.S.C. § 119(e) of U.S. Application Serial No.63/640,680, filed April 30, 2024, and entitled “INTERPRETABLE SEQUENTIAL MULTIPLE-INSTANCE LEARNING FOR MEDICAL IMAGING” which is hereby incorporated by reference herein in its entirety. BACKGROUND Various medical imaging modalities exist. The resulting images are often used in diagnosing medical conditions, monitoring performance of a medical procedure, and/or evaluating treatment of medical conditions. SUMMARY A significant challenge in medical diagnostics is the often-sequential nature of various diagnostic modalities, for example, x-rays and ultrasounds. Sequential multiple instance learning may provide a robust method for analyzing medical images to account for the sequential information in sequences of medical images. According to an aspect of the present application, a method of using multiple instance learning to analyze medical images is provided. The method may comprise: receiving a first subsequence of medical images representing a subsequence of a first sequence of medical images; determining, with a processor implementing a machine learning model, a first incremental prediction based on the first subsequence of medical images; receiving an additional medical image of the sequence of medical images such that that additional medical image and the first subsequence of medical images together form a second subsequence of medical images of the sequence of medical images; determining, with the processor implementing the machine learning model, a second incremental prediction based on the second subsequence of medical images; and determining a medical diagnosis based on the sequence of medical images. 12246005.5 According to an aspect of the present application, a computer readable storage medium storing processor executable instructions which, when executed, cause a processor to perform a method of using multiple instance learning to analyze medical images is provided. The processor executable instructions, when executed, may cause the processor to perform the method comprising: receiving a first subsequence of medical images representing a subsequence of a first sequence of medical images; determining, with a processor implementing a machine learning model, a first incremental prediction based on the first subsequence of medical images; receiving an additional medical image of the sequence of medical images such that that additional medical image and the first subsequence of medical images together form a second subsequence of medical images of the sequence of medical images; determining, with the processor implementing the machine learning model, a second incremental prediction based on the second subsequence of medical images; and determining a medical diagnosis based on the sequence of medical images. BRIEF DESCRIPTION OF DRAWINGS Various aspects and embodiments of the application will be described with reference to the following figures. It should be appreciated that the figures are not necessarily drawn to scale. Items appearing in multiple figures are indicated by the same reference number in all the figures in which they appear. FIG.1A shows incremental outputs of a machine learning model using the enhanced Multiple-Instance Learning (MIL) framework described herein for several sequences of scans, according to some embodiments. FIG.1B shows an example sequence of scans and incremental outputs using the enhanced MIL framework, according to some embodiments. FIG.1C shows an example method for analyzing a sequence of scans using an enhanced machine learning framework, according to some embodiments. FIG.2 shows an example machine learning model architecture to perform two stream embedding of a sequence of scans having sequential information, according to some embodiments. FIG.3A shows a sample uncertainty comparison between incremental predictions of multiple sequences of scans of a urinary tract dilation (UTD) dataset, according to some embodiments. 12246005.5 FIG.3B demonstrates the performance enhancement (Accuracy) of an example two stream model on various datasets after discarding a proportion of predictions with the highest uncertainty, according to some embodiments. FIG.4A displays the example model’s performance across a UTD dataset, according to some embodiments. FIG.4B displays the example model’s performance across the 2019 Radiological Society of North America (RSNA) challenge dataset, according to some embodiments. FIG.5 illustrates a computing system which may operate in connection with aspects of the present disclosure. DETAILED DESCRIPTION A significant challenge in medical diagnostics is the often-sequential nature of various diagnostic modalities, for example, x-rays and ultrasounds. These tend to produce extensive sequences of medical scans associated with a singular diagnostic label and the volume of these scans can differ substantially across patients and diagnostic modalities. One non-limiting example that illustrates the point comes from context of a pediatric urological ultrasound dataset, for which sequences of scans may range from 1 scan to hundreds of scans. Conventional machine learning models that are tailored for fixed input sizes are rendered ineffective for diagnostic analysis of these scans. Moreover, the ability of a doctor or other evaluator to reach a conclusion or diagnosis can benefit from the information conveyed by a sequence of images rather than based only on individual images. The changes and similarities between images constituting the sequence may convey information to the evaluator that is not reflected by considering only the individual image(s) themselves. Accordingly, the inventors have developed techniques and methods for performing analysis of medical scans using a machine learning model that accounts for the sequential nature of certain medical diagnostic modalities. In some aspects of the technology described herein, an enhanced machine learning framework to accommodate scenarios where the diagnostic modality possesses sequential information is provided. For example, the machine learning framework may accommodate diagnostic modalities that possess sequential information from successive scans taken over a period of time. In some examples, the machine learning framework may accommodate 12246005.5 diagnostic modalities that possess sequential information from scans taken over different regions of a patient or target, for example, scans taken at different depths of a region of interest in the patient. In some examples, the machine learning framework may account for sequential information from sequential scans by implementing an incremental training mechanism and/or an incremental prediction mechanism to capture the sequential relationship between scans. In addition to considering the sequential nature of certain imaging modalities, the systems and methods described herein may also account for the reverse-invariant nature of certain scans. The inventors have recognized and appreciated that at least some scan sequences may be reverse-invariant, meaning that a diagnosis reached from analyzing the sequence should be the same whether looking at a sequence of scans from scan 1 to scan N or from scan N to scan 1. Thus, according to at least some aspects described herein, a machine learning architecture accounts for the reverse-invariant characteristic of the sequential information. The architecture may be configured to process a first data stream analyzing a sequence of scans in a first direction and a second data stream analyzing the sequence of scans in a second direction. In some aspects described herein, a method of validating a machine learning model prediction accounting for sequential information is provided. The validation of a machine learning model may use a sequence-based uncertainty metric that accounts for the sequential predictions of a machine learning model. The aspects and embodiments described above, as well as additional aspects and embodiments, are described further below. These aspects and/or embodiments may be used individually, all together, or in any combination of two or more, as the application is not limited in this respect. MACHINE LEARNING FRAMEWORK As described above, many diagnostic modalities often produce a scan or set of scans that possess sequential information. The scan or set of scans can vary substantially by volume, and/or size of scans across patients and modalities. As such, conventional machine learning models tailored for fixed input sizes may be ineffective. Conventionally, a weakly supervised learning strategy for machine learning models known as Multiple Instance Learning (MIL) has been used to account for the large volume variations of diagnostic scanning between patients. MIL considers groups (sometimes called “bags”) of scans at a single time. The group of scans may include successive scans over time, or successive scans over different regions of a patient or 12246005.5 target, for example, at different depths. Further, in the case of a histopathology slide having high, gigapixel resolutions, the group of scans may include the histopathology slide divided into smaller tiles. However, the conventional MIL framework fails to account for sequential connections between scans within a sequence of scans. Instead, conventional MIL techniques typically rely on a single scan within a sequence of scans to make a prediction. The traditional MIL assumption categorizes problems into binary classifications with positive and negative groups of scans, and typically assumes an absence of interactions between scans in a group. For example, a sequence is labeled positive if any of the scans is positive, and labeled negative only if all the scans are negative. Accordingly, the inventors have developed systems and techniques for an enhanced MIL framework that accounts for sequential information. In some aspects of the present technology, a method is provided for training and using a machine learning model having an enhanced MIL framework. In some embodiments, a machine learning model may receive as input, a sequence of scans, the sequence having sequential information represented by the scans in the sequence. The sequential information may include the successive differences of one or more features of each scan between successive scans in the sequence, as represented by the image data in each scan and its position in the sequence. For example, the successive differences may show how a particular feature in the scans progresses over successive time periods of the scans, or over successive depths of the scans. The sequence of scans may include scans taken successively over time, over space (e.g., adjacent regions, different depths, etc.), and/or sub-scans divided from a large individual scan. The succession of scans in the sequence may be arbitrary. For example, in a sequence of scans taken successively over time, the times represented between scans may range from a few seconds to a few months. The sequence may represent a single diagnostic scan that iteratively scans a patient every few seconds or may represent multiple diagnostic scans taken successively over the course of a disease of a patient. The sequence of scans may include ultrasound scans, computerized tomography (CT) scans, magnetic resonance imaging (MRI) scans, or any other suitable imaging modality. During the training phase of the enhanced MIL framework, the sequence of scans may be labeled with a single diagnostic training label for the entire sequence of scans. In some embodiments, the machine learning model may be a deep learning model, a neural network, or any other suitable machine learning model for analyzing images and medical images. 12246005.5 In some embodiments, the sequence of scans may be incrementally used by the machine learning model to generate a prediction. For example, the machine learning model may first analyze a first subsequence of scans from the sequence of scans. The first subsequence may include only a first scan of the sequence of scans or may include multiple scans from the sequence of scans. The machine learning model may determine a first prediction based on the first subsequence of scans. The machine learning model may then analyze a second subsequence of scans from the sequence of scans. The second subsequence may include the first scan or multiple scans of the first subsequence of scans and may further include one or more additional scans from the sequence of scans. The machine learning model may determine a second prediction based on the second subsequence of scans. For example, the sequence of scans may include scans {x₁, ... , x_n}, and the machine learning model may determine predictions {p₁, ... , pn}. The first subsequence may include only {x1} or may include {x1… xi} and the second subsequence may include {x₁, x₂} or {x₁ … x_i+1}. For each subsequence i the machine learning _{model g(X; θ) : X → [0, 1] may determine prediction pi and incremental difference δi.} Based on the predictions for each subsequence, the machine learning model may further classify the sequence of scans as either positive (e.g., a patient has the disease in question) or negative (e.g., a patient does not have the disease in question), although positive and negative are used for example purposes only and any classification or regression indicating the overall prediction may be used. The determined incremental differences may indicate which subsequences, and in turn, which additional scans from the sequence of scans, contribute most to the overall prediction. For example, the incremental difference between a first subsequence and a second subsequence may indicate the impact of the additional scan(s) in the second subsequence versus the first subsequence. If the incremental difference is near 1 (or -1), the additional scan(s) of the second subsequence may have a greater impact on the overall prediction than the scans in the first subsequence. Alternatively, an incremental difference near 0 may indicate a stable prediction between subsequences and the additional scan(s) in the second subsequence may not have a large impact on the overall prediction. In that way, the machine learning model may determine the proper prediction as well as which scans in the sequence of scans contributes most to the prediction accounting for the sequential information of the sequence of scans. 12246005.5 FIG.1A shows incremental outputs of a machine learning model using the enhanced MIL framework for eight sequences of scans. Four example sequences, 110, 112, 114, and 116 result in a negative overall prediction, and four example sequences, 100, 102, 104, and 106, result in a positive overall prediction. Each sequence of the eight example sequences includes between four and sixteen scans. Each point on the graph shows the incremental output (e.g., incremental prediction) determined by a machine learning model for each subsequence of the sequence of scans. The x-axis depicts the number of scans within the subsequence (e.g., instance number 2 may be a subsequence including the first two scans from the sequence), and the y-axis depicts the output of the incremental prediction as a value between 0.0 and 1.0. As such, the lines in the graph show the impact on the incremental predictions made by each additional scan from the sequence added to the subsequence. As shown, an incremental output may initially begin low, for example, as depicted with sequence 100, representing that the first scan or first few scans of the first subsequence may not contribute much to the overall prediction. Rather, the second subsequence increases the incremental output of sequence 100, representing that the second scan or second subsequence of scans contributes significantly to the final output. As the ultimate endpoint for sequence 100 is near 1.0, the sequence of scans may thus be labeled positive as having a particular feature. Sequence 110 depicts a scenario where all the scans in the sequence of scans are negative and thus, no significant increase is seen over the incremental predictions. Sequence 110 may thus ultimately be labeled negative as not having a particular feature. Although only these two scenarios – sequences which consistently generate a negative prediction and sequences which initially generate a negative prediction but as more scans are added to the sequence generate a positive prediction – are shown in FIG.1A, this is for example purposes only. In some examples, the first incremental prediction may be large, indicating a high potential for a positive output. The second incremental prediction may then decrease significantly, indicating that some artifact in the first subsequence may not be relevant to the output when looking at the sequential information between scans. FIG.1B shows the sequence of scans for sequence 100 and the incremental outputs using the enhanced machine learning model, according to some embodiments. For example, the first subsequence of scans may include only scan 120. The machine learning model determines that scan 120 is negative, having a low attention value and incremental prediction. The second subsequence may include scan 120 and scan 122. The machine learning model may determine 12246005.5 that scan 122 may include the feature and the subsequence is determined to have a high attention value and high incremental prediction. The process may iterate for the rest of the subsequences until all subsequences and scans have been analyzed, the last subsequence having all the scans from the sequence of scans. The final incremental prediction for scan 100 being 0.887, the machine learning model may determine that the sequence of scans shows the presence of a particular feature or disease. FIG.1C shows an example method 130 for analyzing a sequence of scans using an enhanced machine learning framework, according to some embodiments. Method 130 may begin at 132 by receiving a sequence of scans. As described above, the sequence of scans may be from any suitable diagnostic modality and may include scans representing a sequence of scans over time, space, or sub-scans divided from a single scan. Method 130 may proceed to determine a subsequence of scans at step 134 including a subset of scans from the sequence. The number of scans from the sequence of scans in the subsequence may be any suitable number. For example, the subsequence may include only one scan (e.g., the first scan) from the sequence of scans, two scans (e.g., the first two scans) from the sequence of scans, or any suitable number of scans from the sequence of scans. As described further below, during training of the machine learning model, it may be beneficial for the machine learning model to be exposed to all scans of the sequence. As such, the subsequence of scans may include half or more of the scans in the sequence of scans. Method 130 may then determine at step 136 an incremental prediction, using a machine learning model, based on the subsequence of scans. The incremental prediction may represent whether the subsequence of scans indicates a diagnostic feature. For example, the incremental prediction may include a numerical value indicating whether the subsequence of scans indicates a diagnostic feature. The numerical value may be within the range of 0 to 1, where 1 indicates a positive result where the subsequence indicates a diagnostic feature, and 0 indicates a negative result where the subsequence does not indicate the diagnostic feature. As such, where the incremental prediction includes a numerical value near 1, the incremental prediction may indicate a high likelihood that the subsequence of scans indicates a diagnostic feature, and where the numerical value is near 0, the incremental prediction may indicate a low likelihood that the subsequence of scans indicates the diagnostic feature. Optionally, method 130 may determine, concurrently or subsequently, an incremental difference between the current incremental prediction determined in step 136 for a second subsequence and a previous incremental 12246005.5 prediction determined in step 136 for a first subsequence. In that way, method 130 may determine the incremental difference between the incremental predictions to indicate the impact that the additional scan(s) in the second subsequence may have on the overall prediction determined in step 140B as described further below. Method 130 may determine whether all the scans of the sequence are in the subsequence at decision block 139. If method 130 determines that the subsequence includes less than all the scans of the sequence of scans, method 130 may proceed to step 140A to update the subsequence of scans. In some examples, updating the subsequence of scans may including adding the next subsequent scan from the sequence of scans to subsequence. For example, if the subsequence includes the first three scans from the sequence of scans, updating the subsequence of scans may include adding the fourth scan from the sequence of scans. After updating the subsequence of scans, method 130 may proceed back to step 136 and determine an incremental prediction based on the updated subsequence of scans. As such, method 130 may iteratively add scans to the subsequence and determine incremental predictions at each iterated subsequence of scans. In that way, the incremental predictions determined with method 130 may indicate the degree to which one or more scans contribute to an overall prediction. For example, a jump up in incremental predictions from a near-0 value to a near-1 value, or vice versa, between subsequences may indicate that the newly added scan contributes more to the overall prediction. Similarly, stable incremental predictions between subsequences may indicate that the newly added scan contributes less to the overall prediction. When method 130 determines all scans of the sequence of scans are in the subsequence, method 130 may proceed to step 140B to determine an overall prediction based on one or more of the incremental predictions. In some examples, the overall prediction may include a numerical value as described above with respect to the incremental predictions determined in step 136. The overall prediction may indicate whether the sequence of scans indicates a particular diagnostic feature. By using a series of incremental predictions to determine an overall prediction, the methods described herein (e.g., method 130 above) may provide both the overall prediction of whether the sequence of scans indicates a particular diagnostic feature and information indicative of the sequential information between scans. The incremental predictions may indicate whether the associated subsequence indicates the diagnostic feature of interest. The incremental differences may then indicate the impact that the associated subsequence has on the 12246005.5 overall prediction based on the sequential information. For example, a large incremental difference may indicate that the addition of scan(s) in the associated subsequence contributes more to the overall prediction than a small incremental difference. As such, the methods described herein may account for sequential information in a sequence of scans. TWO STREAM TRANSFORMER As described above, sequential information encoded in a sequence of scans may have the characteristic of being reverse-invariant, meaning the same result should be obtained when looking at the scans in a first direction versus a second direction. Accordingly, the inventors have developed systems and techniques for accounting for this reverse-invariant characteristic. In some instances, a machine learning model with a two-stream embedding may be provided. In some embodiments, the machine learning model may include a feature extractor, a positional encoding layer, and an attention module. In some embodiments, the feature extractor may be configured to receive a sequence of scans as input and preprocess the scans in the sequence of scans. The feature extractor may include a plurality of layers configured to receive the sequence of scans and perform the preprocessing. For example, the feature extractor may have an input layer configured to receive the sequence of scans. The feature extractor may have one or more layers configured to perform preprocessing on the sequence of scans (e.g., convolution layer, max pooling layer, etc.). In some embodiments, the positional encoding layer may be configured to receive the sequence of preprocessed scans from the feature extractor and further process the sequence of preprocessed scans so that the machine learning model accounts for the reverse-invariant characteristic of the sequence. In some embodiments, the positional encoding layer may perform dual positional embedding. The positional encoding layer may perform dual positional embedding by embedding positional information of the scans in the sequence. For example, the positional encoding layer may perform linear embedding to capture the sequential nature of the sequence of scans and may use a second embedding technique (e.g., Gaussian embedding) to capture the reverse-invariant characteristic. Although linear and Gaussian embeddings are used in the example below, it should be appreciated that any pair of types of embeddings may be used to capture both the sequential information and reverse-invariant characteristics. To provide an example, the positional encoding layer may construct a vector P ∈ ℝ^N×2 for a sequence comprising N scans. The first dimension of this vector may signify linear positional encoding, 12246005.5 while the second dimension may be gaussian positional encoding. The mathematical expressions for the encodings may be as follows: ^ _{^^,^^^^^^ =} _{^ × !} _{^^^^^ − 1} where s is the scaling factor. In some embodiments, the positional encoding vector P may be concatenated at each transformation stage so that the resulting output feature vector has dimensions N x (d + 2) where d is the feature dimensionality. This concatenation may reinforce the sequential and reverse-invariant information to amplify the machine learning model’s comprehension of the sequential information. The attention module may then be configured to receive the sequence of preprocessed scans from the positional encoding layer and analyze the sequence of preprocessed scans to determine whether a particular feature is seen (e.g., whether a patient has a particular disease). The attention module may have a first stream for analyzing the forward sequence (e.g., linear embedding of the sequence) and a second stream for analyzing the reverse sequence (e.g., Gaussian embedding of the sequence). The outputs of each stream may be an incremental prediction, or feature vector which can then be used to determine a prediction for the sequence or subsequence of scans. For example, the attention module may be configured to return one or more outputs that can be used to determine incremental predictions as described above with respect to FIGS.1A-1C. The attention module may include components configured to perform scan-level attention analyzing the features in an individual scan of the sequence. Sequence-level attention may be performed by the iterative method of incremental prediction as described with respect to FIGS.1A-1C. The inventors have recognized and appreciated the challenge of effectively leveraging a sequence’s label in conjunction with the bidirectional analysis. The inventors have recognized that the model should be exposed to all instances within the sequence of scans for a particular label during training. In some embodiments, the first stream and the second stream may receive a subsequence of scans cropped from the sequence of scans in opposite directions. In some embodiments, the crop may be based on a ratio (e.g., 50%). For example, if the crop ratio is 50%, the first stream would get the first half of the sequence of scans and the second stream 12246005.5 would get the second half of the sequence. The outputs of the two streams may then be integrated. For example, the final training loss may be a composite of Weighted Incremental loss for both the forward and reverse sequences and binary cross-entropy (BCE) loss according to the _{following equations:} FIG.2 shows an example machine learning model architecture to perform two-stream embedding of a sequence of scans having sequential information. The machine learning model architecture may have a feature extractor 202, a positional encoding layer 204, and an attention module 206. In some embodiments, feature extractor 202 may be configured to receive a sequence 100 of scans as input and preprocess the scans in sequence 100. Feature extractor 202 may have a plurality of layers (202A, 202B, 202C) configured to preprocess the scans in sequence 100. In some examples, feature extractor 202 may have an input layer 202A, hidden layer(s) 202B, and output layer 202C. For example, layers 202A-202C may include one or more of a convolutional layer, a max pooling layer, and/or any other suitable layer. In some embodiments, positional encoding layer 204 may be configured to receive the preprocessed sequence 100 and further process the preprocessed sequence 100 so that the machine learning model accounts for the reverse-invariant characteristic of the sequence. In some embodiments, positional encoding layer 204 may perform dual positional embedding. For example, positional encoding layer 204 may perform linear embedding to capture the sequential nature of the sequence of scans and may use a second embedding technique (e.g., Gaussian embedding) to capture the reverse-invariant characteristic. Positional encoding layer 204 may include a first transformer encoder 204A configured to perform linear embedding of the sequence in a first direction (e.g., from scan 1 to scan N, front sequence 100A) and a second transformer encoder 204B configured to perform the second embedding technique on the sequence in a second direction (e.g., from scan N to scan 1, reverse sequence 100B) to capture the reverse-invariant characteristic. In some embodiments, attention module 206 may then be configured to receive the preprocessed front sequence 100A and reverse sequence 100B from positional encoding layer 204 and analyze the sequence of preprocessed scans to determine whether a particular feature is seen (e.g., whether a patient has a particular disease). The attention module may have a first 12246005.5 stream 206A for analyzing the front sequence 100A (e.g., linear embedding of the sequence) and a second stream 206B for analyzing the reverse sequence 100B (e.g., Gaussian embedding of the sequence). For example, first stream 206A may include an instance-level attention component 208A and a positional attention block 210A, and second stream 206B may include instance-level attention component 208B and positional attention block 210B. The outputs 214A and 214B of each stream 206A-206B may be a prediction, or vector of incremental predictions (e.g., vectors 212A and 212B), which can then be processed to determine an overall prediction for the sequence of scans. For example, the attention module may be configured to determine incremental predictions as described above with respect to FIGS.1A-1C. Stream 206A may determine a first output 214A indicative of a prediction based on the sequence being analyzed in the first direction (e.g., front sequence 100A). Stream 206B may similarly determine a second output 214B indicative of a prediction based on the sequence being analyzed in the second direction (e.g., reverse sequence 100B). Both first and second outputs 214A and 214B may then be compared or otherwise aggregated to determine a prediction 220 for sequence 100. For example, first output 214A and second output 214B may be averaged to determine prediction 220. In that way, prediction 220 may account for both the sequential information of sequence 100 and the reverse invariant characteristic of sequence 100. In some examples, during training, sequence 100 may be labelled with a single label for the entire sequence 100. During training, first stream 206A may receive a first cropped sequence from sequence 100 and the second stream 206B may receive a second cropped sequence from sequence 100 as discussed above. The outputs from each stream 206A-206B may be used to calculate the weighted incremental loss for the forward (forward weighted incremental loss 216A) and reverse (reverse weighted incremental loss 216B) sequences. The outputs may similarly be combined and averaged to determine a BCE loss. The total loss, as described above, may be a composite of the forward and reverse weighted incremental losses 216A and 216B and the binary cross entropy (BCE) loss 222. The total loss may be used to update the model to provide more accurate predictions. By using a two-stream transformer architecture as described above, the architecture may account for the reverse-invariant characteristic of certain scan sequences. A first stream of the two-stream transformer architecture may analyze a sequence or subsequence of scans in a first direction (e.g., from scan 1 to scan N) and a second stream may analyze the sequence or subsequence of scans in a second direction (e.g., from scan N to scan 1). The outputs of the two 12246005.5 streams may be compared, or otherwise aggregated (e.g., averaged) to determine if the analysis satisfies the reverse-invariant characteristic of the scan sequence. By analyzing the sequence or subsequence in both directions, the two-stream transformer architecture may account for and validate that the analysis of the sequence satisfies the reverse-invariant characteristic of the certain scan sequence. MIL Uncertainty metric The inventors have further recognized and appreciated that including analysis of sequential information in a sequence of scans introduces additional variability across different sequences. As such, the inventors have developed validation techniques that account for the variability observed in incremental predictions across different sequences. In some aspects of the technology, a method of validating a machine learning model accounting for the variability across sequences is provided. In some embodiments, the method of validating the machine learning model may include determining an uncertainty metric. In some embodiments, the uncertainty metric may be determined by determining a dispersion-based uncertainty and an MIL-based uncertainty. In some embodiments, the dispersion-based uncertainty may be determined based on the variability and convergence of the machine learning model’s output. For example, a first metric including the standard deviation may be determined. The standard deviation may be given by: _{A second metric including the central part fluctuation range may be given by:} _{\ = max (^_^` ) − min (^_^` ).} Here, ^̅ may represent the mean output probability, and pmid refers to the central portion of the incremental outputs, typically formed by excluding one prediction from both ends. Both S and R may reflect the dispersion characteristics within the sequence of incremental predictions, capturing both the variability and convergence of a model’s output and providing insights into the model’s partial uncertainty regarding its predictions. In some embodiments, the MIL-based uncertainty may be determined based on the average distance of incremental predictions to the ground truth (e.g., 0 for negative and 1 for 12246005.5 positive), and the sequence’s first-order difference. The average distance to the ground truth may reflect the model’s average level of certainty for each subsequence. The average distance to ground truth may be given by: ^ _{∗ 1} _{a =} _{^ T min (|^_^ − 0|, |1 − ^_^ |)} _^H^ where si represents a weighting operation based on softmax functions and τ is a temperature parameter. The first-order difference may reflect the convergence of the model’s predictions where rapid convergence results in a small value, and vice versa. The first-order difference may be given by: In some embodiments, the four uncertainty metrics described above may be combined to produce an overall uncertainty metric for the machine learning model. In some embodiments, the k_lmn may be determined as a mean distance to the decision boundaries (e.g., ground truth values) when there are only one or two scans in the sequence. For more scans in the sequence, the k_lmn may be determined as: k_{lmn = N × 8$ + \ × 8^ + a × 8` + C × 8o} where w represents the corresponding weights determined through a cross-validation procedure. For example, the model and metric may be cross-validated on a training set to maximize the 12246005.5 accuracy of the model’s predictions when a percentage (e.g., 1%, 5%, 10%, 20%…) of the most uncertain samples are removed. Examples Having described various aspects of the technology, below are provided some non- limiting examples of implementations of aspects of the present technology. First, an example algorithm for implementing one or more (including all in some embodiments) of the above- described aspects of the technology is provided: UTD Classification Dataset: Urinary tract dilation (UTD) is a notable medical condition that has drawn considerable attention, primarily because of its frequent detection in approximately 1-2% of prenatal ultrasound (US) examinations (Chow et al., 2017; Nguyen et al., 2022). The UTD dataset comprises data from 1,186 patients. Each patient’s sequence of scans 12246005.5 forms a bag, and the number of scans per patient varies. The bag lengths in this dataset range widely from 1 to 59 scans. Both the training and testing sets have an average bag length of around 10 scans. For our experiments, 70% of the data was designated for training, 20% for testing, and the remaining 10% for validation (Same train-test splitting for other datasets). RSNA Dataset: The RSNA dataset is obtained from the 2019 Radiological Society of North America (RSNA) challenge, encompassing 50,862 CT slices from 1,200 patients. Following the preprocessing protocol established in (Wu et al., 2021), each CT slice was subjected to three distinct window settings applied to the original Hounsfield Units. This process, akin to the practice of radiologists, involves adjusting the window Width (W) and Center (C) to enhance the visualization of specific tissues in brain CTs. The chosen settings were brain (W: 80, C:40), subdural (W:200, C:80), and soft tissue (W:380, C: 40). Subsequently, all images were resized to a uniform dimension of 224 × 224 pixels and normalized within the range [0, 1]. Annotations were provided at both the individual slice and overall scan levels, and only scan-level labels were utilized in the experiment. SARS-CoV-2 CT-Scan Dataset: In response to the COVID-19 global health crisis, the SARS-CoV-2 CT-Scan Dataset has been introduced as an essential tool for medical research and diagnostic advancements (Soares et al., 2023). This dataset incorporates 4,173 CT scans from 210 unique patients, which includes 2,168 scans from 80 patients with verified COVID-19 infections. It also contains scans that represent a range of pulmonary conditions, as well as scans from individuals without any diagnosed condition. For the purposes of study, all patients were categorized into two groups—healthy and diseased—and aggregated their images into bags, resizing each image to 224 by 224 pixels. The enhanced MIL framework model is evaluated against leading benchmarks, including SA-DMIL (Wu et al., 2023), and commonly used MIL aggregators such as max pooling (Shen et al., 2021) and attention-MIL pooling (Ilse et al., 2018). The results in Table 1 below are based on the best hyperparameter settings as reported in the original publications. Performance metrics such as Accuracy (Acc), Precision (Pre), Recall (Rec), and F1 Score (F1) are reported for the UTD and RSNA datasets. (Averaged over five random seeds) For the Cov-2 CT dataset, which is relatively smaller in size, only Accuracy is reported. The results indicate that the enhanced MIL framework model as described herein achieves strong performance on multiple datasets, highlighting the importance of leveraging sequential information. Furthermore, the advantage of the two-stream model over single stream models suggests that learning from bidirectional sequential information can further improve the effectiveness of the model. 12246005.5 Table 1. Further presented herein is the value of sequence information by demonstrating the effectiveness of sequence-based uncertainty, constructed based on incremental prediction sequences. FIG.3A illustrates how sequence-based uncertainty adeptly reflects fluctuations within incremental predictions, revealing that sequences with greater volatility and slower convergence exhibit higher uncertainty. These fluctuations often correspond to instances that are challenging to classify, aligning with clinical assessments. FIG.3A shows a sample uncertainty comparison between incremental predictions of multiple sequences of scans 300, 302, 304, and 306, according to some embodiments. The sequences with greater fluctuation and slower convergence may exhibit higher uncertainty scores (e.g., sequence 302), as indicated by the SMILU scores in FIG.3A. Accordingly, the methods and techniques described herein may further provide information indicative of the difficulty to classify certain sequences of scans that may have weaker signals. FIG.3B demonstrates the performance enhancement (Accuracy) on various datasets after discarding a proportion of predictions with the highest uncertainty. Notably, on the UTD dataset 310, accuracy improves from 93% to 96% after excluding the top 20% of the most uncertain predictions. Similar significant improvements are observed on the other datasets (Covid dataset 314, and RSNA dataset 312) when employing the uncertainty metric, affirming the efficacy of our proposed sequence-based uncertainty approach. As depicted in FIG.3B, different stippling is used to differentiate between the UTD dataset 310, RSNA dataset 312, and Covid dataset 314. In addition to performance metrics, the model may exhibit desirable characteristics such as rapid convergence in predictions and invariance to sequence reversal. Notably, most models, due to their lack of consideration for sequential information and reliance on a feature extractor- aggregator architecture, possess reverse-invariance. Having already established the significance of sequential information as described above, aspects of the present technology maintain robustness against transformations such as cropping and reversing. FIG.4A displays the performance of a two-stream model consistent with aspects of the present technology (400) 12246005.5 across the example UTD dataset, as compared to conventional methods. FIG.4B displays the performance of the two-stream model according to aspects of the present technology (410) across the RNSA dataset, as compared to conventional methods. FIGS.4A and 4B demonstrate that with the incorporation of sequential information, a two stream MIL model maintains robustness to crop and reverse transformations, as evidenced by the results. The conventional methods depicted in FIGS.4A and 4B include Smooth Attention Deep Multiple Instance Learning (SA-DMIL) (402, 412), Attention-based Deep Multiple Instance Learning (ADMIL) (404, 414), and MaxPooling (406, 416). Each method is differentiated in FIGS.4A and 4B by utilizing different stippling patterns. FIG.5 illustrates a computer system 500 which may operate in connection with aspects of the present disclosure, including performing the techniques described herein in at least some embodiments. The computer system 500 may be configured to perform various methods and acts as described in FIGs.1A-4B. The computer system 500 may include on or more processors 510, and one or more non-transitory computer-readable storage media (e.g., memory 520 and one or more non-volatile storage media 530) and a display 540. The processor 510 may control writing data to and reading data from the memory 520 and the non-volatile storage device 530 in any suitable manner, as the aspects of the invention described herein are not limited in this respect. In some embodiments, the computer system 500 may also be a complete system on module (SOM), which include CPU, GPU, memory, and any other components in a system. In other variations, the system may not need to include a memory, but instead programming instructions are running on one or more virtual machines or one or more containers on a Cloud. For example, the various methods illustrated above may be implemented by a server on a Cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, virtual network and applications, and the programming instructions for performing any of the methods or implementing any of the techniques described herein. To perform functionality and/or techniques described herein, the processor 510 may execute one or more instructions stored in one or more computer-readable storage media (e.g., the memory 520, storage media, etc.), which may serve as non-transitory computer-readable storage media storing processor-executable instructions for execution by the processor 510. In connection with techniques described herein, code used to perform any of the methods described herein may be stored on one or more computer-readable storage media of computer system 500 and processor 510 may execute any such code. Any other software, programs or 12246005.5 instructions described herein may also be stored and executed by computer system 500. It will be appreciated that computer code may be applied to one or more aspects of methods and techniques described herein. The indefinite articles “a” and “an,” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.” The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements. The terms “approximately” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and yet within ±2% of a target value in some embodiments. The terms “approximately” and “about” may include the target value. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having,” 12246005.5 “containing,” “involving,” and variations thereof herein, is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Having described above several aspects of at least one embodiment, it is to be appreciated various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be object of this disclosure. Accordingly, the foregoing description and drawings are by way of example only. 12246005.5

Claims

CLAIMS What is claimed is: 1. A method of using multiple instance learning to analyze medical images, comprising: receiving a first subsequence of medical images representing a subsequence of a first sequence of medical images; determining, with a processor implementing a machine learning model, a first incremental prediction based on the first subsequence of medical images; receiving an additional medical image of the sequence of medical images such that that additional medical image and the first subsequence of medical images together form a second subsequence of medical images of the sequence of medical images; determining, with the processor implementing the machine learning model, a second incremental prediction based on the second subsequence of medical images; and determining a medical diagnosis based on the sequence of medical images.

2. The method of claim 1, wherein the sequence of medical images represents a time sequence of medical images.

3. The method of claim 1, wherein the sequence of medical images represents a spatial sequence of medical images.

4. The method of claim 1, wherein the sequence of medical images comprises ultrasound images, computed tomography images, or magnetic resonance images. 12246005.5

5. The method of claim 1, further comprising forming multiple subsequences of medical images beyond the first subsequence and second subsequence, the method further comprising determining an incremental prediction for each subsequence of the multiple subsequences.

6. The method of claim 1, wherein determining the first incremental prediction comprises performing a dual embedding with the processor implementing the machine learning model.

7. The method of claim 6, wherein performing the dual embedding comprises embedding positional information of each medical image in the first subsequence.

8. The method of claim 6, wherein performing the dual embedding comprises performing a linear embedding and a second embedding that is not a linear embedding.

9. The method of claim 8, wherein the second embedding is a Gaussian embedding.

10. The method of claim 6, wherein performing the dual embedding comprises determining two embeddings to account for a reverse-invariant nature of the first subsequence of medical images.

11. The method of claim 1, further comprising generating an uncertainty metric relating to the first incremental prediction. 12246005.5

12. A computer readable storage medium storing processor executable instructions which, when executed, cause a processor to perform a method of using multiple instance learning to analyze medical images, the method comprising: receiving a first subsequence of medical images representing a subsequence of a first sequence of medical images; determining, with a processor implementing a machine learning model, a first incremental prediction based on the first subsequence of medical images; receiving an additional medical image of the sequence of medical images such that that additional medical image and the first subsequence of medical images together form a second subsequence of medical images of the sequence of medical images; determining, with the processor implementing the machine learning model, a second incremental prediction based on the second subsequence of medical images; and determining a medical diagnosis based on the sequence of medical images.

13. The computer readable storage medium of claim 12, wherein the sequence of medical images represents a time sequence of medical images.

14. The computer readable storage medium of claim 12, wherein the sequence of medical images represents a spatial sequence of medical images.

15. The computer readable storage medium of claim 12, wherein the sequence of medical images comprises ultrasound images, computed tomography images, or magnetic resonance images. 12246005.5

16. The computer readable storage medium of claim 12, wherein the method further comprises forming multiple subsequences of medical images beyond the first subsequence and second subsequence, the method further comprising determining an incremental prediction for each subsequence of the multiple subsequences.

17. The computer readable storage medium of claim 12, wherein determining the first incremental prediction comprises performing a dual embedding with the processor implementing the machine learning model.

18. The computer readable storage medium of claim 17, wherein performing the dual embedding comprises embedding positional information of each medical image in the first subsequence.

19. The computer readable storage medium of claim 17, wherein performing the dual embedding comprises performing a linear embedding and a second embedding that is not a linear embedding.

20. The computer readable storage medium of claim 19, wherein the second embedding is a Gaussian embedding.

21. The computer readable storage medium of claim 17, wherein performing the dual embedding comprises determining two embeddings to account for a reverse-invariant nature of the first subsequence of medical images. 12246005.5

22. The computer readable storage medium of claim 12, further comprising generating an uncertainty metric relating to the first incremental prediction. 12246005.5