WO2025059185A1

WO2025059185A1 - Reliability assessment analysis and calibration for artificial intelligence classification

Info

Publication number: WO2025059185A1
Application number: PCT/US2024/046208
Authority: WO
Inventors: Farshid ALAMBEIGI; Siddhartha KAPURIA; Sandeep Chinchali
Original assignee: University of Texas System; University of Texas at Austin
Current assignee: University of Texas System; University of Texas at Austin
Priority date: 2023-09-11
Filing date: 2024-09-11
Publication date: 2025-03-20
Anticipated expiration: 2026-03-11

Abstract

Disclosed herein is a system and method for the calibration of image classification from machine learning models and interactive artificial intelligence systems based thereon. An example of an interactive AI system using the same is disclosed. The system and method may provide a means for user interaction with the system and method.

Description

Attorney Docket No.10046-568WO1 RELIABILITY ASSESSMENT ANALYSIS AND CALIBRATION FOR ARTIFICIAL INTELLIGENCE CLASSIFICATION Related Applications [0001] This application claims priority to, and the benefit of, U.S. Provisional Patent Application No.63/537,651, filed September 11, 2023, entitled “RELIABILITY ASSESSMENT ANALYSIS AND CALIBRATION FOR ARTIFICIAL INTELLIGENCE CLASSIFICATION,” which is hereby incorporated by reference herein in its entirety. Technical Field [0002] This disclosure generally relates to the calibration of machine learning classifications and natural language-based artificial intelligence systems. Background [0003] The use of computer-aided diagnostics and artificial intelligence (AI) techniques has proliferated in the past decades and has demonstrated great potential in image and data classification applications. Various classification models can be developed using various known machine learning or AI architectures and take as input any number of images or data (e.g., medical images, engineering data), in many cases leading to black box techniques where it is indeterminate how accurate a model is to the target system. [0004] As the use has increased, so has the complexity of the underlying machine learning architecture. In many diagnostic or classification applications, various forms of neural networks are constructed with increasing number and complexity of nodes and layers and the breadth of training data has grown to increase the predictive capabilities. [0005] These methods enhance the accuracy of screening tests and aid endoscopists in identifying lesions during colonoscopy. Different models, ranging from simpler algorithms like SVM and KNNs to more intricate deep neural networks, have been employed for this purpose. However, these models often struggle when working with limited datasets, leading to overfitting issues that undermine their effectiveness and reliability in clinical settings. Additionally, they exhibit inadequate performance when confronted with imbalanced datasets. Specifically, models trained on imbalanced datasets are prone to biases; that is, they are more likely to miss classes that are underrepresented in the dataset. Furthermore, a review of the literature supports the assertion that most works on AI in healthcare focus on standard statistical metrics such as accuracy, sensitivity, and precision to report model performance. However, this is not sufficient since, by using such metrics, the potential risks of the model Attorney Docket No.10046-568WO1 are overlooked. In AI use cases in settings with the potential for serious harm to people (such as cancer diagnosis), requirements such as reliability, assurance, transparency, and meaningful estimates of confidence may aid clinicians in making more intuitive and informed decisions. Summary [0006] Exemplary methods and systems are disclosed herein related to the post- processing of machine learning and/or artificial intelligence classification models. In a first implementation, the exemplary methods and systems are configured to calibrate ML/AI classification output using scaling techniques. In a second implementation, the exemplary methods and systems provide one or more of the ML/AI classification predictions up to a user-defined confidence level (allowable error rate) as a means for user-ML/AI model interaction. Application of the exemplary methods and systems provides a platform for users to interact with the underlying ML/AI classification model based on a measure of user confidence in the model. The exemplary methods and systems are well suited for applications where classification can be subjective and require a level of user interpretability (e.g., medical diagnostics, complex engineering problems, financial modeling, etc.). [0007] Other implementations of the exemplary methods and systems include an interactive AI platform for end-users. In some implementations, a set of synthetic data may be produced using a generative network (e.g., generative adaptive network or stable diffusion). The set of synthetic data may be used for training the underlying ML/AI classification model in applications where robust data sets are limited (e.g., medical diagnostic images and engineering data at failure states). The interactive AI platform, comprising the methods and systems, allows an end-user to broaden or narrow the confidence window to learn proper identification and to use text querying to search for a database for supporting information of a classification output (e.g., supporting scientific literature or historical medical images or synthetic medical images of the same or different classes). When synthetic data is used, the ground truth will be known in the interactive AI platform. [0008] An example post-processing method and system are disclosed for the calibration of image classification from machine learning models. An example of an interactive AI platform using the same is disclosed. [0009] In some aspects, the techniques described herein relate to a method for post- processing a machine learning classification, the method including: classifying a dataset as one or more classes using a trained machine learning classifier and calculating an associated Attorney Docket No.10046-568WO1 probability score for each of the one or more classes; calibrating the probability score of the classification of the trained machine learning classifier using a regression operator (e.g., using the Cascade Reliability Framework (CRF)); and displaying as a report, the calibrated probability score for the classification of the machine learning classifier. [0010] In some aspects, the techniques described herein relate to a system for post- processing a machine learning classification, the system including: one or more processors; an output device; and a memory, the memory storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method including: classifying a dataset as one or more classes using a machine learning classifier and calculating an associated probability score for each of the one or more classes; calibrating the probability score of the classification of the machine learning classifier using a regression operator (e.g., using the Cascade Reliability Framework (CRF)); and displaying as a report or display, on the output device, the calibrated probability score for the classification of the machine learning classifier. [0011] In some aspects, the techniques described herein relate to an interactive artificial intelligence system, the system including: one or more processors; an output device; an input device; and two or more data storage devices, a first data storage device storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method including: classifying an image and generating an associated probability score using a trained machine learning classifier; receiving, from a user by an input device, an error rate variable value; calibrating the probability score of the image classification of the machine learning classifier using a regression operator (e.g., a Cascade Reliability Framework (CRF)) wherein the confidence of the calibrated probability score is related to the user's error rate variable value; calculating an attention map of the image (e.g., vision-transformer based classifier); adding descriptive text to the attention map; displaying, on the output device, the calibrated probability score for the image classification of the machine learning classifier, the attention map and descriptive text . Brief Description of the Drawings [0012] Figs.1A-1F show illustrative embodiments of an exemplary system. [0013] Fig.2 shows an exemplary method. [0014] Figs.3A-3B show example implementations of the system in a medical classification model. [0015] Fig.4 shows sample outputs of the vision-transformer model. Attorney Docket No.10046-568WO1 [0016] Fig.5 shows the architecture of the multi-modal explainability model. [0017] Fig.6 shows an architecture of a used Dilated Residual Network as the standard machine learning algorithm for the proposed CRF f ramework. [0018] Fig.7 shows images of the real colorectal cancer (CRC) polyp types as well as the fabricated phantoms replicating them. [0019] Figs.8A-8B show visual representations of the (8A) noise and (8B) blur levels used in the datasets. Higher levels of noise and blur lead to information loss and were considered non-ideal inputs. [0020] Figs.9A-9C show an exemplary algorithm including dataset pre-processing (Fig.9A), CRF Calibration (Fig.9B), and Evaluation (Fig.9C). [0021] Figs.10A-10B show reliability plots of accuracy versus confidence intervals for (10A) uncalibrated (CRF-1/2) and (10B) calibrated (CRF-3/4) models. For perfect calibration accuracy and confidence plot the identity function, miscalibration is seen as deviations from this trend. [0022] Figs.11A-11B show the sensitivity analysis results to find optimal combinations of λ and kreg for CRF-2 and CRF-4 models. Fig.11A shows the average set size, and Fig.11B shows the average coverage for different combinations of λ and kreg for these models. [0023] Figs.12A-12B shows plots of accuracy and confidence versus different levels of blur and noise. The model was tested on increasing levels of noise and blur, ranging from σ = 1 to 256 for blur (Fig.12A) and σ = 1 to 50 for noise (Fig.12B). [0024] Figs.13A-13B show plots of (13A) average coverage and (13B) average set size for the four different CRF models considered. [0025] Fig.14 shows the class-wise coverage of the four CRC polyps using the four CRF models calculated with different error rates of 0.2, 0.1, and 0.01. [0026] Figs.15A-15F show representative outputs of the proposed CRF framework, including the predicted polyp types as well as their corresponding confidences, compared with the ground truth. Results were generated using the CRF-4 model and a user-chosen error rate of α = 0.1. [0027] Fig.16 shows an illustrative embodiment of a computing device. [0028] Figs.17A-17D show correlation plots of synthetic and real data using t- distributed stochastic neighbor embedding (Fig.17A); multidimensional scaling (Fig.17B); Attorney Docket No.10046-568WO1 principal component analysis (Fig.17C), and uniform manifold approximation and projection (Fig.17D). [0029] Figs.18A-18D show correlation plots of testing and training data using t- distributed stochastic neighbor embedding (Fig.18A); multidimensional scaling (Fig.18B); principal component analysis (Fig.18C); and uniform manifold approximation and projection (Fig.18D). [0030] Figs.19A-19B show an accuracy analysis of synthetic data set for accuracy over time (Fig.19A) and noise and blur (Fig.19B). [0031] Figs.20A-20B show self-attention visualization method results for attention maps (Fig.20A) and GradCAM (Fig.20B). [0032] Figs.21A-21B show SHAP attention visualization methods Deep Explainer (Fig.21A) and Kernel Explainer (Fig.21B). [0033] Figs.22A-22B show comparisons between two methods: Self Attention versus SHAP Deep Explainer. [0034] Fig.23 shows the experimental setup: (1) CRC polyp phantoms, (2) Mark-10 Series 5 Digital Force Gauge, (3) M-UMR12.40 Precision Linear Stage, (4) HySenSe sensor, (5) Raspberry Pi 4 Model B, (6) HySenSe image output. (7) Dimensions of polyp phantom (8) HySenSe: Side view, top view, and dimensions. [0035] Fig.24 shows Algorithm 1 for synthetic data for training evaluation. [0036] Fig.25 shows an overview of the Denoising Diffusion Probabilistic Models (DDPM) architecture, a class-specific unconditional diffusion model. [0037] Fig.26 shows the proposed synthetic data augmentation process using DDPM. The CRC polyp phantom dataset is divided into sub-classes and passed to dedicated DDPM Image Generation Models. All images are then compiled together to form the final augmented dataset. [0038] Fig.27 shows synthetic images generated by each diffusion model contrasted with the corresponding real, textural images. [0039] Figs.28A-28C show validation and training set Accuracy curves for (Fig. 28A) Dilated ResNet (Fig.28B) ResNet18. and (Fig.28C) VGG16 when trained on X% synthetic data. Training curves at other X values are omitted to preserve clarity. [0040] Fig.29 shows a plot of the accuracy of the best-performing model for each architecture versus the amount of synthetic data utilized during training. [0041] Fig.30 shows the experimental setup, including: (3010) KUKA LBR Med 14 R820 (KUKA AG); (3020) Raspberry Pi 4 Model B; (3030) 3D printed mounting plate for Attorney Docket No.10046-568WO1 the tumor phantoms; (3040) Top view of HySenSe sensor showing all components; (3050) Example CAD model of synthetic AGC polyp phantom; (3060) Example partial textural image output of AGC tumor phantom. The figure also shows the defined reference frames R: Robot/World, B: Robot Flange, C: Camera, T: Target, and H: HySenSe base. [0042] Fig.31 shows a few representative fabricated AGC lesion phantoms and their dimensions. The first column shows the schematics [19] of four types of AGC tumors under the Borrmann classification. The second column indicates the corresponding real clinical endoscopic images [20]. The third and fourth columns show a sample-designed CAD model and a 3D-printed tumor for each class. Other columns show corresponding VTS outputs. [0043] Fig.32 shows the hyperparameter search for the three models (ResNet18, DRN, and AlexNet). Each line encodes a particular configuration. [0044] Fig.33 shows stratified 5-fold cross-validation results for the three models (ResNet18, Dilated ResNet, AlexNet). Average accuracy curves are reported, with the shaded region depicting the standard deviation across folds. [0045] Fig.34 shows normalized confusion matrices for all three models configured with their corresponding best hyperparameters. [0046] Figs.35A-35B show an overview of a class-conditioned diffusion model for data generation for training (Fig.35A) and inference (Fig.35B). [0047] Fig.36 shows the FID score versus the number of training steps for different hyperparameters and the minimum FID score of the hyperparameter configurations. [0048] Figs.37A-37B show diffusion model example outputs for creating synthetic images of Borrmann gastric tumors versus real image samples. Fig.37A shows real and synthetic images of 3D-printed gastric tumor phantoms. Fig.37B shows real and synthetic images of real tumors (collected from patients at MD Anderson). [0049] Figs.38A-38C show testing results of different methods of data addition: the composition of real and synthetic data used (Fig.38A); cross-validation results of training the Dilated ResNet using increasing amounts of synthetic data using random-add method (Fig. 38B) and scale-by-inverse method (Fig.38C). [0050] Fig.39 shows different augmentation scenarios. Without: The baseline scenario with no augmentations. Grayscale: Converted all images to grayscale. Hue, saturation, brightness, and contrast: changing color space of the image. All: The combination of all scenarios except Grayscale. [0051] Fig.40 shows a comparison of ResNet18 model performance metrics by scenario on the simulated test set. Attorney Docket No.10046-568WO1 [0052] Fig.41 shows a comparison of ResNet18 model performance metrics by scenario on the original test set. Detailed Specification [0053] To facilitate an understanding of the principles and features of various embodiments of the present invention, they are explained hereinafter with reference to their implementation in illustrative embodiments. [0054] Figs.1A – 1F show an exemplary system for post-processing of machine learning classifications that provide calibrated classifications within a user-defined confidence level in accordance with the illustrative embodiments. In Fig.1A, the system includes a “Machine Learning Classifier” module 120 that is configured to generate a machine learning classification output (121) from a given input (shown as “Medical Images” (111)) and to additionally provide calibration operations on such outputs. [0055] In Fig.1A, the system 100a includes the machine learning classifier module 120, a cascade reliability module 130, and means for user input 141 and output display 140. In some examples, the machine learning classifier module 120 is stored and executed on a first computing device, and the cascade reliability module 130 is stored and executed on a second computing device. Alternatively, the machine learning classifier module 120 and the cascade reliability module 130 are stored and executed on the first computing device. One or more of the first and second computing devices are hard-wired for communication or wirelessly connected. In some examples, the user input is received, and output is displayed at a first or second computing device via a graphical user interface in communication with the cascade reliability module 130. Alternatively, the user input is received, and output is displayed via an internet webpage operating on a user device. In some examples, the means for user input 141 and output display 140 are connected to the first or second computing device. Alternatively, the means for user input 141 and output display 140 is a user device in communication with the first or second computing device. [0056] In some examples, the machine learning classifier module 120 includes a support vector machine (SVM), a neural network, a convoluted neural network (CNN), a densely connected CNN, or a residual network type classifier. In some examples, the machine learning classifier utilizes transfer learning, wherein a large dataset for general image identification is first used for training, and then a smaller dataset of object-specific images is used for further training. While several examples are given, it is contemplated that any suitable artificial intelligence/machine learning classifier architecture, including object Attorney Docket No.10046-568WO1 detection and semantic segmentation, may be used in the machine learning classifier module 120. [0057] In some instances, the classification output 121 includes one or more classes and associated probability scores. [0058] In Fig.1A, the cascade reliability module 130 comprises submodules, confidence calibration module 131, and conformal prediction module 132 that are arranged in a cascading algorithm structure (i.e., cascade reliability module 130). The confidence calibration module 131 is configured to provide the calibration operation on the classification output 121. The conformal prediction module 132 receives the user input 141, which is a user-defined error rate, and is configured to provide the most probable classifications of the classification output 121, which includes the ground truth. [0059] In some instances, the calibration operation (e.g., confidence calibration module 131) includes temperature scaling, variational temperature scaling, or another regression operator. The confidence calibration module utilizes a subset (e.g., hold-out, validation) of training data of the machine learning classifier to calibrate the uncertainty of the ML classifier model. [0060] The conformal prediction module 132 comprises the Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm and is configured to rank the one or more classes of the classification prediction and associated probability scores. The calibrated probability of the ranked classes is output up to one minus the error rate variable (e.g., user input 141). [0061] In connection with Fig.1A, the method of operation of the system 100a includes first receiving medical images from a medical imaging device, which is processed on a computing device using the machine learning classifier 120. The resulting classification output 121 is calibrated using the cascade reliability module 130 by first executing the confidence calibration module 131 and then the conformal prediction module 132. In some instances, the conformal prediction module 132 may be executed first, then the confidence calibration module 131 may be executed, or both modules may be executed in parallel, depending on the configuration of the computing device. A user input 141, error rate variable value, is provided and used in the conformal prediction module 132, thereby providing a means for user interaction with the system. A report is displayed on the output device, wherein the report includes the calibrated probability score, associated class, and/or the associated image. Attorney Docket No.10046-568WO1 [0062] Fig.1B shows the system 100b includes a generative model 110 configured to produce a plurality of synthetic images (shown as medical images 111’), the machine learning classifier module 120, the cascade reliability module 130, and means for user input 141 and output display 140. In some examples, the GAN module 110 is stored and executed on a first computing device together with the machine learning classifier module 120, and the cascade reliability module 130 is stored and executed on a second computing device. Alternatively, the GAN module 110 is stored and executed on a first computing device, and the machine learning classifier module 120 and the cascade reliability module 130 are stored and executed on the second computing device. In yet another alternative, the GAN module 110, the machine learning classifier module 120, and the cascade reliability module 130 are stored and executed on the first computing device. The first, second, and/or third computing devices are hardwired for communication or wirelessly connected. The location and connection of the system modules may be optimized for the needs of the application. [0063] In some examples, the user input is received, and output is displayed at a first, second, or third computing device via a graphical user interface in communication with the cascade reliability module 130. In other examples, the user input is received, and output is displayed via an internet webpage located on a user device. In some examples, the means for user input 141 and output display 140 are connected to the first, second, or third computing device. In other examples, the means for user input 141 and output display 140 is a user device in communication with the first, second, or third computing device. [0064] In Fig.1B, the cascade reliability module 130 comprises submodules, confidence calibration 131, and conformal prediction 132. The conformal prediction module 132 receives the user input 141, which is a user-defined error rate. [0065] In connection with Fig.1B, the exemplary system 100b includes the ML classifier module 120 that is configured to operate on a plurality of medical images generated from a generative model 110, which is processed on the computing device using a machine learning classifier 120. In some examples, the generative model 110 is a diffusion model or a Generative Adversarial Network (GAN). The plurality of images may be partitioned into a test set, a calibration set, and a training set. The training set of images is augmented by adding random noise, random blur, random rotations, random cropping, and vertical and horizontal flips before being used for training the machine learning classifier, and a subset of images is used for validation of the confidence calibration module. The trained machine learning classifier is configured to produce classification output 121. The resulting classifications are calibrated using the cascade reliability module 130 by first executing the Attorney Docket No.10046-568WO1 confidence calibration module 131 and then the conformal prediction module 132. In some instances, the conformal prediction module 132 may be executed first, then the confidence calibration module 131 may be executed, or both modules may be executed in parallel depending on the configuration of the computing devices. A report is displayed on the output device; the report includes the calibrated probability score, associated class, and/or the associated image. In this example, a user inputs 141 an error rate variable value, which is used in the conformal prediction module 132, thereby providing a means for user interaction with the system. [0066] The use of synthetic images (or datasets generally) provides robustness and generalizability of the machine learning classifier model. The synthetic medical images are generated using a text-condition generative model, such as a diffusion model or a GAN. For example, an input may be for an image of a polyp with a specific cancer subtype with visual occlusions in the top left of the image. Then, these synthetic images are augmented with the real training images. Finally, the ML model is re-trained on the augmented dataset to improve its accuracy and robustness on held-out test data. [0067] In Fig 1C, the machine learning classifier module 120 is configured to classify engineering images 112 and provide uncalibrated machine learning classification output 121. In other examples, the machine learning classifier module 120 may be configured to classify financial projections or insurance adjustments, LiDAR, radar, and other multi-dimensional datasets and images. It is noted that the systems and methods may be applied to multi- dimensional datasets as well as images. It should be understood that in the context of this description, datasets and images can be used interchangeably, and likewise, subunits of datasets (indices, patches) and images (pixels) may be interchanged in the description and still be part of the system and methods. It should further be understood that the systems and methods described herein may be adapted for domain-specific classification and still be part of the systems and methods of this description. [0068] In Fig.1C, the exemplary system 100c includes the ML classifier module 120 that is configured to operate on engineering images from an imaging device, which may be processed on the one or more processors using a machine learning classifier 120. The resulting classifications may be calibrated using the cascade reliability module 130, by first executing the confidence calibration module 131 and then the conformal prediction module 132. In some instances, the conformal prediction module 132 may be executed first, then the confidence calibration module 131 may be executed, or both modules may be executed in parallel, depending on the configuration of the computing device. A report may be displayed Attorney Docket No.10046-568WO1 on the output device 140”; the report may include the calibrated probability score, associated class, and the associated image. In this example, a user inputs 141”, an error rate variable value, which is used in the conformal prediction module 132, thereby providing a means for user interaction with the system. [0069] In Fig.1D, the exemplary system 100d includes the machine learning classifier module 120, the cascade reliability module 130’, and a means for displaying output 140’. In some examples, the machine learning classifier module 120 is stored and executed on a first computing device, and the cascade reliability module 130’ is stored and executed on a second computing device. In other examples, the machine learning classifier module 120 and the cascade reliability module 130’ are stored and executed on the first computing device. One or more of the first and second computing devices are hardwired for communication or wirelessly connected. In some examples, the output is displayed at a first or second computing device via a graphical user interface in communication with the cascade reliability module 130’. In other examples, the output is displayed via an internet webpage located on a user device. In some examples, the means for output display 140’ is connected to the first or second computing device. In other examples, the means for output display 140’ is a user device in communication with the first or second computing device. In Fig.1D, the cascade reliability module 130’ comprises a submodule, confidence calibration 131, configured to provide the calibration operation on the machine learning classification output 121. [0070] In connection with Fig.1D, the method of operation of the system 100d includes receiving medical images from a medical imaging device, which is processed on the computing devices using a machine learning classifier 120. The resulting classifications are calibrated using the cascade reliability module 130’ by executing the confidence calibration module 131. A report is displayed on the output device, the report includes the calibrated probability score, associated class, and the associated image. [0071] Interactive Artificial Intelligence System [0072] Figs.1E and 1F show exemplary interactive artificial intelligence (AI) systems that are complementary to the exemplary systems of Figs.1A-1D. Additional features in Figs. 1E and 1F include the use of large language models (LLM) trained on a large corpus of medical text data to substantiate the predictions from the image-based classifier and provide external evidence to explain the classification. [0073] Without loss of generality, the LLM can be fine-tuned from an open-source public model and on public medical datasets and additionally may be fine-tuned on private Attorney Docket No.10046-568WO1 patient data from a hospital and stored securely to comply with patient privacy protection regulations (e.g., HIPAA, GDPR, etc.). [0074] An additional set of algorithms, collectively referred to as the ‘Multi-Modal Explainability Module’, takes as input a medical image (or other multi-dimensional datasets), predictions from the CRF, and LLM to output a text description to substantiate classifier prediction. An example text description may include: “According to paper ABC from PubMed, cancer subtype X tends to have polyps of a large, rough texture compared to subtype Y, which we clearly see on the bottom left of the patient image.” The multi-modal explainability Module ties together image predictions and text knowledge bases. [0075] In Fig.1E, the exemplary system 100e includes the machine learning classifier module 120, a generative model 110, a medical text database 150, a natural language model 160, the cascade reliability module 130, a multi-modal explainability module 170, a means for user input 141”, and a means for displaying output 140”. In some examples, the machine learning classifier module 120 is stored and executed on a first computing device together with the GAN 110; the medical text database 150 and natural language model 160 are stored and executed on a second computing device; the cascade reliability module 130 is stored and executed on a third computing device together with the multi-modal explainability module; and the means for user input 141” and displaying output 140” are stored and executed on a user device. In some examples, the machine learning classifier module 120 and the GAN 110 are stored and executed on a first computing device together with the medical text database 150 and natural language model 160; the cascade reliability module 130 is stored and executed on a second computing device together with the multi-modal explainability module; and the means for user input 141” and displaying output 140” are stored and executed on a user device. In yet another example, each module is stored and executed on individual remote computing devices. One or more of the first, second, third, and/or one or more of the individual remote computing devices are hardwired for communication or wirelessly connected. In some examples, the output is displayed at a first or second computing device via a graphical user interface in communication with the multi-modal explainability module 170. In other examples, the output is displayed via an internet webpage located on a user device. In some examples, the means for output display 140’ is connected to the second or third computing device. In other examples, the means for output display 140’ is a user device in communication with the second or third computing device. [0076] In Fig.1E, the cascade reliability module 130 comprises submodules, confidence calibration 131, configured to provide the calibration operation on the machine Attorney Docket No.10046-568WO1 learning classification output 121, and conformal prediction 132. The multi-modal explainability module 170 includes submodules vision-transformer 171 and text-transformer 172. In the example, the output of the cascade reliability module 130 is passed to the multi- modal explainability module 170. The user input 141” includes communication (143) to the cascade reliability module 130 and, independently, communication (144) with the multi- modal explainability module 170. The multi-modal explainability module 170 receives output from the natural language model 160. [0077] In connection with Fig.1E, the exemplary system 100e includes the ML classifier module 120, which is configured to operate on a plurality of medical images generated from a generative model 110. The plurality of images may be partitioned into a test set, a calibration set, and a training set. The training set of images is augmented by adding random noise, random blur, random rotations, random cropping, and vertical and horizontal flips before being used for training the machine learning classifier. The trained machine learning classifier is configured to produce machine learning classification output 121. The resulting classifications are calibrated using the cascade reliability module 130 by first executing the confidence calibration module 131 and then the conformal prediction module 132. The multi-modal explainability module 170 includes submodules that operate on the output of the cascade reliability module 130: the vision-transformer module 171 is configured to produce an attention map associated with the image(s), and a text-transformer module 172 is configured to provide descriptive text to the output display 140”. The medical text database 150 provides training data for a natural language model 160, which could be a large language model (LLM). [0078] An associated report is displayed on the output device; the report includes the calibrated probability score, associated class, the associated image, the associated attention map, and descriptive text. In this example, a user inputs 141” are an error rate variable value, which is used in the conformal prediction module 132, and text (or voice)-based queries, which are interpreted in the multi-modal explainability module, thereby providing a means for user interaction with the system. [0079] The means for user input 141” and display output 140” may be an interactive display configured to operate on a user device in a software application, including an interactive graphical user interface. Alternatively or additionally, the means for user input 141” and display output 140” may be an interactive display configured to operate on an internet webpage. In one example, the interactive display may provide a first pane that displays the output of the associated report and a second pane that includes user input prompt Attorney Docket No.10046-568WO1 areas for user input associated with communication 143 and communication 144. In some examples, the first pane may display one or more images or textual data in response to a user communication 144, such as providing substantiating clinical reports, medical records, or related medical images in addition to the associated report. In other examples, a third pane may display one or more images or textual data in response to a user communication 144, such as providing substantiating clinical reports, medical records, or related medical images. [0080] A user may input pre-determined search queries, for example: “I want an error rate of 1%. Please find me all the cancerous images that you predict.” [0081] The natural language-based user input may query the system in natural language or verbal commands (via speech-to-text), for example: “Point me to relevant papers that explain how polyps of two different subtypes appear visually. Use that to explain why you classified image A as cancerous but not image B.” [0082] Alternatively, the natural language-based user input may query historical data using natural language: “Find me all patients of a similar demographic that had a cancerous image looking like this for which the AI classifier had a confidence above 90%.” The pre- determined search queries may be varied in ways that could be understood by natural language models or may be in a language other than English. [0083] The natural language model 160 is also used to create a text-conditioned image generation model, such as stable diffusion. This will allow a physician to create realistic synthetic images from natural language. For example, the user may input: "Generate an image that looks like a specific type of cancer but with more small polyps of a specific texture." The user input is processed through the text-transformer module 172, then processed through the vector database to correlate the text-based data to vector-image data. The transformed user input is used to augment real data to improve the robustness of a computer vision model on real patient data. [0084] The role of the natural language model is two-fold. First, it allows for the generation of synthetic medical images conditioned on a text prompt, which provides key context. Second, a concise report for the surgeon is created using an LLM that highlights the top-most likely cancerous polyps, the associated confidence scores, etc. The LLM can be used to automatically create a concise report for the surgeon. This report will rank each image by the predicted pathogenicity, link to medical evidence from the literature, and provide a summary of the computer vision model's predictions and confidences. [0085] While the vision-transformer model can be used in conjunction with any machine learning classification model, using the example vision-transformer model enhances Attorney Docket No.10046-568WO1 the explainability of the entire process. The vision-transformer module is configured to calculate a heatmap, an attention map, or Shapley values of key data that are directly related to the machine learning classification output 121. When more than one explainability metric is computed, correlations between compatible metrics may be evaluated. [0086] The vision-transformer model may be a machine-learning classifier based on transformer architecture. Transformer architecture relies on a parallel multi-headed attention mechanism, which breaks down images into small pixels and gives higher weight to more important pixels of the image. The vision-transformer machine learning classifier may comprise hidden layers, which may have embedded information to classify important pixels of the image, i.e., providing visualization of parts of the image to which the model is paying attention. [0087] By leveraging the hidden layers, it is possible to extract information from the vision-transformer machine learning model output to provide visually interpretative information. This additional visual information is provided to end-users for visual interpretation of the classifications. The display output 140” in this instance may provide a set of possible labels, their associated confidences, and an overlayed heat map that guides the end-user towards the regions of interest that influenced the machine learning model’s decision, helping them to identify critical features and potential abnormalities. [0088] The vision transformer model is coupled with a text-transformer model, which adds additional context to the classifications using descriptive text. For example, after the display output has been generated including the classifications and associated heat maps, a caption describing the kind of classes detected in the image can be generated. Difficulties faced by the system, such as due to loss of information from noise and blur, or partial contact between a species and an imaging device, etc., will be reported, again indicating the confidence of the predictions of the system 100e. [0089] In some instances, the descriptive text is generated by a natural language model (e.g., LLM). The natural language model may include a large language model and may be trained on medical data, including medical papers, ontologies, and knowledge databases. [0090] In Fig.1F, the exemplary system 100f includes the machine learning classifier module 120, a medical images database 115, the medical text database 150, the natural language model 160, the cascade reliability module 130, the multi-modal explainability module 170, a means for user input 141”, and a means for displaying output 140”, and a vector database 180. In some examples, the medical image database 115 and the medical text Attorney Docket No.10046-568WO1 database 150 are stored and executed on a first computing device; the machine learning classifier module 120 and natural language model 160 are stored and executed on a second computing device; the cascade reliability module 130 is stored and executed on a third computing device together with the multi-modal explainability module; the means for user input 141” and displaying output 140” are stored and executed on a user device; and the vector database 180 is stored and executed on a fourth device. In some examples, the machine learning classifier module 120 and natural language model 160 are stored and executed on a first computing device together with the medical text database 150 and the medical image database 115; the cascade reliability module 130 is stored and executed on a second computing device together with the multi-modal explainability module; and the means for user input 141” and displaying output 140” are stored and executed on a user device. In yet another example, all modules are stored and executed on individual remote computing devices, the same computing device, or combinations thereof. One or more of the first, second, third, and/or one or more of the individual remote computing devices are hardwired for communication or wirelessly connected. [0091] In Fig.1F, the cascade reliability module 130 comprises submodules, confidence calibration 131, configured to provide the calibration operation on the machine learning classification output 121, and conformal prediction 132. The multi-modal explainability module 170 includes submodules vision-transformer 171 and test-transformer 172. In the example, the output of the cascade reliability module 130 is passed to the multi- modal explainability module 170. The user input 141” includes communication (143) to the cascade reliability module 130 and, independently, communication (144) with the multi- modal explainability module 170. The communication (144) with the multi-modal explainability module may be a search of labeled medical images in the one or more databases. The multi-modal explainability module 170 receives output from the natural language model 160. [0092] Further supporting the interactive system 100f, the vector database 180 receives 145 user input 141”, which may be a search query, such as a text string, including alpha-numeric characters, or Boolean operators. The output device is in communication 146 with the vector database 180, wherein the output device 140” receives the search query results, which may be semantic search results. [0093] In the illustrative examples, the system 100 (i.e., 100a, 100b, 100c, 100d, 100e) is configured to perform the method using the one or more computing devices having executable code stored thereon. The method includes classifying a dataset, wherein a dataset Attorney Docket No.10046-568WO1 may be images, such as medical images or engineering data, using a trained machine learning classifier 120 to calculate associated probability scores (i.e., machine learning output 121, classification output 121). In some examples, the associated probability scores are a Softmax output of a penultimate layer of the machine learning classifier 120. The method includes calibrating the associated probability scores of the classification output using a set of algorithms collectively referred to as the cascade reliability module 130. The cascade reliability module 130 may comprise a confidence calibration module 131, which may be a regression operator, and a conformal prediction module 132, which is arranged in a cascading algorithm structure. Finally, the method includes displaying, as a report or display, on the output device, the calibrated probability score of the classification output. [0094] As shown in relation to Figs.1A-1F, the dataset can include medical images, engineering images, or synthetic images. In some examples, the images comprise augmented images or a plurality of augmented images, wherein augmentation may comprise adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. The machine learning classifier may be trained using the plurality of augmented images. [0095] In some examples, the machine learning module 120, the cascade reliability framework 130, and the multi-modal explainability module 170 are used independently or in combination. For example, the machine learning module 120 may be used with the multi- modal explainability module 170; the cascade reliability module 130 may be used with the multi-modal explainability module 170; or the machine learning module 120 may be used with the cascade reliability module 130. [0096] Indeed, the system 100 includes one or more computing devices, wherein each computing device comprises one or more processors, an input and output device, and a memory; the memory storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method. In Fig.16, an illustrative embodiment of a computing device 1600 is shown. It should be understood that Figs.1A-1F show illustrative embodiments of the system and other variations or permutations of the modules on the one or more computing devices is possible and still describe the intended system 100. [0097] Example Method of Operation [0098] Fig.2 shows a method 200 for post-processing image classifications. The method 200 includes classifying 210 a dataset (e.g., 111) as one or more classes using a trained machine learning classifier (e.g., 120) and providing the classification output (e.g., Attorney Docket No.10046-568WO1 ML output 121). In some examples, the machine learning classifier provides one or more class predictions and associated probability scores for each class prediction. [0099] In the method 200, the associated probability score is a Softmax output of a penultimate layer of the machine learning classifier. The method includes calibrating the probability score of the classification of the machine learning classifier 220 using the cascade reliability module 130; in some instances, the cascade reliability module 130 includes a regression operator. The method further includes receiving an error rate variable from a user (i.e., user input 141). The method finally includes displaying, as a report or display, the calibrated probability score, associated class, and/or the associated image for the classification of the machine learning classifier 230. [0100] In some instances, calibrating (e.g., confidence calibration module 131) includes temperature scaling, variational temperature scaling, or another regression operator. [0101] In some instances, the one or more classes of the classification prediction are ranked using the Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm (e.g., the conformal prediction module 132). The calibrated probability of the ranked classes is output up to one minus the error rate variable. [0102] In some instances of the method, the report or display includes the probability score and associated class. In other instances, the report or display includes the probability score, associated class, and the associated image. [0103] In some examples, the dataset includes a synthetically produced image or a plurality of synthetically produced images. For instance, the synthetically produced image(s) are produced using a GAN (e.g., GAN 110). In some instances, the machine learning classifier is trained using the plurality of synthetically produced images. In other instances, in addition to the previous example, the dataset includes augmented images or a plurality of augmented images, wherein augmenting includes adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. The machine learning classifier can be trained using the plurality of augmented images. [0104] In some examples, the machine learning classifier comprises a support vector machine (SVM), a neural network, a convoluted neural network (CNN), a densely connected CNN, and a residual network type classifier. In some examples, the machine learning classifier may utilize transfer learning, wherein a large dataset for general image identification is first used for training, and then a smaller dataset of object-specific images is used for further training. Attorney Docket No.10046-568WO1 [0105] In some examples, the method further comprises calculating a heat map of the dataset (e.g., vision-transformer model) and further comprises adding descriptive text to the output display (e.g., explainer module). In some instances, the descriptive text may be generated by a natural language model (e.g., LLM). [0106] The vision-transformer model is another machine-learning classifier based on transformer architecture. Transformer architecture relies on a parallel multi-headed attention mechanism, which breaks down images into small patches and gives higher weight to more important patches of the image. The vision-transformer machine learning classifier may comprise hidden layers, which may have embedded information to classify important patches of the image, i.e., providing visualization of parts of the image to which the model is paying attention. [0107] In some examples, color matching is used to train the vision-transformer model. The color matching may provide a unifying index of input image colors across any received image, which allows for the model to be device-agnostic with regard to the input images. In some examples, transfer learning is used in training with color matching for computer vision domain adaptation. [0108] For example, predictions from a pre-trained ML classification model can be visualized as to why it made its prediction. To do so, the most important pixels for the prediction are visualized (i.e., scored or scaled, then shown on a display). This can be done using standard methods such as attention maps, Shapley Values, or GradCAM. In some instances, the output of two or more methods is correlated to see where they concur to provide multiple sources of explanation to a user. [0109] Fig.4 shows the areas of high influence on the vision-transformer machine learning model in classification, highlighted in the form of a heat map. By leveraging the hidden layers, it is possible to extract information from the vision-transformer machine learning model output to provide visually interpretative information. This additional visual information is provided to clinicians for visual interpretation of the classifications. Coupled with the CRF, the exemplary system and method have become a powerful artificial intelligence assistance tool for diagnosis. The exemplary system and method may provide a set of possible labels, their associated confidences, and an overlayed heat map that guides the clinician toward the regions of interest that influenced the machine learning model’s decision, helping them to identify critical features and potential abnormalities. [0110] The vision transformer model may be coupled with a text-transformer model, as shown in Fig.5, which may add additional context to the classifications using descriptive Attorney Docket No.10046-568WO1 text. For example, after the system and/or method has generated the classifications and associated heat maps, a caption describing the kind of tumors or lesions detected in the image can be generated. Difficulties faced by the system and/or method, such as due to loss of information from noise and blur, partial contact, etc., will be reported, again indicating confidence of the predictions of the system and/or method. [0111] Example Cascade Reliability Framework (CRF) Module [0112] In critical applications such as cancer polyp diagnosis, it is not sufficient for the machine learning model to simply be correct; it must also be able to indicate whether its output is likely to be incorrect or not. In other words, the model should be able to tell the user whether it is unsure about a prediction or not. Of note, this can be captured by the model “confidence,” which is often erroneously taken to be the output of the penultimate SoftMax layer, which provides a value ranging from 0 to 1 for each class [32]. A high confidence should typically indicate less likelihood of the model being incorrect. However, this raw confidence is not an accurate representation of the ground truth likelihood of the model. This poses problems with interpretability because solely by looking at the model output, a user cannot know how sure the model is about a prediction. This is particularly of paramount importance for cancer polyp diagnosis applications in which incorrect detection of polyp type may lead to severe consequences or an unnecessary biopsy procedure. [0113] Fig.3A shows an exemplary system for uncertainty characterization of a pre- trained ML algorithm specifically developed for reducing the early detection miss rate (EDMR) of cancerous polyps. This architecture consists of three main modules, including a standard machine learning architecture integrated with the proposed Cascade Reliability Framework (CRF) (e.g., cascade reliability module 130), followed by an evaluation module. As shown, the dataset was split into three subsets: training set, holdout/calibration set, and test set. The augmented training set was used to train the base machine learning model. The uncalibrated SoftMax outputs of the base machine learning model were post-processed using variable temperature scaling (VTS), after which the conformal prediction framework was used to generate predictive sets. The test set was then used to evaluate the full model performance in terms of relevant metrics. [0114] Fig.3A shows an example partitioning of a plurality of images into a test set, a calibration set, and a training set. The training set of images is augmented by adding random noise, random blur, random rotations, random cropping, and vertical and horizontal flips before training a dilated residual network machine-learning model. The trained machine Attorney Docket No.10046-568WO1 learning model outputs to the CRF and independently to the confidence calibration module and conformal prediction module. The calibration set of images is also augmented by random noise and random blur before calibrating the machine learning model and CRF. In the CRF, the machine learning output may be passed to the confidence calibration module and/or the conformal prediction module. The output of the confidence calibration module may be passed to the conformal prediction module or the CRF report display with only a single identified class and associated calibrated probability. Alternatively, the conformal prediction module may output to the CRF report display, thereby providing all predicted classes up to the user-provided error rate and associated calibrated probabilities. In an optional implementation, the system, including the machine learning model and the CRF, may be evaluated using the set of test images to provide accuracy data, reliability diagrams, coverage, data, set size data, and class-wise comparisons. [0115] Specifically, the CRF combines two post-processing techniques of variational temperature scaling [32] [48] and conformal prediction [39] [40] [49] into a single cascade model that can be integrated with a pre-trained, independent machine learning model to quantify the uncertainty of the machine learning model output. Using the exemplary architecture, the benefits of both uncertainty characterization approaches were exploited to provide clinicians with two independent measures of reliability, which enhances the trustworthiness and explainability of any generic-type, pre-trained machine learning model. The following sections describe the components of the exemplary architecture in detail. [0116] Another exemplary and nonlimiting example of the system for uncertainty characterization of a pre-trained ML algorithm specifically developed for reducing the early detection miss rate (EDMR) of cancerous polyps is shown in Fig.3B. [0117] Temperature Scaling: The problem of matching the output confidence level with the ground truth correctness of the model is known as confidence calibration [32]. This is an essential step towards improving model interpretability as most of the deep neural networks are typically over-confident in their predictions [32]. Mathematically _{speaking, for input X with class labels ^^, if the predicted class is ^} ^{^} _{^, and ^} ^{^} _{^ is its associated} confidence, for a perfect calibration, the probability ^^ over the joint distribution is defined as:

[0118] where M is a Boolean, and M = 1 denotes a correct prediction matching the ground truth. Temperature scaling is one such method for confidence calibration. It is the extension of Platt scaling [50] that uses a single parameter T > 0 for all possible classes Attorney Docket No.10046-568WO1 of the SoftMax function. Guo et al. [32] have shown that temperature scaling is an effective method for confidence calibration. In the classical temperature scaling approach, given a logit vector z_i , which is the input to the SoftMax function σ_SM for the i^th class associated with it, the calibrated confidence prediction ^^^_ప taken over the k classes of the SoftMax is described as: ^^^ ^௭ ప ൌ max ^^_ௌெ ^ ^{^}^^{^^^} ^_்

[0119] where T is a learned parameter, and its objective is to “soften” out the outputs of the SoftMax layer. Of note, the learning of parameter T was obtained via different optimization functions [32]. The most common learning approach is the negative log-likelihood minimization [32]. [0120] Despite the features of the classical temperature scaling, this calibration method suffers from an intrinsic (epistemic) uncertainty [31] [48], which is essential to be estimated in critical applications such as cancer polyp diagnosis. To address this important issue, an advanced version of the classical temperature scaling, called variational temperature scaling (VTS), was introduced by Kuppers et al. [48]. This method is more reliable as it utilizes stochastic variational inference to build a calibration mapping that outputs a probability distribution rather than a single calibrated confidence, thus also quantifying the uncertainty of the calibration. The VTS places an uninformative Gaussian prior ^^^ ^^^ with high variance over the parameters T and infers the posterior given by

[0121] This captures the most probable calibration parameters given the model output Y and the ground truth correctness M. To obtain a distribution ^^^∗ as the calibrated estimate and knowing the posterior, a new input ^^^∗ can be mapped with the posterior predictive distribution defined by: ^^^ ^^^∗| ^^^∗, ^^, ^^^ ൌ _∬ ^{^} ି_^ ^^^ ^^^∗| ^^^∗, ^^^ ^^^ ^^| ^^, ^^^ ^^ ^^ (4) [0122] Of note, this equation was not determined analytically, so stochastic variational inference (SVI) was used as an approximation [31] [51] [52]. Evidence lower bound (ELBO) loss [31] [51] was used to optimize the parameters of this distribution to match the known true posterior. Finally, t sets of calibration parameters from T were sampled and used to obtain a sample distribution consisting of t estimates for a new single input ^^^∗. Attorney Docket No.10046-568WO1 [0123] Conformal Prediction: While using a calibrated neural network improved model interpretability, such models typically output only a singular point prediction. There was no further information provided as to what else the prediction could be, nor is there any quantification of the predicted label. Nevertheless, and as highlighted before, in critical applications such as cancer polyp diagnosis, it is important to complete the picture by providing alternate predictions to counter low confidence output. At the same time, if all the predictions together form a set that is mathematically guaranteed to contain the true label with a chosen error rate, clinicians can intuitively interact with the base machine learning model and be confident that they will not miss any diagnoses above a certain threshold. This approach ensures accuracy, reduces the chance of errors, and gives clinicians the ability to tune the level of trust they have in the model. [0124] Conformal prediction is a user-friendly framework for generating such statistically relevant uncertainty sets or intervals for predictive models. It is indeed the only set of algorithms that can provide such guarantees without assuming anything about the dataset other than that it is independent and identically distributed (i.i.d.). The method can also be seen as taking any heuristic notion of uncertainty in a model and converting it to a rigorous one [40]. A brief description is provided in this section, and further details may be found in [39] and [40], which are incorporated herein by reference in their entirety. [0125] Consider a dataset ^x, y^ where x is the set of data to be labeled, and y is a set of K possible labels. Suppose a classifier output estimated “probabilities” (SoftMax _{scores) for the K classes: ^} ^{^} _{^^ ^^^ ൌ ^0, 1^} ^{^} _{. Note that this probability may or may not} represent the actual ground truth correctness likelihood of the model. Reserving a reasonable number ^n^ of fresh i.i.d. data pairs unseen during training, the conformal prediction objective was to construct a set of prediction labels ^^^{^} ^^_௧^^௧ ^{^} ⊂ ^1, 2, ...K^ such that for a fresh ^{test point ( ^^௧^^௧, ^^௧^^௧) from the same distribution with α ൌ ^0, 1^ as a user-chosen error rate:}

[0126] This is known as the property of marginal coverage [39] [40], which states that the average probability of the prediction set containing the correct label is almost precisely 1 െ ^^ over the calibration and test points. It is worth mentioning that the user-chosen error rate ^^ is a special parameter for clinical applications, as it allows the clinicians to intuitively interact with a pre-trained machine learning model and establish and tune the level of trust that they have in the machine learning model. For instance, by setting the allowable error rate ^^ to 10% or 5%, the sizes of the prediction sets are Attorney Docket No.10046-568WO1 controlled, choosing the amount of information received from the model. As an example, allowing the model to have at most a 5% error would lead to a larger prediction set size when model confidence is not high. [0127] In the construction of a set of prediction labels C from ^^{^}^, a simple calibration step is needed utilizing a scoring function ^^^ ^^, ^^^, which encodes a worse agreement between x and ^^. The choice of the scoring function depends on which algorithm was _{used. The next step was to calculate ^^^ as the} ^{^^ା^^^^ିఈ^}

_{quantile of the calibration} ^scores ^{^^^= ^^^ ^^} _^ ^{, ^^} _^ ^{^, ... ^^^ =}

^{Finally, this quantile} _{can be used to generate prediction} sets for new test points ^^: ^^^ ^^_௧^^௧^ ൌ ^ ^^ ∶ ^^^ ^^_௧^^௧, ^^^ ^ ^^^^ (6) _{[0128] The simplest algorithm sets the scoring function as ^^^ ൌ 1 െ ^} ^{^} _{^^ ^^^^^^ (i.e.,} one minus the SoftMax output for the true class), which is hereto referred to as Naive Conformal Prediction (NCP). There are two considerations with this approach: (i) since the SoftMax outputs rarely represent the true class probabilities, NCP may not achieve coverage [32]; (ii) Additionally, NCP produces sets with the smallest average size and tends to undercover difficult subgroups and overcover easy ones [49]. Herein, NCP was compared with another algorithm called Regularized Adaptive Predictive Sets (RAPS) [49], which improves on a commonly utilized algorithm (i.e., Adaptive Predictive Sets (APS)) introduced by Romano et al. [53] in 2020. RAPS adds a regularization technique to the APS scoring function that tempers the noisy tail probabilities of a model. It is worth mentioning that the APS approach achieves coverage but has the disadvantage of producing much larger set sizes, hence requiring regularization [49]. [0129] The scoring function for APS is given by the following equation, which is the total probability mass of the set of labels more likely than ^^:

[0130] Taking o_x^y^ as the rank of y based on fˆ(x) (for example, if y is the second most likely label, then ox^y^ = 2), the new scoring function for RAPS becomes [49]: ^_^ ^{ᇱ^} _{^^, ^^} ^{^} _{ൌ ^^} ^{^} _{^^, ^^} ^{^} _{^ ^} ^{^} _{^^ ^^^௬ ∙ ^^ ^ λ.൫ ^^௫} ^{^} _^^ ^{^} _{െ ^^} ^ା ^_{^^൯ (8)} [0131] where ^… ^^ା denotes the positive part of ^ ^^_௫^ ^^^ െ ^^_^^^^ and λ, ^^_^^^ ^ 0 _{are regularization hyperparameters. The second term, ^} ^{^} _{^^ ^^^௬ ∙ ^^, is a randomized term ( ^^ is} chosen from the uniform distribution) to handle the discrete jump with the inclusion of Attorney Docket No.10046-568WO1 each new ^^. The rest of the algorithm is the same, involving the computation of the parameter ^^^ and then using the scoring function ^^^ᇱ^ ^^, ^^^ to generate predictive sets. [0132] In some aspects, the model may have deficiencies and/or over abundances of different classes, causing an imbalance in the data and propagating biases in the model. In such cases, additional predictive sets are generated based on the class distribution of the models. The additional predictive sets are added to the training data to balance the class distribution. The model may be tested by cross-validation before and/or after the additional predictive sets are included and the model trained on a class basis. [0133] Referring now to Figs.25, in an nonlimiting example, the generative model may include a class-specific unconditional diffusion process, where each diffusion model generates images directly for a single class configuration. In one example, the diffusion process is a Denoising Diffusion Probabilistic Model (DDPM). A pipeline for generating data for multiple classes is shown in Fig.26. [0134] Alternatively, a class-conditioned diffusion model may be used to generate a plurality of classes using a single model (see Figs.35A and 35B). Referring now to Fig. 35A, a class-conditioned diffusion model is trained with and without class label information. The inference of the class-conditioned diffusion model provides classifier-free guidance and can generate synthetic data of all possible classes. [0135] In some examples, the method of adding synthetic data to a real data set for testing and validation may be varied. For example, as shown in Fig.38A, synthetic images are added to each class such that all classes have an equal number of images (left image); synthetic images are randomly added to each class with no prior information (center image); or synthetic images are added to each class based on the number of images already present in each class (right image). [0136] Example Deep Learning Model Module [0137] Due to their ability to assuage the exploding gradient problem [23] [26], Residual networks (ResNets) are one of the standard model architectures used for cancerous polyp classification tasks [9] [13]. Skip connections utilized in ResNets also reduce the degradation problem, allowing for deeper, more complex models without affecting performance [23]. In order to extract the maximum possible amount of detail from the images, dilated convolutions were also explored [54] [55]. This technique expands the kernel field of view, making it more receptive to minute details while maintaining the spatial resolution of the feature maps [55]. This approach has turned out to be crucial to capturing the intricate textual details integral for pit-pattern classifications, where the Attorney Docket No.10046-568WO1 images are largely similar except for certain textures [5]. Coupling dilated convolutions with transfer learning approaches has shown promising results in the face of limited datasets [27] [28]. While conventional ResNets do not utilize dilated kernels, the effectiveness of using Dilated ResNets to capture and classify the textural details in datasets has been shown [56]. In this example, the machine learning model was pre-trained on the ImageNet database [57], which is a large general-purpose dataset consisting of more than 14 million images with over 20000 categories, then fine-tuned with a custom dataset, since these textural images are significantly different from everyday images. As shown in [58], the exemplary model, when tested on the custom dataset, outperformed state-of- the-art networks across clinically relevant statistical metrics such as accuracy (A), sensitivity (S), precision (P), etc. [0138] In this example, without loss of generality, the dilated residual network shown in Fig. 6 was used. The two mentioned post-processing techniques of temperature scaling and conformal prediction to attach reliability and trustworthiness to this model were applied, making the output clinically relevant and easy to discern. Of note, as shown in Fig. 1, the exemplary CRF was readily integrated with a standard machine learning architecture. Fig.6 shows the architecture of the dilated residual network described in the current example. The number of input channels and layer-wise values for the parameters n (number of filters), k (kernel sizes), d (dilation), s (strides), and p (padding) for each block are specified in the figure. Multiple values are shown in a block corresponding to each of the block’s convolutional layers. Default values were k=3, p=1, d=1, s=1, unless stated otherwise in the figure. The final predicted class is highlighted using a dashed box. [0139] The exemplary model featured: (1) a convolutional module comprising three convolutional blocks, each containing a 2D convolutional layer, a Batch Normalization layer, and a Rectified Linear Unit (ReLU) layer; (2) three Basic ResNet modules chaining two types of Basic ResNet blocks, one with a convolutional block with ReLU activation followed by two additional convolutional blocks without ReLU activation, and the other with a ReLU activated convolutional blocks followed by a non-activated ^{convolutional block; (3) an average pooling layer, and (4) a classifier composed of a 3 ൈ} _{3 convolutional layer that outputs to the four classes. The final Basic ResNet block} contained both padding and dilated convolutions with a factor of two each. More details about this architecture, the reasoning for these features, and its performance evaluation can be found in [58], which is expressly incorporated herein by reference in its entirety. Attorney Docket No.10046-568WO1 [0140] Additional examples of machine learning and artificial intelligence models that may be used instead of or in addition to the above example are described herein. However, the implementation of the above or any other machine learning component in the exemplary methods and systems is not limited to those described. The person skilled in the art will understand that any number of known machine learning and/or artificial intelligence models may be well-suited for a particular application and adapted appropriately. [0141] Machine Learning. The term “artificial intelligence” (e.g., as used in the context of AI systems) can include any technique that enables one or more computing devices or computing systems (i.e., a machine) to mimic human intelligence. Artificial intelligence (AI) includes but is not limited to knowledge bases, machine learning, representation learning, and deep learning. The term “machine learning” is defined herein to be a subset of AI that enables a machine to acquire knowledge by extracting patterns from raw data. Machine learning techniques include, but are not limited to, logistic regression, support vector machines (SVMs), decision trees, Naïve Bayes classifiers, and artificial neural networks. The term “representation learning” is defined herein to be a subset of machine learning that enables a machine to automatically discover representations needed for feature detection, prediction, or classification from raw data. Representation learning techniques include, but are not limited to, autoencoders. The term “deep learning” is defined herein to be a subset of machine learning that enables a machine to automatically discover representations needed for feature detection, prediction, classification, etc., using layers of processing. Deep learning techniques include but are not limited to artificial neural networks or multilayer perceptron (MLP). [0142] Machine learning models include supervised, semi-supervised, and unsupervised learning models. In a supervised learning model, the model learns a function that maps an input (also known as feature or features) to an output (also known as target or target) during training with a labeled data set (or dataset). In an unsupervised learning model, the model has a pattern in the data. In a semi-supervised model, the model learns a function that maps an input (also known as a feature or features) to an output (also known as a target) during training with both labeled and unlabeled data. [0143] Neural Networks. An artificial neural network (ANN) is a computing system including a plurality of interconnected neurons (e.g., also referred to as “nodes”). This disclosure contemplates that the nodes can be implemented using a computing device (e.g., a processing unit and memory as described herein). The nodes can be arranged in a plurality of layers, such as an input layer, an output layer, and optionally one or more hidden layers. An Attorney Docket No.10046-568WO1 ANN having hidden layers can be referred to as a deep neural network or multilayer perceptron (MLP). Each node is connected to one or more other nodes in the ANN. For example, each layer is made of a plurality of nodes, where each node is connected to all nodes in the previous layer. The nodes in a given layer are not interconnected with one another, i.e., the nodes in a given layer function independently of one another. As used herein, nodes in the input layer receive data from outside of the ANN, nodes in the hidden layer(s) modify the data between the input and output layers, and nodes in the output layer provide the results. Each node is configured to receive an input, implement an activation function (e.g., binary step, linear, sigmoid, tanH, or rectified linear unit (ReLU) function), and provide an output in accordance with the activation function. Additionally, each node is associated with a respective weight. ANNs are trained with a dataset to maximize or minimize an objective function. In some implementations, the objective function is a cost function, which is a measure of the ANN’s performance (e.g., an error such as L1 or L2 loss) during training, and the training algorithm tunes the node weights and/or bias to minimize the cost function. This disclosure contemplates that any algorithm that finds the maximum or minimum of the objective function can be used for training the ANN. Training algorithms for ANNs include but are not limited to backpropagation. It should be understood that an artificial neural network is provided only as an example machine learning model. This disclosure contemplates that the machine learning model can be any supervised learning model, semi-supervised learning model, or unsupervised learning model. Optionally, the machine learning model is a deep learning model. Machine learning models are known in the art and are therefore not described in further detail herein. [0144] A convolutional neural network (CNN) is a type of deep neural network that has been applied, for example, to image analysis applications. Unlike traditional neural networks, each layer in a CNN has a plurality of nodes arranged in three dimensions (width, height, and depth). CNNs can include different types of layers, e.g., convolutional, pooling, and fully connected (also referred to herein as “dense”) layers. A convolutional layer includes a set of filters and performs the bulk of the computations. A pooling layer is optionally inserted between convolutional layers to reduce the computational power and/or control overfitting (e.g., by down-sampling). A fully connected layer includes neurons, where each neuron is connected to all of the neurons in the previous layer. The layers are stacked similarly to traditional neural networks. GCNNs are CNNs that have been adapted to work on structured datasets such as graphs. Attorney Docket No.10046-568WO1 [0145] Other Supervised Learning Models. A logistic regression (LR) classifier is a supervised classification model that uses the logistic function to predict the probability of a target, which can be used for classification. LR classifiers are trained with a data set (also referred to herein as a “dataset”) to maximize or minimize an objective function, for example, a measure of the LR classifier’s performance (e.g., an error such as L1 or L2 loss), during training. This disclosure contemplates that any algorithm that finds the minimum of a cost function can be used. LR classifiers are known in the art and are therefore not described in further detail herein. [0146] An Naïve Bayes’ (NB) classifier is a supervised classification model that is based on Bayes’ Theorem, which assumes independence among features (i.e., the presence of one feature in a class is unrelated to the presence of any other features). NB classifiers are trained with a data set by computing the conditional probability distribution of each feature given a label and applying Bayes’ Theorem to compute the conditional probability distribution of a label given an observation. NB classifiers are known in the art and are therefore not described in further detail herein. [0147] A k-NN classifier is a supervised classification model that classifies new data points based on similarity measures (e.g., distance functions). The k-NN classifiers are trained with a data set (also referred to herein as a “dataset”) to maximize or minimize a measure of the k-NN classifier’s performance during training. The k-NN classifiers are known in the art and are therefore not described in further detail herein. [0148] A majority voting ensemble is a meta-classifier that combines a plurality of machine learning classifiers for classification via majority voting. In other words, the majority voting ensemble’s final prediction (e.g., class label) is the one predicted most frequently by the member classification models. The majority voting ensembles are known in the art and are therefore not described in further detail herein. Experimental Results and Additional Examples Example 1: CRC Polyp Study [0149] A study was conducted that developed and evaluated a machine learning system with a novel CRF utilizing two independent cascade layers of post-processing over a pre-trained ML algorithm for enabling informed and intuitive clinician-AI interactions. An ML model augmented by a cascade reliability framework (e.g., cascade reliability module 130) produced predictions that not only mathematically guaranteed a chosen error rate by clinicians but also attached a realistic measure of confidence in each prediction. Using an Attorney Docket No.10046-568WO1 augmented dataset consisting of 220 textural images of synthetic polyps (including non-ideal inputs containing noise and blur) fed to four different types of CRF models, it was demonstrated that the reliability and interpretability of the model by averaging performance metrics over 100 trials for each model configuration and data point. Reliability diagrams showed calibration and other metrics, such as coverage and average set size, to show the effectiveness of the conformal prediction layer. The calibrated CRF models handled non- ideal inputs with noise and blur until a threshold of σ = 16 for noise and σ = 20 for blur. Using the user-defined error rate α, clinicians are able to intuitively interact with the pre- trained deep learning model and decide the level of trust and the amount of information provided by this model. Therefore, by providing two layers of information, the clinician has greater freedom to intuitively adjust their level of trust to the deep learning model and then assess the trustworthiness of a prediction. [0150] Realistic CRC Polyp Phantoms Experiments. Without loss of generality of the exemplary method on the type of training data, various realistic CRC polyp phantoms were designed and additively manufactured to provide data for training the described deep learning model and evaluating the proposed framework [47]. To design the polyp phantoms, Kudo’s pit pattern classification [5] was followed, and four polyp types were fabricated, including Asteroid (A), Gyrus (G), Oval/Tubular (O), and Round (R) textural geometries. These pit patterns were chosen such that there was an even mix between non- cancerous (i.e., type A and R) and cancerous (i.e., type G and O) polyps. Ten geometric variations comprising four materials with varying hardness were added; thus, each of the 160 (= 4×10×4) polyp variations was unique. Figure 3A conceptually illustrates these variations as a tensor in which each specific polyp was indexed as

in which index i ∈ ^1, 2, 3, 4^, j ∈ ^1, 2, ...10^, and k ∈ ^1, 2, 3, 4^ represent tumor type, geometric variation, and hardness, respectively. Notably, the feature dimensions of the phantoms range from 300 to 900 microns, with an average spacing of 600 microns between pit patterns [5]. The polyps, designed in SolidWorks (SolidWorks, Dassault Systems), were printed using the J750 Digital Anatomy Printer (Stratasys, Ltd). Fig.7 shows the different material combinations used. These polyps were classified based on Kudo pit patterns, such as Asteroid, Gyrus, Round, and Oval. Also, a conceptual three-dimensional illustration of the textural image dataset collected by the HySenSe VTS on 10 variations of 3D printed polyp phantoms is shown in the figure. Each of the polyp phantoms was printed with four Attorney Docket No.10046-568WO1 different materials (i.e., DM-400, DM-600, A-40, and A-70). More details about the fabrication of polyp phantoms are found in [58]. [0151] Vision-based Tactile Sensor (VS-TS) and Data Collection. The dataset used in this example was collected using a novel VS-TS [15], [47], which consists of (I) a deformable silicone membrane that directly interacts with the target surface, (II) an optical module (Arducam 1/4 inch 5 MP camera), that captures the minute deformations of the gel layer in case of interaction with a texture, (III) a transparent acrylic plate providing support to the gel layer, (IV) an array of Red, Green and Blue LEDs to provide internal illumination for depth perception, and (V) a rigid frame supporting the entire structure. The VS-TS operates on the principle that the deformation caused by the interaction of the deformable membrane with the CRC polyps’ surface can visually be captured by the embedded camera. More details about the fabrication and functionality of this sensor are found in [15], [47]. [0152] In this example, the HySenSe sensor [47] was used in the experimental setup. To simulate a full and partial contact between the sensor and polyps that might happen in a realistic diagnosis setting, textural images of polyp phantoms were captured at two contact angles (0° and 45°) with a force of 2 N in the vertical direction. Of note, 0° mimics a complete interaction between the sensor and polyp (see Fig. 3 and Figs. 15A-15E), whereas 45° captures the incomplete interactions, which limits the textures visible in the image (see Fig.15B and Fig.15E). All 160 unique polyps were used for the 0° orientation, whereas half of the variations j from each polyp class i chosen randomly across the four materials j were used for the 45° orientation. Rejecting the unusable images resulted in 229 samples constituting the final dataset, with 57, 57, 55, and 60 visuals for classes A, G, O, and R, respectively. [0153] Dataset Pre-Processing. A 75/25 split was used for the dataset, meaning that 75% (174 samples) of the dataset was used for training the model via stratified 5-fold cross-validation, while 25% (55 samples) were reserved for model evaluation and post- processing. The HySenSe textural visuals were cropped and centered to include only the polyp texture of interest and downsized to 224 × 224 in order to improve the model performance. Geometric data augmentation techniques, namely random cropping, horizontal and vertical flips, and random rotations between -45° and 45°, each with an independent occurrence probability of 0.5, were applied during training. Additionally, to enhance the model’s ability to generalize, as shown in Figs. 8A and 8B, Gaussian blur and Gaussian Attorney Docket No.10046-568WO1 noise were introduced with strengths σ ranging from 1 to 256 for blur and σ from 1 to 50 for noise. This step was based on the results of previous work [38], which suggested that training the model on blurry and noisy data seems to improve the calibration of confidence levels for a prediction. The maximum levels were chosen such that they are well beyond the worst case the model may encounter in a clinical setting. The probability of the Gaussian transforms was also kept at 0.5 for each sample. [0154] The 55 samples of the holdout dataset were augmented using noise and blur in order to increase the size of the dataset. Four subsets were constructed: Group 1, consisting of “clean” images without any transformations applied to the samples; Group 2, with each of the 55 samples incorporating random levels of “Gaussian Blur;” Group 3, with each of the samples incorporating random levels of “Gaussian Noise;” and Group 4 with all the images experiencing a random “combination of both Gaussian Blur and Gaussian Noise.” These four subsets were then combined to obtain the final expanded holdout dataset. The maximum noise and blur levels were limited to values of σ to 32 and 30 for blur and noise, respectively, in order to simulate a more real-world clinical setting. Note that these groups are not used separately for testing, they are only used as a combination such that there is no bias towards any one kind of uncertainty (noise or blur). The reason for introducing noise and blur as data augmentation techniques was to capture uncertainty in imaging within the data itself. In other words, as shown in a previous study [38], if the model has already encountered imperfect images during training, it is less likely to be incorrect about a prediction that has noise or blur. [0155] Evaluation Metrics: The study performed evaluations for a number of metrics, including accuracy, reliability, coverage, and set size. [0156] Accuracy: Accuracy is a classical statistical metric. It is the fraction or percentage of predictions that the model has classified correctly. Mathematically, for a set of data ^x, y^ with N samples where ^^^_ప is the prediction, and ^^_^ is the ground truth, accuracy acc is defined as below:

[0157] Reliability Diagrams: The degree of calibration of a model is intuitively represented by reliability diagrams [32], which are accuracy versus confidence histograms. By grouping model predictions into discrete confidence bins and computing the average accuracy per bin, the expected sample accuracy can be plotted as a function of confidence. Attorney Docket No.10046-568WO1 As is expected, if the model is perfectly calibrated, the reliability diagram plots the Identity function. Miscalibration can, therefore, be seen as deviations from the Identity function. [0158] Taking Bm as the set of indices in the m^th confidence interval considered _{over M equally spaced confidence bins, with n as the} ^{number of samples, the average} ^{accuracy of B} _m ^{can be calculated} _{as [32] [37]:}

[0159] The average confidence within the set Bm, taking ^^^_ప to be the confidence of sample i is [32]:

[0160] Of note, these values are equivalent to a perfectly calibrated model. [0161] Coverage: Coverage refers to the property of a prediction set that represents the proportion of true labels that this set contains. More specifically, it indicates the probability that the prediction region includes the true label value. Mathematically, for a set of test points ( ^^_௧^^௧, ^^_௧^^௧), coverage is given by:

[0162] To determine class-wise coverage of the CRF model, the coverage is calculated separately for each label. [0163] Average set size: This metric is useful to look at the average size of the sets generated by the conformal prediction algorithm. Both too-large and too-small prediction sets lead to a loss in the interpretability of the model. Since there were four possible labels (i.e., four CRC polyp types), having an average set size of one or three negates the advantages that the conformal prediction has to offer, making it difficult to gain useful information from the model. For example, if, for most uncertain predictions, the model outputs three of the four possible labels, it can only be inferred that the fourth type is not likely. Similarly, if the set size is always one, the clinician is deprived of additional information in terms of alternate predictions. [0164] Evaluation Process Workflow. As conceptually shown in Fig.3, to thoroughly evaluate the performance of the CRF in integration with the dilated residual network, four different CRF models were compared, including: (CRF-1) NCP only, (CRF-2) RAPS only, (CRF- 3) NCP + VTS, and (CRF-4) RAPS + VTS at three different error rates of α = 20%, 10% and 1%. T he CRF models were evaluated at these three representative error rates to demonstrate the limits of what the error rate can be. Of note, the first two models Attorney Docket No.10046-568WO1 (i.e., CRF-1 and CRF-2) did not include the calibration step, shown in Fig.3, whereas in the CRF-3 and CRF-4 models, the outputs of the dilated residual network were first calibrated and then was cascaded through the Conformal Prediction block. Through these different architectures, it was determined what positive effects temperature scaling had on the performance of the CRF models. The CRF models were also validated, and the better- performing conformal prediction algorithm was determined for the application. To this end, using the custom validation dataset, the coverage and average set size metrics were calculated. All values were reported by running n ൌ 100 trials, with a 75/25 split used for the holdout/calibration and validation dataset for each trial. Fig.3 and Figs.9A-9C represent and summarize the steps taken to train and evaluate either of these 4 CRF models integrated with the Dilated residual network. [0165] Calibration Results and Discussion. To represent the deep learning model’s calibration using reliability diagrams [32], the average accuracy per confidence interval can be calculated using Equation (10) and plotted versus confidence. Figs.10A and 10B illustrate such reliability diagrams for the considered uncalibrated (i.e., CRF-1/2) and calibrated (CRF-3/4) models. Both CRF-3 and CRF-4 employ the same VTS algorithm for confidence calibration and thus have identical plots. Additionally, the models without VTS post-processing (CRF-1 and CRF-2) perform identically to the base Dilated ResNet model and exhibit its characteristics with respect to the calibration. [0166] As can be observed from the confidence histograms in Figs. 10A-10B, the uncalibrated models CRF-1/2 have a slight gap between average accuracy (78%) and confidence (80%). Of note, this slight difference is due to the introduction of uncertainty (i.e., noise and blur) within the dataset and including them during the training process of the dilated residual network. This result also agrees with a previous study [38] in which, by performing independent analyses on the effects of noise and blur on the training data and the confidence of the predicted outputs, it was shown that introducing uncertainty during the training step would reduce the degree of miscalibration of the base model. As can be seen in Figs.10A-10B for the calibrated models (i.e., CRF-3/4), upon utilization of the VTS calibration procedure, this slight accuracy and confidence gap was almost completely removed. The deviations from perfect calibration are either reduced or explained by the uncertainty in calibration (represented by error bars for each confidence interval in Fig. 10B), highlighting an important advantage of the VTS step. It is worth emphasizing that moving from a single confidence number to a probability distribution of the calibrated confidences allows the model to account for these deviations by including them Attorney Docket No.10046-568WO1 within the range of epistemic uncertainty. Thus, the temperature-scaled models (i.e., CRF-3 and CRF-4) achieve close to perfect calibration and provide accurate confidence values for each prediction, allowing clinicians to reliably accept or reject model outputs based on high or low confidence numbers. [0167] Additionally, to evaluate the robustness of the considered four CRF models in the presence of a lack of detail caused by noise or blur in the obtained textural images, the average accuracy and confidence versus increasing levels of noise and blur were plotted in Figs.11A-11B. As can be observed, with the increase in the levels of noise and blur, the degree of miscalibration for both the non-calibrated (CRF-1/2) and calibrated (CRF-3/4) models increased. Particularly, at σ ൌ 16 for blur and σ ൌ 20 for the noise levels (shown in Figs. 8A and 8B), the accuracy-confidence gap increases to 10%. Further, at these uncertainty levels, the accuracy drops to below 70%. Of note, such analysis will help to better understand the capability of the diagnosis framework (i.e., both VS-TS and Deep Learning model) and assist clinicians with their level of trust and reliability towards this framework for CRC polyp classification. [0168] Choice of Hyperparameters for the RAPS Algorithm. As shown in Fig. 3 and Figs.9A-9C, once the model is calibrated, the next step is to introduce the conformal prediction step. While the NCP approach did not have any parameters that needed to be _{tuned, it was not the case for the RAPS. The RAPS algorithm had two parameters} ^{(i.e., λ} ^{and k} _reg ^{) that were optimally tuned. To find such} _{an optimal set of parameters, a} sensitivity analysis study was performed. In this experiment, the model performance of the two CRF models utilizing RAPS (i.e., CRF-2 and CRF-4) was evaluated in terms of _{coverage and average set size for all combinations of λ ∈} ^{^0.0001, 0.001, 0.01, 0.1, 1^} ^{and k} _reg ^{∈ ^0, 1, 2, 3, 4^. All values} _{are reported by averaging 100 trials for each of the 25} combinations. Figure 8 shows the results of the performed 2500 experiments. _{[0169] As can be observed, although convergence of the RAPS approach} ^was ^{independent of the tuning parameters (i.e., λ and k} _reg ^{) and for} _{all performed experiments,} the algorithm always converged, certain combinations led to larger predictive sets than the others (shown in Fig.12A). It was found that larger average predictive set sizes led to a loss in interpretability. In such cases, it became more likely that the model outputs two or three types of tumors for every prediction regardless of the associated confidences, thus making the decision-making very difficult and uncertain. Also, no discernable trends were found for average coverage (shown in Fig. 12B) over the hyperparameter space. Attorney Docket No.10046-568WO1 _{Nevertheless, choosing a higher value of λ along with an early cutoff parameter} ^k _reg ^{of 0} ^{or 1 led to much larger average set sizes than for other} _{combinations, which had mostly} similar performance. Therefore, for the rest of the analysis in this paper, the _{hyperparameters were set} ^{at λ ൌ 0.1 and k} _reg ^{ൌ 1. A t this combination, the CRF} _models were able to reach an average coverage of 0.905, while keeping the average set size at 1.7. [0170] Overall CRF Model Performance Analysis. To thoroughly study the performance of the proposed reliability framework, the changes in the evaluation metrics for all four CRF models were analyzed. Especially the two conformal prediction algorithms (i.e., NCP and RAPS) were taken into account and their user-defined error rate parameter α. Note that changing α was equivalent to setting the level of trust towards the pre-trained deep learning model from the clinicians’ perspective. In other words, a high α means that the base deep learning model output was reliable enough to be trusted as is, while a lower α corresponds to lower trust, which was followed by providing more information to the clinician in terms of an alternate prediction set and confidence numbers. From another perspective, the ability to intuitively interact with the CRF model and tune the parameter α demonstrates the level of trust in the pre-trained ML model or the level of conservativeness of the clinician to rely on the output of such models in detecting CRC polyps. [0171] As the first step, the accuracy of the base Dilated ResNet model was realized using Equation (9) to be equal to 80% over the validation dataset. Of note, this value was also inferred from the confidence histogram generated for CRF-1/2 (Fig. 10A), which plotted both average accuracy and average confidence. An accuracy of 80% means that the error rate of the base model was equal to 20%. This error rate was a property of the base model and cannot be changed unless training is redone. It is to be noted, therefore, that using the conformal prediction algorithms, the base deep learning model coupled with the CRF was able to reduce this error rate to a chosen level by generating prediction sets guaranteed to contain the true label. To verify this assertion, the average coverage and average set size were plotted for each of the four CRF models for three different user-chosen error rates (α) of 20%, 10%, and 1%. The results are presented in Tables 1, 2, and 3 and graphically compared in Fig. 13. Table 1. Performance results for the four CRF models with an error rate of α = 20%.

Attorney Docket No.10046-568WO1

Table 2. Performance results for the considered four CRF models with error rate of α = 10%.

Table 3. Performance results for the four CRF models with an error rate of α = 1%.

Attorney Docket No.10046-568WO1

[0172] As can be seen at α ൌ 20%, the average coverage (Fig.13B) was observed to reach 80%, with an average set size of 1 (Fig.13A) across all four CRF models. This was to be expected, as the conformal prediction framework only generates larger sets in order to guarantee the error rate. Since the accuracy of the base model was already 80%, there was no need to add another label to improve the coverage. In other words, this case was equivalent to the performance of the base model without CRF. Additionally, as mentioned above, choosing such a high alpha value demonstrated a less conservative clinician with a high level of trust in the result of the base model in detecting cancer polyps. [0173] Upon reducing the α to 10%, the set size increased for both frameworks. However, as can be observed in Figs.13A-13B, there was no difference due to the presence of the VTS layer. Both CRF-1 (non-calibrated) and CRF-3 (calibrated) models had an average set size of 1.5, whereas CRF-2 (non-calibrated) and CRF-4 (calibrated) models had an average set size of 1.7 (Fig.13A). This increase in the set size was an expected result since the conformal prediction framework was being asked to reduce the error rate (and increase the reliability of the deep learning model) to a level that was lower than that of the base model. The larger average set size of RAPS-based models (i.e., CRF 2 and CRF 4) versus NCP-based models (i.e., CRF 1 and CRF 3) was also an advantage of this algorithm, making it more likely to include a second prediction in the set when the base model was under-confident. Of note, these results also agreed with the previous literature on the average set size generated by these two methods [39] [49] [53]. [0174] Finally, on reducing α to 1% (or drastically reducing the trust in the base deep learning model’s output in correctly classifying the CRC polyp types), the average set size increased to over three for all CRF models (Fig. 13A). This is again consistent with the expectations that by setting the error rate to be significantly lower than that of the base model, any interpretability is sacrificed, which may potentially be gained by the conformal prediction framework. [0175] Overall, choosing an appropriate error rate is critical to maximize the reliability and usefulness of the predicted sets. Moreover, the addition of the VTS layer before conformal prediction does not degrade the performance of either of the algorithms while providing an additional layer of information to the clinicians. It is noted that changes Attorney Docket No.10046-568WO1 in the evaluation metrics for both conformal prediction algorithms as the error rate α is changed. Moving from 20% to 1%, the interpretability increases and then again decreases, as an average set size of 1 is just as useful (or not useful) as an average set size of 3. Between these upper and lower bounds, however, the clinician is free to intuitively interact with the pre-trained model and continuously choose an error rate as per their choice, depending on how much trust they have in the model output (or how conservative they are in using an AI algorithm for sensitive CRC diagnosis application), and how much information they wish to receive from it. [0176] Inter-class Performance Analysis of the CRF models. The class-wise coverage of the four CRC polyps (shown in Fig.7) using the four CRF models are summarized in Tables 1, 2, and 3 and visualized in Fig. 14. Analyzing these tables and figures reveals an interesting trend in the detection of the CRC polyps using the CRF models. Independent of the CRF model, the CRC polyp types A and R, which are both non-neoplastic (i.e., non- cancerous), are over-covered in the prediction sets, while types G and O, which are both neoplastic (i.e., cancerous), are under-covered. Referring to the values presented in Tables 1, 2, and 3 together with Fig. 14, given a prediction of non-cancerous polyps (i.e., type A or R) when the error rate is set to α = 10%, the clinician can be 99% sure that the generated predicted set contains the true polyp type. For cancerous polyps (i.e., type G or O), this number reduces to 82%. At the other α values of 20% and 1%, similar discrepancies are also seen, as the coverage for non-neoplastic polyps is 92% and 100% and that for neoplastic polyps is 70% and 98%, respectively. [0177] Here also, it is noted that for both conformal prediction algorithms, the addition of the VTS layer has no impact on the class-wise coverage. Due to the nature of this split between over-covered and under-covered classes, it is sufficient to say that the CRF model is well-suited to reliably distinguish between cancerous and non- cancerous polyps, and that the addition of VTS has no negative impact on performance. [0178] Comparison of synthetic datasets and attention visualizations. Referring now to Figs.17A-17D, correlation of synthetic and real data for asteroid, gyrus, oval, and round polyp types using t-distributed stochastic neighbor embedding (Fig.17A); multidimensional scaling (Fig.17B); principal component analysis (Fig.17C); and uniform manifold approximation and projection (Fig.17D) are shown. Synthetic and real data points cluster together, with synthetic data filling in the gaps between real data. This means that synthetic Attorney Docket No.10046-568WO1 data is a good representation of the real data. Correlation plots of testing and training data are shown in Figs.18A-18D. [0179] Testing of validation accuracies was carried out using a set of 182 real data for training and 55 real data for testing. Augmented training data was varied by adding X% synthetic data to real data, where X = [0, 0.1, 0.2, 0.3, 0.4, 0.5]. Fig.19A shows validation accuracies for different X values over time. It can be seen that training speed increases as the number of synthetic samples increases. Fig.19B shows accuracy for test datasets of different X values tested on blurry, noisy, and combined data. It can be seen that models trained on more synthetic samples have better performance on noisy and blurry data. [0180] Attention Visualization Method Testing. Comparisons of points of interest and heat maps generated by attention maps with self-attention (Fig.20A), GradCam (Fig.20B), SHAP Deep Explainer (Fig.21A), SHAP Kernel Explainer (Fig.21B). It was found that attention maps with self-attention, SHAP Deep Explainer, and SHAP Kernel Explainer are class specific. GradCam is class agnostic and produces the same heatmaps for each class. Attention maps produce the clearest visualizations. SHAP methods may miss points of interest and are computationally expensive. [0181] Clinical Interpretation of the CRF Models. As depicted in Fig. 3, the framework, which provides two independent cascade layers of reliability to a basic deep learning model, assists clinicians with an intuitive, reliable, and tunable tool for early-stage diagnosis of CRC polyps. The conformal predictive layer generates predictive sets that are guaranteed to contain the true polyp type with an adjustable error rate α, while the confidence calibration attaches a realistic number denoting the confidence to each predicted label. Thus, looking at these two layers together provides sufficient information for clinicians to decide whether or not the outputs of the base deep learning model are trustworthy. A few representative outputs of the proposed framework are presented in Figs. 15A-15F, generated using CRF-4, and the user-chosen error rate α = 10%. Of note, without loss of generality, these outputs clearly demonstrate the features of the proposed Cascade Reliability Framework. [0182] As shown in Figs. 15A-15F, the Cascade Reliability framework outputs a set including the predicted polyp types as well as their corresponding confidences. For example, Fig. 15A represents a case in which it labels the textural image with polyp types O and G as the predictive set, with their confidences being 57% and 32%, respectively. This indicates that although the model is more confident about the polyp being type O due to its relatively low level of confidence (i.e., 57%), it would be prudent to consider the other Attorney Docket No.10046-568WO1 label G as well. Figure 15B shows only partial contact has been made between the VS- TS and the polyp, and only part of its surface is visible. As shown, the CRF-4 model determines a predictive set including polyp type O (i.e., non-cancerous) and R (i.e., cancerous), both with low confidences of 40% and 40%. Of note, an investigation of the textural output and Fig. 7 also verifies the similarity between these two tumor types, making their distinction very difficult for the framework or the clinician. For this case and considering the framework output confidences and tumor types, the clinician is encouraged and has the option to perform a manual inspection and perhaps a follow-up biopsy procedure. This specific case clearly highlights the importance of the dual-layer reliability framework in providing complementary and intuitive information to the clinician. In Fig. 15C, the predictive set is again generated with two classes, A and G. However, the confidence attached to tumor type G is 83%, and that of A is only 13%. This indicates that G is the most likely prediction, and indeed, by visual inspection and checking the ground truth, that is the case. Both the cases of Fig. 15D and Fig. 15E have only one label in the predictive set, which indicates that other labels are extremely unlikely. Complete illumination and proper contact contribute to this high confidence. Fig.15F is similar to Fig.15C, with a predictive set consisting of types R and A with respective confidences of 73% and 20%. Therefore, Type R is considered to be the most likely and, indeed, is appropriate considering the ground truth. [0183] Overall, looking at the confidence values informs the clinicians whether or not to consider the polyp types other than the one which is most likely, thereby reducing the chances of error and the EDMR of CRC polyps as well as performing unnecessary biopsies. [0184] Discussion of Example System for CRC Diagnosis: According to the World Health Organization, Colorectal Cancer (CRC) is the third most prevalent cancer in both men and women, with an estimated 1.9 million cases and 935,000 deaths worldwide in 2020 [1]. Early detection of pre-cancerous lesions via colonoscopy, the gold standard for CRC screening, can significantly reduce patient mortality to about 10% [2]. Studies have shown that the morphological characteristics (such as shape, size, and texture) of CRC polyps observed during colonoscopy screening can be used to classify their types and also indicate the neoplasticity of a polyp [3], [4]. One of the classification systems developed based on these morphological characteristics observed during colonoscopy screening was introduced by Kudo et al. [5]. In this classification approach, utilizing magnified images obtained through colonoscopy, clinicians can inspect the surfaces of lesions to identify pit patterns, shown in Fig. 7 , including the Type I (Asteroid), Type II (Gyrus), Type III Attorney Docket No.10046-568WO1 (Oval), and Type IV (Round). Of these, Types I and IV are considered to be normal/hyperplastic, whereas Types II and III are neoplastic (i.e., with high cancerous potential). Nevertheless, these lesions can have a high degree of variations in morphological characteristics and visual appearances, making their detection process complex and examiner-dependent [6] [7]. Apart from this, the colonoscopy screening process itself also suffers from limitations in maneuverability, visual occlusions, and dependence on the expertise of the clinician [8]. These issues have contributed to an early detection miss rate (EDMR) as high as 20% and suggest a need for the development of new technologies to aid in the diagnostic process to reduce EDMR [9] [10]. [0185] Computer-aided diagnostics and artificial intelligence (AI) methods have shown promise in reducing the EDMR of CRC by improving the accuracy of screening tests and assisting endoscopists in detecting lesions during colonoscopy [11] – [14]. Early methods for polyp detection and classification utilizing classical computer vision techniques such as hand-crafted feature extraction and wavelet-based methods [12] [14] may achieve limited success due to the vastly different changes in illumination, occlusion, and appearances of polyps during colonoscopy [11]. With the advent of AI, algorithms such as support vector machines (SVM), k-nearest neighbors (k- NN), ensemble methods, random forests, and convolutional neural networks (CNN) [15]–[18] have shown promising results for the task of analyzing medical images (such as colonoscopy videos or CT scans) to detect and classify suspicious lesions automatically either in real-time [19]–[21], or as a form of post-processing once the data is collected during a colonoscopy [16]. More advanced and complex network architectures have also been considered for this task, such as Residual networks (ResNet), Densely connected CNNs (DenseNet), and AlexNet [11] [14] [22] [23] [24] [25]. Among these, the ResNet architecture has shown the most promising results in part due to its generalization qualities and ability to reduce the effects of the vanishing/exploding gradients [23] [26]. However, such models tend to perform poorly on limited datasets due to overfitting, impairing the model’s usefulness and reliability in clinical settings [27] [28]. They also react poorly to imbalanced datasets, which are more likely to be available as compared to balanced datasets, especially in the medical field [11]. Due to the difficulties in obtaining clinical records and medical images for generating datasets easily, transfer learning via deep neural networks trained on publicly available large datasets (such as ImageNet) has become a widely popular technique [11] [29]. These techniques transfer image recognition subtasks learned on the larger datasets to another Attorney Docket No.10046-568WO1 smaller, related dataset, whereas the two datasets do not need to belong to the same distribution [11] [29]. Transfer learning has been extensively used for polyp classification tasks on colonoscopy images [11] [29] [30]. [0186] To report the performance of the above-mentioned models, most of the literature typically utilizes standard statistical metrics such as accuracy, sensitivity, and precision [11] [27] [28] [29] [30]. However, a review of the literature supports the assertion that by using these statistical metrics, researchers mainly focus on the correctness of the algorithm, not on potential risks related to its use [31] [32] [33] [34]. This is of importance for sensitive medical applications such as CRC polyps’ diagnosis, as it is essential to reduce EDMRs. More generally, in AI use cases in settings with the potential for serious harm to people, requirements such as reliability, assurance, transparency, and meaningful estimates of confidence may aid clinicians in making more intuitive and informed decisions. [0187] Guo et al. [32] first brought to light the idea that modern deep neural networks are miscalibrated, meaning that there is a difference between the predicted SoftMax outputs and the ground truth probabilities. In other words, it is often erroneously assumed that the output of the prefinal classification layer (i.e., the SoftMax layer) is a representation of the model confidence [31] [35]. Additional processing is, therefore, necessary for these SoftMax outputs to be used as measures of confidence. To this end, Guo et al. [32] proposed simple post-processing techniques for calibration, but other techniques (which involve modifying the loss function) have also been explored, including but not limited to, the difference between confidence and accuracy (DCA) [33] and dynamically weighted balance (DWB) [34] methods. With regards to CRC polyps’ detection, Carneiro et al. [36] used temperature scaling to calibrate a deep neural network for classification based on colonoscopy images, while Kusters et al. [37] used the trainable methods of DCA and DWB for the same purpose. In a previous study [38], the use of confidence-calibrated neural networks was explored, which provided an accurate measure of the probability of correctness of any prediction while finding types of CRC polyps. Using the post-processing method of temperature scaling presented by Guo et al. [32], this is achieved by attaching a confidence metric to each prediction. This technique significantly improved the interpretability of the model, which could better help a clinician to know how much to trust the algorithm’s inference. Through reliability diagrams, we also showed that the temperature-scaled Softmax outputs can better represent the model prediction accuracy than an uncalibrated model. Attorney Docket No.10046-568WO1 [0188] Despite the effectiveness of a calibrated network in providing accurate information about the level of certainty of the prediction compared with the typical use of neural networks, calibrated networks often solely provide a singular classification as the output and do not provide any information on what else the prediction could be. In other words, these networks may paint an incomplete picture by providing a singular point prediction and not a set of predictions, and if the network prediction is wrong or the prediction’s confidence is too low for the clinician to accept, there is no further information to inform the clinician. Nevertheless, for low-confidence predictions, the clinician may need to consider other potential predictions. Therefore, if the AI algorithm provides the clinician with a set of predictions, it can assist them to have a more intuitive interaction with the model and make more informed decisions. Particularly, when applied to the critical application of CRC diagnosis, this interaction and additional information may reduce EDMR as well as the need for performing additional biopsy procedures to increase certainty. Overall, these motivating factors highlight the need for moving from a typical point to a set of predictions backed by the probability of the correctness of these outputs. [0189] To address the above-mentioned requirements, conformal prediction has recently been introduced as a framework to not only generate predictive sets but also to have the property of the predictive sets mathematically guaranteed to satisfy a required (user- defined) confidence level [39] [40]. In other words, the conformal prediction framework generates a set of predicted labels that are guaranteed to contain the true label with a user-specified error rate. This makes the system highly adaptive and problem-specific. Additionally, as mentioned before, it provides an extra layer of clinician-specific control by letting them choose the error rate and, by extension, how much they trust the model outputs (or, conversely, how conservative they are about using AI for sensitive CRC diagnosis applications). The framework has been recently explored within the context of medical imaging and diagnostics. A comprehensive literature review by Vasquez et al. [41] reveals that several algorithms, such as Inductive Conformal Predictors [42], Dynamic Conformal Predictors [43], Inductive Confidence Machine [44], Mondrian Conformal Predictors [45], Label-Conditional Mondrian Conformal Predictors [46], have all been used in varied applications. For example, Papadopoulos et al. [46] explored the use of Mondrian Conformal Predictors for stroke risk estimation based on ultrasound carotid images. This study relied on data from clinical and demographic variables along with ten features extracted from images. Additionally, Alnemer et al. [45] demonstrated that conformal prediction corrections always improved accuracy, sensitivity, and precision in the context Attorney Docket No.10046-568WO1 of breast cancer survivability, regardless of the classifier used. Luo et al. [43] also introduced the concept of Dynamic Conformal Prediction over an SVM-based classifier to detect arrhythmias. [0190] Overall, a review of the literature reveals that in prior studies, either confidence calibration or conformal prediction has been used as the sole method of uncertainty quantification. Nevertheless, as mentioned, either of these approaches has its own non-intersecting benefits that make their use beneficial for the clinical and particularly CRC diagnosis application. Therefore, a Cascade Reliability Framework (CRF) has been presented for informed and intuitive clinician-AI interaction in CRC diagnosis that is both calibrated and capable of forming set predictions. As opposed to the traditional one-layer uncertainty quantification (e.g., [36] [37] [46] [45]), the CRF combines confidence calibration and conformal prediction as two independent layers in which, as shown in Fig.1, the outputs from the calibration layer (i.e., the temperature scaling layer) are cascaded into the conformal prediction layer. Of note, each of these layers provides useful information (i.e., realistic probability values from confidence calibration and actionable set predictions from conformal prediction) to significantly increase the explainability, interpretability, intuitiveness, and level of trust for the clinicians. More specifically, by attaching calibrated confidence estimates to each predicted label in a set generated for a particular input image, clinicians are informed about the true likelihood of each label while also mathematically guaranteeing that the true label is contained within the predictive set within a predefined clinician-selected error rate. Moreover, generating a set of predictions using conformal prediction can also inform the clinician about the next most likely label, thus further reducing the chances of a cancerous polyp being missed during diagnosis. Example 2: Enhancing Colorectal Cancer Diagnosis Through Generative Models and Vision-Based Tactile Sensing: A Sim2Real Study [0191] A problem faced by Deep Learning approaches is having access to labeled data. Gaining access to labeled data poses a significant challenge, particularly in medical image analysis, given the time-consuming and expensive nature of annotating medical images, a task that demands specialized expertise [21]. Consequently, exploring alternative strategies for obtaining extensive and well-balanced datasets becomes imperative. One viable approach involves the creation of synthetic images through the use of generative models such as Generative Adversarial Networks (GANs) [22], diffusion models [23], Variational Attorney Docket No.10046-568WO1 Autoencoders (VAEs) [24], and flow-based methods [25]. Synthetic data has the advantage of being inexpensively labeled and diverse. Furthermore, any combination of data may be generated to represent specific scenarios as per task-specific requirements. These datasets also offer a potential solution to challenges related to privacy concerns and can serve as a means to navigate ethical and legal obstacles associated with sharing image data [21, 26]. It’s important to emphasize that the quantity of synthetic data needs to be determined heuristically based on the specific requirements of the task. Notably, in the realm of colorectal cancer, the exploration of synthetic data generation has primarily focused on histopathological images [27, 28]. [0192] To address the aforementioned issues regarding cancer screening during colonoscopy,a Vision-Based Tactile Sensor, called HySenSe [9], had been developed, and its performance evaluated through various experiments [9], [29], [30] (See Fig.23). As shown in Fig.7, instead of typical colonoscopy images, HySenSe generates high-resolution textural images of the CRC polyps. The main issue with using this unique sensor, however, is the lack of access to sufficient images for training a machine learning algorithm to sensitively and accurately classify CRC polyps. Such textural images also cannot be reproduced through publicly available means. This issue was partially circumvented by opting to use realistic 3D-printed polyp phantoms to train the ML classifiers; however, the polyp fabrication process is labor-intensive and time-consuming, making it difficult to make a larger dataset at a time when large, balanced datasets are a requirement for training most modern deep- learning based architectures. Due to these reasons, we look for a solution in synthetic data augmentation techniques, specifically generative models for image generation. To address this important limitation, the use of generative models to augment HySenSe textural images is proposed. In particular, UNet2D-based Denoising Diffusion Probabilistic Models (DDPMs) [31] were used for this purpose, as well as training dedicated generative models for each possible class and extending this concept to features present across classes (such as the degree of contact of the polyp with the sensor). Several classification models with different complexities were trained, and their performance was evaluated, with the caveat that training was done on synthetic DDPM-generated images and testing only on real images (i.e., performing a Sim2Real). Through the following results and various metrics, the effectiveness of utilizing synthetic images during training and the limitations of the complexity of the classifier model required when synthetic images are involved were demonstrated. [0193] Materials and Methods Attorney Docket No.10046-568WO1 [0194] Realistic Polyp Phantoms. As described above, to facilitate the training and evaluation of the models, various realistic CRC polyp phantoms were designed and additively manufactured based on the details described in Venkatayogi et al. [9, 29]. These phantoms were then used to generate the synthetic dataset. In designing these polyp phantoms, Kudo’s pit pattern classification was followed [5], and four polyp types, including Asteroid (A), Gyrus (G), Oval/Tubular (O), and Round (R) textural geometries, were fabricated. Of note, in usual clinical settings, Kudo’s classification comprises five distinct polyp types. Among these, Types I, II, and IV exhibit unique pit patterns Asteroid (A), Round (R), and Gyrus (G), respectively, and were treated as separate classes. Subtypes IIIS and IIIL were combined into the Oval (O) class, and Type V was considered to be an arbitrary mix of the other four classes. Furthermore, since types A and R are non-neoplastic (non-cancerous), and types G and O are neoplastic (cancerous), an even mix of cancerous versus non-cancerous polyps in the dataset was ensured. With ten geometric variations across four materials with varying hardness, 160 (= 4×10×4) unique polyp variations were designed and planned. The conceptual representation of these variations is depicted in Fig.7. Here, each distinct polyp is denoted as P(i, j, k), where i ∈ 1, 2, 3, 4 signifies the tumor type, j ∈ 1, 2, ...10 denotes geometric variation, and k ∈ 1, 2, 3, 4 indicates hardness. The feature dimensions of the phantoms range from 300 to 900 µm, with an average spacing of 600 µm between pit patterns. Of note, in the example above [9, 30], it was shown that the VS-TS sensor was able to detect these minute textural features of realistic polyps regardless of their classification standards (e.g., Kudo or Paris classifications) and textural patterns. The polyps, designed in SolidWorks (SolidWorks, Dassault Systemes), were printed using the J750 Digital Anatomy Printer (Stratasys, Ltd). More details about the polyp fabrication process can be found in Venkatayogi et al.[29]. [0195] Vision-based Tactile Sensor (VS-TS) and Data Collection. The dataset used in this study was collected using a VS-TS, called HySenSe. As shown in Fig.23, feature 2380, the sensor consists of (I) a soft silicone membrane interacting directly with the target surface, (II) a small camera (Arducam 1/4 inch 5 MP) that captures the minute deformations of the silicone layer, (III) a transparent acrylic plate providing support to the silicone layer, (IV) an array of Red, Green and Blue LEDs 2340 to provide internal illumination for depth perception, and (V) a rigid frame supporting the entire structure. The HySenSe relies on capturing the deformations caused by the interaction of the soft deformable membrane with a target surface using the embedded camera. Importantly, the HySenSe sensor is capable of delivering high-fidelity textural images with consistency across various attributes such as Attorney Docket No.10046-568WO1 surface texture, hardness, type, and size of polyps. This capability holds true even at extremely low interaction forces. These qualities position the sensor as an ideal tool for capturing intricate textural details of CRC polyps, enabling the application of a complementary ML algorithm for stiffness classification. Further information regarding the fabrication and operational characteristics of this sensor, specifically in the context of obtaining textural features and stiffness classification, is available in the previous example [9], [32]. Recently, based on this sensor, a vision-based tactile sensing balloon that can be integrated with existing colonoscopy devices for performing CRC screening was developed [33], [34]. [0196] For this study, the experimental setup shown in Fig.23 was used. The experimental setup comprised several components, including (2310) the VS-TS HySenSe, (2320) a precision linear stage with 1 µm precision used to attach and push polyps onto the deformable gel layer of HySenSe (M-UMR12.40, Newport), (2330) a Digital Force Gauge (Mark-10 Series 5, Mark-10 Corporation) with 0.02 N resolution employed to measure the interaction forces between the gel layer and individual polyps, (2340) a Raspberry Pi 4 Model B for video recording, data streaming, and further image analysis. MESUR Lite data acquisition software (Mark-10 Corporation) was used to record the forces. To simulate full and partial contact between the sensor and polyps, textural images of polyp phantoms were captured at two contact angles (0° and 45°) with a force of 2 N in the vertical direction. Notably, at 0°, the sensor replicates a full interaction with the polyp, while at 45°, it captures partial interactions, thereby constraining the observable textures in the image. All 160 unique polyps were used for the 0° orientation, whereas half of the variations from each polyp class chosen randomly across the four materials were used for the 45° orientation. After rejecting the unusable images, the final dataset consisted of 229 samples, with 57, 57, 55, and 60 visuals for classes A, G, O, and R, respectively. [0197] Generative model for Data Augmentation. A significant challenge in developing AI models for the medical domain is the scarcity of large and balanced labeled datasets, primarily stemming from the complexities associated with obtaining patient records [13]. This problem becomes even more critical if a new imaging device and modality (such as the disclosed VS-TS) is introduced to the clinical community. To address the issue of a balanced dataset for the sensor, 3D-printed realistic polyp phantoms were created and utilized. These phantoms enabled the capture of textural images using the disclosed VS-TS, eliminating the reliance on real patient data. However, the process of designing and manufacturing polyps itself is intricate and time- consuming, making it challenging to Attorney Docket No.10046-568WO1 generate the substantial volume of data typically demanded by modern data-intensive deep network architectures. Furthermore, extensive datasets play a crucial role in mitigating overfitting and ensuring the model’s ability to generalize across unfamiliar inputs. Given the limitations of an available balanced dataset, synthetic data generation techniques to augment our dataset were explored. [0198] In this study, a Denoising Diffusion Probabilistic Model (DDPM) architecture [23] was used for generating textural images akin to the output of our HySenSe sensor. A version that utilizes a UNet2D model [31] coupled with a Scheduler model was also used. Both these models and the overall architecture were retrieved from Huggingface [35]. Diffusion models, which are neural-network-based models trained to predict less noisy images from noisy inputs, offer the ability to convert random noise into images during inference [31]. Although initially designed for image segmentation, the UNet architecture was favored in diffusion models due to its matching input-output dimensions owing to its U- shaped encoder-decoder structure. In this work, an input/output image size of 128 × 128 pixels was used. To facilitate the training of a UNet-based diffusion model, it was essential to employ a scheduler [23]. This scheduler iteratively refined a noisy signal using the UNet Denoiser, simultaneously learning a conditional distribution that maps the current image to the target data. The condition, in this context, can take the form of image similarity, text, or labels. Of note, in this particular study, image similarity was used as the conditioning factor. Subsequently, this learned distribution was applied for the purpose of image generation. Data generated using only image similarity with no conditioning was unlabeled, and the model produced unlabeled random variations of the input dataset. However, generating clinical images alone was insufficient; labelling the data enhanced its utility [36]. In this study, because the number of possible polyp types was fixed and known, the naïve approach of unconditional image generation utilizing an ensemble of DDPMs was used so as to fix the ground truth labels for images generated by each DDPM. [0199] Referring now to Fig.25, the sequential steps in noise-controlled data generation and the progressive refinement of synthetic images, highlighting the model’s capacity for capturing intricate details in the data, are shown. The top sequence shows the forward diffusion process, executed by a scheduler, where a clear image gradually becomes noisier until it turns into random noise. The lower sequence depicts the reverse diffusion process, facilitated by a UNet2D architecture, where the noise is progressively reduced to reconstruct the original image. The block on the right depicts the conditioning criteria used to Attorney Docket No.10046-568WO1 learn the denoising process. In this work, the conditioning is performed solely on the input images [0200] Classification Models. To assess the usability and ML architecture agnostic feature of the generated synthetic data, three ML-based classification models were considered: Dilated ResNet, standard ResNet [18], and VGG [16]. The first considered architecture was a dilated ResNet, which has been explored previously for the purpose of classification of cancerous polyps [29]. Given their capacity to alleviate the exploding gradient problem [16], [37], Residual Networks (ResNets) have become a common architectural choice for CRC polyp classification tasks [38]. The integration of skip connections in ResNets further addresses the degradation problem, enabling the construction of deeper and more complex models without compromising performance [16]. Dilated convolutions have been explored towards maximizing detail extraction from images [39]. This technique broadens the kernel field of view, enhancing sensitivity to minute details while preserving the spatial resolution of feature maps [39]. This proves critical for capturing intricate textual details essential for pit-pattern classifications, particularly in scenarios where images exhibit significant similarities except for certain textures [5]. The combination of dilated convolutions with transfer learning approaches has demonstrated promising results in dealing with limited datasets [19], [20]. Although conventional ResNets do not typically utilize dilated kernels, as described in the previous example, the architecture shown in Fig.6 show-cased the effectiveness of employing dilated ResNets for capturing and classifying textural details in the dataset. [0201] The other two architectures considered for the purposes of comparison—a standard ResNet [18] and VGG [16]—are both commonly used for image classification tasks in the medical imaging domain [40]. Of note, the classification heads of both these models were modified to account for the number of classes in the dataset. In this study, all three considered models underwent pre-training on the ILSVRC subset of the ImageNet [41] database, encompassing 1,281,167 training images, 50,000 validation images, and 100,000 test images across 1000 object classes. This pre-training phase enabled the models to acquire general image features, which were then directly leveraged for the specific downstream task. Subsequently, each model underwent fine-tuning using the custom dataset, employing these general-purpose learned features as a starting point. This fine-tuning step was essential due to the unique characteristics of textural images in the dataset, necessitating additional training. [0202] Testing Attorney Docket No.10046-568WO1 [0203] Synthetic Data Generation. In this work, unconditional image generation was used to generate the additional synthetic data. In this approach, each polyp class was further divided into images with whole contact and partial contact and then trained dedicated diffusion models for each subclass. In other words, one diffusion model generated images of a particular polyp type, mimicking a scenario of either whole contact or partial contact. This ensured that all generated images had fixed ground truths. Each DDPM corresponding to a sub-class was trained for a maximum of 1500 epochs. During training, a generated validation set of images was visually evaluated, and a checkpoint was saved every 100 epochs. Once the validation set of generated images (at a particular checkpoint) reached acceptable levels of visual fidelity, the weights at that checkpoint were used for the final diffusion model. The final trained models were used to generate 600 images for each subclass. This resulted in a total of 4800 evenly mixed synthetic textural images across the four possible polyp types with whole and partial contact. [0204] Classifier Performance. Different models were used to determine the efficacy of utilizing synthetic data during training and to confirm that the dataset was agnostic to the kind of classifier used. Specifically, by varying the amount of synthetic data used, it was observed that the effect on training time and overall model accuracy was affected. The steps to perform this analysis are provided in Algorithm 1 (Fig.24). Data augmentation techniques were utilized for training all models. The textural visuals from HySenSe were cropped and centered to isolate the specific polyp texture of interest and then downsized to 224 × 224 pixels to enhance model performance. In the case of the synthetic images (at 128 × 128 pixels), linear upscaling was utilized to bring them up to the same input shape. Following this, geometric data augmentation techniques, including random cropping, horizontal and vertical flips, and random rotations between -45° and 45°, each with an independent occurrence probability of 0.5, were applied. To further improve the model’s generalization capabilities, Gaussian blur and Gaussian noise were introduced with strengths σ ranging from 1 to 256 for blur and σ from 1 to 50 for noise. This decision was influenced by findings from previous work [42], which indicated that training the model on blurry and noisy data can enhance the calibration of confidence levels in predictions. The selected maximum levels were chosen to exceed the worst-case scenarios the model might encounter in a clinical setting. The probability of these Gaussian transforms was also fixed at 0.5 for the entire dataset. [0205] All the models were trained using the same hyperparameters in order to maintain consistency during evaluation. A batch size of 32 was used, and the learning rate Attorney Docket No.10046-568WO1 was fixed at 0.0001. The optimizer was also fixed to be the Adam optimizer [43]; all models were trained to 50 epochs each; checkpoints were saved after each epoch. The best- performing weights for each model configuration (model architecture and amount of synthetic data used) were selected for further comparison. [0206] Results [0207] Generative Model Performance. For this work, a qualitative evaluation of the synthetic images generated by each DDPM for all eight subclasses was performed. Visual inspection of these images, as illustrated in Fig.27, revealed the production of high-quality synthetic images that were largely indistinguishable from real textural images. It was observed that clear pit-patterns were visible, and the RGB lighting characteristic of the sensor output was also present. This observation held particularly true for the models designated exclusively for flat images, where the images generated were sharp with clearly distinguishable features. Comparatively, synthetic images generated for partial contact exhibited blurriness and lack of definition when compared to the generated flat images. Furthermore, it was observed that the models had difficulties in recreating the shapes of the polyps; they often generated images that lacked a definite shape even though the textural pattern itself was sufficiently captured. Importantly, the corresponding models also took longer to train in order to reach acceptable levels of visual fidelity. Nevertheless, this intentional introduction of less-than-ideal images serves the purpose of enhancing model generalizability. Consequently, all generated images were used for training. Overall, the generative models were capable of producing synthetic data that was comparable to real images, and thus, they were used for training the candidate ML classifiers. [0208] Effect of Synthetic Data on Model Training. The training and validation accuracy curves for each model trained on X percentages of synthetic data, where X ∈ {0, 10, 20, ...90, 100}, is depicted in Fig.28. The metrics of the results of this experiment are also summarized in Tables 4, 5, and 6, where “best performance” refers to a model that has high validation set accuracy and minimum overfit. A discernible trend emerged from the results: as the proportion of synthetic samples utilized for training increased, the training speed experienced a significant acceleration. Furthermore, the models were also able to reach a higher peak validation accuracy. Looking at the plots in Fig.28, when relying solely on real textural images, all models (1) reached saturation at around 40 to 50 epochs and (2) overfitted on the training set. This is evident from the gaps between the validation accuracy versus training accuracy curves in both cases. As synthetic data was gradually introduced into the training set, not only did the final model validation accuracy increase, but the time taken to Attorney Docket No.10046-568WO1 reach training saturation also reduced. Considering the case of the dilated ResNet, at X = 10%, the model accuracy reached a peak of around 93%, with saturation reached at 30 epochs instead of 40 epochs. Increasing X to 20% allowed the model to reach 96% accuracy by only 20 epochs. This trend of increased model accuracy and reduced training time persisted with successive additions to the training dataset when considering the other two models as well. For the ResNet [18], when no synthetic data was used, the training time was around 20 epochs, with a peak validation accuracy of 90%. After introducing synthetic data, however, saturation was reached much quicker and after fewer epochs. It was also noticed that beyond X = 20%, the gains were marginal. Similarly, without synthetic data, the VGG [16] reached a saturation validation accuracy of 90% at 20 epochs. Adding synthetic data again increased the training speed to less than 10 epochs. These trends are more clearly noticeable when considering the number of epochs to reach 60%, 80%, and 90% validation accuracy, as reported in Tables 4, 5, and 6. A consistent decrease in the required number of epochs across all three models was observed as the amount of synthetic data increased during training. A vast majority of the models breached the 60% threshold under 5 epochs when training with less synthetic data, and this was reduced to just a single epoch for X > 40%. After X = 40%, the 90% threshold was also breached at under 10 epochs. This is a clear indication of the increased speed in training as more and more synthetic data was introduced. [0209] At this point, there must also be a discussion on the amount of overfitting for each model. Considering the Dilated ResNet, when there was no synthetic data involved, while the validation accuracy did reach 90%, the training accuracy was almost 99%. This implied substantial overfitting, indicating that training on limited data was susceptible to biases and poor generalized performance. This was an expected result. While overfitting can be mitigated by using early stopping at a point where the difference between training and validation accuracy is less than a predetermined threshold, model performance is sacrificed in doing so. As synthetic data was introduced into the mix, this difference was also reduced along with an overall increase in validation accuracy. This is an indication that the model was no longer overfitting. Similar trends were observed in the other two models, ResNet18 and VGG16, as well. Since the models achieved saturation quickly, an appropriate checkpoint was selected that simultaneously had high validation accuracy and less overfit. These results are reflected in Tables 4, 5, and 6. Therefore, utilizing the synthetic data allowed the model to learn the more general features (such as pit patterns specific to each CRC polyp type) without sacrificing overall model performance. Attorney Docket No.10046-568WO1 [0210] Finally, the amount of time required for training decreases with the increased proportion of synthetic data, which is depicted in Fig.29, which plots the validation set accuracy of each of the models versus the amount of synthetic data used during training. Note that in this graph, for each value of X, the model weights were chosen from the set of saved checkpoints such that validation accuracy was maximized, and overfit was minimized. All three models reached their maximal achievable performance at X = 40%; there were no gains in accuracy after this point. Additionally, beyond X = 40%, the dilated ResNet reached saturation at under 10 epochs with a peak accuracy above 96%. Similarly, both ResNet18 and VGG16 achieved saturation within 5-7 epochs of the fine-tuning process. Therefore, using more than 40% of the total number of available synthetic images (which amounts to 1920 synthetic textural images), specifically for the textural dataset and CRC polyp classification problem, is not required. Beyond this point, the gains are marginal and not worth the cost of computation. [0211] In conclusion, this study demonstrates the application of generative models to create realistic textural images of CRC polyps, addressing the critical need for diverse and balanced datasets in medical machine learning. This approach involved the training of generative models on existing medical data and the subsequent generation of synthetic samples, demonstrating successes in enriching dataset quality. Of note, the use of synthetic images augments classification performance and diminishes model biases. These findings hold great significance, particularly for compact, resource-constrained medical devices, where efficient yet accurate models are paramount. Table 4. Performance of Dilated ResNet versus X% Synthetic Data

Table 5. Performance of the ResNet18 with X% Synthetic Data Attorney Docket No.10046-568WO1

Table 6. Performance of VGG16 with X% Synthetic Data

Example 3: Robot-Enabled Machine Learning-Based Diagnosis of Gastric Cancer Polyps Using Partial Surface Tactile Imaging [0212] In this example, the disclosed Vision-based Tactile Sensor (VTS) and a complementary Machine Learning (ML) algorithm were used for classifying gastric polyp tumors to address the existing limitations on endoscopic diagnosis of Advanced Gastric Cancer (AGC) Tumors. By leveraging a seven-degree-of-freedom robotic manipulator and unique custom-designed and additively-manufactured realistic AGC tumor phantoms, the advantages of automated data collection using the VTS addressing the problem of data scarcity and biases encountered in traditional ML-based approaches are demonstrated. The synthetic-data-trained ML model was successfully evaluated and compared with traditional Attorney Docket No.10046-568WO1 ML models utilizing various statistical metrics even under mixed morphological characteristics and partial sensor contact. [0213] Introduction [0214] Gastric cancer (GC) is the fifth most commonly diagnosed cancer worldwide and the fourth leading cause of cancer-related mortality [1]. A major contributor to this challenge is the fact that a substantial portion—up to 62%—of GC cases are detected at advanced stages, contributing to poorer overall survival rates compared to cases identified at early stages [2]. Upper endoscopy is the primary method for the initial detection of GC lesions as it allows for an inside view of the gastric tract lining where tumors originate. At the advanced GC (AGC) stages, tumors have infiltrated the muscularis propria [3] and can be identified and classified through their morphological characteristics (i.e., their geometry and texture) visible through the images provided by an endoscope. Borrmann classification [3] is a common approach used by clinicians to morphologically classify GC polyps into four types of polypoid (Type 1), fungating (Type 2), ulcerated (Type 3), and infiltrating or Flat (Type 4) (see Fig.31). Nevertheless, inter-class variance of each type of polyps and solely relying on morphology of the GC polyps in Borrmann classification has resulted in a high-degree of disagreement and inconsistency in decision- making among clinicians [4]. Therefore, long- term specific training and experience is needed to detect GC properly using endoscopic images and Borrmann classification [5]. Furthermore, similar to many vision-based endoscopic diagnoses (e.g., colonoscopy and laparoscopy), the limited resolution of endoscopic video cameras, visual occlusions, lack of sufficient steerability of the endoscopic devices, and lighting changes within the body make reliable diagnosis of GC polyps even more challenging [6]. [0215] To address the above-mentioned limitations, Artificial Intelligence (AI) methods utilizing Machine Learning (ML) have been employed in different modalities, such as histopathological images or endoscopic videos, to detect and classify GC tumors. For example, Li et al. [7] used a custom Deep Learning (DL) based framework for automatic cancer identification from histopathological images. Using the same modality, Huang et al. [8] developed an in-house DL approach—GastroMIL—for differentiating between cancerous and healthy tissue. For endoscopic videos, Hirasawa et al. [9] used a CNN-based Single Shot MultiBox Detector to successfully detect cancerous lesions. Taking a step further to address the limitations of traditional endoscopy, Xia et al. [10] used a magnetically controlled capsule endoscope coupled with an ML model for the in vivo classification of GC. A common limitation of such ML approaches is the limited availability and access to large, balanced Attorney Docket No.10046-568WO1 datasets [11]. This has also been recognized by The American Medical Association, which passed policy recommendations in 2018 for identifying and mitigating bias in data during the testing or deployment of AI/ML-based software to prevent introducing or exacerbating healthcare disparities [12]. DL approaches especially require large amounts of data to be able to generalize over unknown inputs. This is particularly important in the medical domain, where there are several hurdles to obtaining patient data, such as time dependence, availability of relevant clinical cases, and privacy concerns [13]. Limited data can lead to (1) spectrum bias and (2) overfitting [13]. Furthermore, in the case of AGC tumors, there is a very high degree of inter-class variance in the morphological characteristics, which can vary considerably amongst patients, hindering the preparation of a rich, well-balanced dataset [4]. This issue of limited data can be partially mitigated through transfer learning, which has now become a staple of modern DL frameworks [14]. However, the limitation of class imbalance persists through this technique. This is important to consider since even if datasets mirror the real-world distribution, rarer cases may not have adequate representation. In such scenarios, where rare cancer cases are both inherently infrequent and underrepresented in the data, ML models may struggle to learn the distinct features associated with these cases. [0216] To address the aforementioned limitations of existing endoscopic procedures, the disclosed Vision-based Tactile Sensor (VTS) (shown in Fig.30) is used with the surface tactile imaging modality for early diagnosis of colorectal cancer (CRC) polyps [15], [16], [17]. As opposed to normal endoscopic images, VTS provides high-resolution, approximately 50 µm textural images of the polyps that can improve their classification. However, the size of CRC polyps is smaller than AGC tumors (i.e., ∼1 cm×1 cm versus ∼4 cm×4 cm) [18], [3]. As shown in Fig.30, since the sensing area of the VTS is limited, only partial textural images of AGC tumors can be captured using VTS, making their data collection and classification very challenging. [0217] To collectively address these limitations and develop an ML-based diagnostic assistance for AGC using the disclosed VTS, the present example utilizes the VTS in classifying AGC tumors using their textural features; a complementary ML-based diagnostic tool that leverages this new modality to sensitively classify AGC lesions; and a robot-assisted data collection procedure to ensure the ML model is trained on a large and balanced dataset. The ML models are trained on partial textural data semi-autonomously collected from 3D- printed AGC tumor phantoms. Statistical metrics are used during evaluation to show that the proposed ML models can reliably and sensitively classify the AGC lesions even under mixed morphological conditions and partial tumor coverage. Attorney Docket No.10046-568WO1 [0218] Materials and Methods [0219] Vision Based Tactile Sensor (VTS). In this study, the previously disclosed VTS called HySenSe was used, as outlined in [21], to acquire high-fidelity textural images of AGC tumor phantoms. As depicted in Fig.30, the HySenSe sensor comprises: (I) a flexible silicone membrane interacting directly with polyp phantoms, (II) an optical module (Arducam 1/4 inch 5 MP camera) capturing minute deformations in the gel layer during interactions with a polyp phantom, (III) a transparent acrylic plate offering support to the gel layer, (IV) an array of Red, Green, and Blue LEDs for internal illumination aiding depth perception, and (V) a sturdy frame supporting the entire structure. Operating on the principle that the deformations resulting from the interaction between the deformable membrane and the surface of AGC tumors can be visually captured, the HySenSe sensor provides high-fidelity textural images, demonstrating proficiency across various tumor characteristics such as surface texture, hardness, type, and size [17]. This capability was maintained even at extremely low interaction forces, as detailed in [21]. Additionally, due to the arrangement of the LEDs within the sensor, different deformations had different lighting. This means that if the interaction force is fixed, the textural images implicitly encode the stiffness characteristics of the surface in contact as well. Due to these advantages, in the previous examples and in [15], [16], [17], this sensor has been used to differentiate between classes of CRC polyps, which are also distinguished by their morphological characteristics, namely the surface texture presented and the polyp stiffness. However, at this point, it must be noted that CRC polyps and AGC tumors differ greatly in size (i.e., 1 cm×1 cm versus 4 cm×4 cm). Since building a larger HySenSe sensor is impractical in a clinical setting, the sensor coverage area became a limiting factor. In its current form, the area coverage is about 4 cm² (see Fig.30, 3040). Therefore, it is impossible to entirely cover an average AGC tumor. To address this issue and cover a sufficient amount of data, a semi-autonomous robotic system was employed to sequentially cover the whole AGC surface and collect data. [0220] Realistic Tumor Phantoms. Towards addressing the limited availability of large, balanced clinical datasets in medical imaging and given this new imaging modality, realistic approximations of AGC tumors were designed and manufactured (see Fig.31). To design the phantoms without loss of generality and evaluate the performance of the utilized VTS, different types of Borrmann’s classification system were used. The four types specifically are Type I: fungating type; Type II: carcinomatous ulcer without infiltration of the surrounding mucosa; Type III: carcinomatous ulcer with infiltration of the surrounding mucosa; Type IV: a diffuse infiltrating carcinoma (linitis plastic) [3]. It is known that the Attorney Docket No.10046-568WO1 stiffness of the affected area is more than that of the surrounding regions, which makes it easier for clinicians to differentiate between tumorous sections and healthy tissue [3]. However, this classification has considerable overlap (especially between Type II and Type III) due to the mixed morphological characteristics of these lesions, making a manual diagnosis through observation difficult [4]. [0221] Fig.31 illustrates a few representative fabricated AGC lesion phantoms and their dimensions. To avoid the issue of data imbalance, in contrast with real patient datasets, each tumor class was equally represented in the dataset by designing 11 variations of each class (total 44 polyps). As shown in Fig.31, based on the realistic AGC polyps, the designs were first conceptualized in Blender Software (The Blender Foundation) to make use of the free-form sculpting tool, then imported into Solidworks (Dassault Systemes) in order to demarcate the different regions with varying stiffness (tumor versus healthy tissue). The high-resolution, realistic lesion phantoms were manufactured using a Digital Anatomy Printer (J750, Stratasys, Ltd) and materials with diverse properties: (M1) Tissue Matrix/Agilus DM 400, (M2) a mixture of Tissue Matrix and Agilus 30 Clear, and (M3) Vero PureWhite. Hardness measurements were obtained using a Shore 00 scale durometer (Model 1600 Dial Shore 00, Rex Gauge Company). M1 has Shore hardness A 1-2, M2 has A 30-40, and M3 has D 83-86. These differing material properties allowed the tumor sections (using M2) to be made stiffer than the surrounding healthy tissue (using M1). M3 was used to print the supporting rigid backplate to be mounted onto the robot flange. Each tumor was printed across a working area of 3 cm×3 cm, which represents the lower end of AGC tumor sizes. [0222] Experimental Setup and Robotic Data Collection. In the previous examples utilizing HySenSe for CRC polyp classification (see also [15], [16], [17]), the high-fidelity textural images were captured manually using a setup including a force gauge mounted on a linear stage. This manual and tedious procedure limited the data collection capabilities, which were not efficient for this study. Furthermore, AGC tumors are much larger than CRC polyps, making it impossible for the VTS to capture the whole textural area of the tumor phantoms. To overcome this limitation, a robotic manipulator was used to automate the image collection procedure. Using the robotic arm shown in Fig.30, many different variations of partial contact from different angles were captured to form the dataset, which allowed the trained ML model to be more generalized. [0223] The experimental setup for data collection is illustrated in Fig 30 and consists of the following: (1) Robot Manipulator: a KUKA LBR Med 14 R820 (KUKA AG) was used, which has seven degrees of freedom (DoF), a large operating envelope, and integrated Attorney Docket No.10046-568WO1 force sensing. An ROS as a bridge, with the iiwa_stack project presented in [22], was used to provide high-level control of the onboard Java environment. The workspace (2) included an arm rigidly attached to a worktable. An optical table allowed consistent positioning of items in the robot’s coordinate frame. The HySenSe sensor (3), which was manufactured in house, using the methodology provided in [21], and attached to the optical table. Images were captured by the camera, a 5MP Arducam with a 1/4” sensor (model OV5647, Arducam). Raspberry Pi 4B (4) controlled the camera over the Arducam’s camera ribbon cable. The Raspberry Pi ran Python software that continuously listened for an external ROS message trigger, which caused it to capture a 2592 × 1944 image and publish it to ROS. An adapting sample mount was 3D printed in PLA with one side attached to the robot flange, and the other side offered two locating pins and four M2 screw holes, which ensure repeatable position and orientation of samples. Polyp phantom samples (6) were constructed as described above, and each was attached, in turn, to the sample mount using M2 screws. Command and controls (7) included an Ubuntu 20.04 system to control the ROS components. This computer ran the roscore master and the high-level Python command scripts. [0224] To collect textural images with HySenSe, the robotic manipulator was commanded via ROS to position the phantom of interest into random positions with different angles of contact while maintaining the interaction force under a threshold of 3 Newtons using the arm’s internal force sensor. As opposed to previous manual procedures performed in Example 1, the only manual step involved was installing each target AGC phantom onto the sample mount on the robot end effector, and the remainder of the process was automated in software, reducing the time and workload required. [0225] Calibration and Registration Procedures. While the apparatus was assembled carefully, there was no way to ensure precise positioning to the sub-millimeter level. This level of accuracy was necessary to successfully automate the action of pressing down an AGC tumor phantom onto the HySenSe gel with a measured and limited force of interaction. Thus, a registration step was performed to find the unknown transformation matrices amongst the different components. The test setup was divided into five frames of reference, visualized in Fig.30, with the objective of determining the transformation matrices between these frames. Some of these transformations were known. The transformation from robot base (R) to robot flange (B), TRB, is provided through the KUKA software. The transformation from the camera frame (C) to a known target (T ), TCT, was calculated by using a standard checkerboard calibration method. The intrinsic matrix follows the pinhole camera model described in [23], and the 10-parameter distortion model follows [24] and [25]. The Attorney Docket No.10046-568WO1 remaining transformations were unknown, including the ones from the robot base to the camera frame (TRC), from the robot’s flange to the tumor phantom (TBT), and from the camera frame to the plastic plate that holds the HySenSe gel (TCH). [0226] Using the checkerboard calibration images as data, AX = XB calibration was performed using the separable method [26]. This provided the estimates of TRC and TBT. The resulting registration was still noisy, and fine-tuning was done with a simple grid pattern, adjusting the robot's position until it was centered and square in the camera image. The final part of the registration was finding TCH by bringing a flat sample plate down until it was within 1 mm of the HySenSe backing plate, with the gel removed, and checking that it was parallel with a feeler gauge. [0227] Dataset and Pre-processing. This semi-automated data collection setup enabled the collection of 50 variations of orientation and contact of the AGC tumor phantoms with the HySenSe in 44 experiments (one for each polyp), leading to a total of 2200 images in the textural image dataset, with 550 unique images in each class. This dataset was then split into training and test sets while ensuring that the split was performed at a tumor level and not the image level. In other words, all textural images belonging to one tumor were kept within the same split. This was done in order to ensure the model was not being evaluated on partially seen data. This resulted in 1600 images from 32 unique tumors in the training set and 600 images from 12 unique tumors in the test set. The output from HySenSe, with original dimensions of 2592 × 1944 pixels, was resized to a uniform size of 224 × 224 pixels, ensuring consistent input dimensions for all models. To enhance the algorithm’s generalization capability, a range of geometric transformations, including random cropping, vertical and horizontal flips, and random rotations spanning from -45° to 45°, were incorporated as augmentations. Further, to enhance the generalizability of data augmentation, Gaussian blur, and Gaussian noise with strengths σ ranging from 1 to 256 for blur, and σ from 1 to 50 for noise, were also included [27]. The maximums for these values were chosen to represent a level well beyond realistic worst-case scenarios [27]. Each augmentation had an independent occurrence probability of 0.5, contributing to the algorithm’s overall robustness. The images in the holdout test dataset, comprising 600 samples, were only resized to fit as inputs to all models. [0228] Deep Learning Model. In this study, the Dilated ResNet architecture was used for AGC tumor classification. The architecture’s effectiveness stems from its capacity to mitigate issues like exploding gradients [28]. An additional advantage of ResNets lies in their incorporation of skip connections, which helps alleviate the degradation problem associated Attorney Docket No.10046-568WO1 with the worsening performance of models as complexity increases [28]. Notably, unlike conventional Residual Network architectures, the Dilated ResNet incorporates dilated kernels. Dilations play a crucial role in maintaining feature maps’ spatial resolution during convolutions while expanding the network’s receptive field to capture more intricate details [29]. More details of the architecture of Dilated ResNet can be found in [16]. To verify our model’s performance relative to other commonly used architectures for image classification tasks, the performance metrics described for architecture with ResNet18 [30] and AlexNet [31] were compared. [0229] Evaluation. To limit the variability in performance due to the variance in hyperparameters, a random search across hyperparameter space was used to find a good candidate for each model for comparison (see Fig.32). The hyperparameter space in this study included the initial learning rate (LR), the LR scheduler, the optimizer, and weight decay. Four possible LR schedulers, namely Step, Reduce-on-Plateau, OneCycle [32], and Cosine Annealing [33] were considered. Three methods for the optimizer were considered, which were Stochastic Gradient Descent (SGD), Adam [34], and Adabound [35]. The full hyperparameter space is provided in Table 7. Furthermore, early stopping after 10 epochs was utilized to cut down on the training time in case the validation loss reached a point of no improvement well before the designated maximum of 50 epochs. One hundred (100) random configurations for each model architecture were considered in the random search. The best configuration was chosen such that the trained model minimized the validation set loss while keeping minimal overfit (which is the difference between the training and validation accuracy). Following this step, to ensure generalizable performance, each model was configured using the chosen hyperparameters and trained using stratified 5-fold cross- validation over the training split of data such that the class distribution in each fold remained intact. After verification of generalized performance, the entire training set was used to build the final models to be used for comparative evaluation. Table 7. Hyperparameter Space

Attorney Docket No.10046-568WO1

[0230] The standard evaluation metrics of accuracy (A), precision (P ), Recall (Re), F1-score (F 1), and Area under the ROC Curve (AUC) were used to evaluate model performance. Since this was a multi-class classification problem, the macro-averaged values (that is, the metrics are calculated for each class and then averaged) were reported. These metrics are crucial in assessing the model’s performance in AGC tumor classification, offering a more comprehensive evaluation than accuracy alone. Accuracy measures overall correctness, but in safety-critical applications like cancer diagnosis, emphasizing true positives and true negatives is vital. Apart from these metrics, the confusion matrices for each model were plotted to complete the analysis. The entire suite of training and tests was tracked using Weights and Biases [36]. [0231] Results and Discussion [0232] Hyperparameter Search. Table 8 lists the best-performing combination of hyperparameters for each model. For the dilated ResNet, the best LR scheduler was Cosine Annealing, while SGD was the best optimizer. The ResNet18 model performed best using the Step LR scheduler and Adabound optimizer. Finally, AlexNet also performed best using the Step LR scheduler coupled with an SGD optimizer. Notably, both the dilated ResNet and ResNet18 models required an initial learning rate of around 0.06-0.07 and a weight decay of around 0.03. On the other hand, AlexNet performed better with a relatively lower learning rate of 0.005, while having a higher weight decay of 0.047. [0233] Cross Validation. The average plots of training and validation accuracy for all three candidate models training using stratified 5-fold cross-validation with the hyperparameter configuration selected are provided in Fig.33. Here, ResNet18 has the largest variation in performance across the different folds, making it the most susceptible to changes in training data. The dilated ResNet model has the least variation, and therefore, it is the most general. Furthermore, the average accuracy across all folds is the lowest for ResNet18 and most for the dilated ResNet. Here again, the performance of AlexNet lies in between the other two models. While AlexNet reaches saturation much earlier than both of the other models, its variation in performance across the different folds and the lower peak accuracy as compared to the dilated ResNet18 hinder it from being the best overall pick. Table 8. Best Performing Configurations Attorney Docket No.10046-568WO1 Hyperparameter Dilated ResNet ResNet18 AlexNet LR 0.06308 0.07825 0.00527 LR Scheduler Cosine Annealing Step Step Optimizer SGD Adabound SGD Weight Decay 0.00338 0.00324 0.04663 [0234] Model comparisons. The results of the test set for the three models are summarized in Table 9, and the corresponding confusion matrices are provided in Fig.34. As can be seen through the evaluations, the dilated ResNet model proposed in the previous example for colorectal cancer classification is able to outperform both ResNet18 and AlexNet for AGC tumor classification. Observing the drop in testing accuracy for ResNet18, it can be concluded that this was a result of overfitting during training. While AlexNet achieved comparable performance in terms of the performance metrics, its susceptibility to training data made it unsuitable for this application. Furthermore, the number of trainable parameters in dilated ResNet is only 2.8M compared to 11.2M and 57M trainable parameters in ResNet and AlexNet, respectively. Thus, the dilated ResNet is able to outperform the other two models despite being comparatively lightweight and less complex. Looking at the confusion matrices, it was observed that all three models have some difficulties in differentiating between Types II and III polyps. This was expected since both tumor types present similarly on the exterior and there can be an overlap in diagnosis even amongst clinicians [3]. [0235] In conclusion, the disclosed Vision-based Tactile Sensor (VTS) and a complementary Machine Learning (ML) algorithm were used for classifying gastric polyp tumors to address the existing limitations on endoscopic diagnosis of Advanced Gastric Cancer (AGC) Tumors. By leveraging a seven-degree-of-freedom robotic manipulator and unique custom-designed and additively-manufactured realistic AGC tumor phantoms, the advantages of automated data collection using the VTS addressing the problem of data scarcity and biases encountered in traditional ML-based approaches are demonstrated. The synthetic-data-trained ML model was successfully evaluated and compared with traditional ML models utilizing various statistical metrics even under mixed morphological characteristics and partial sensor contact. Table 9. Performance Results Metric Dilated ResNet ResNet18 AlexNet Accuracy 0.9667 0.8333 0.9600 Precision 0.9656 0.8685 0.9631 Attorney Docket No.10046-568WO1 Recall 0.9692 0.8175 0.9550 F1 Score 0.9673 0.8422 0.9591 AUC 0.9990 0.9879 0.9983 Example 4: Addressing Biases in Gastric Cancer Diagnosis Through Generative Models and Vision-Based Surface Tactile Sensing [0236] Discussion [0237] Gastric cancer (GC) is the fifth most commonly diagnosed cancer globally and the fourth leading cause of cancer-related deaths [1]. The main reason for this high rate is that up to 62% of GC cases are diagnosed at advanced stages, leading to poorer survival rates [2]. Upper endoscopy is the primary method for initial detection, allowing visualization of the gastric tract lining where tumors are typically formed. Advanced gastric cancer (AGC) penetrates the muscularis propria and can be classified morphologically using the Borrmann classification into four types: polypoid (Type 1), fungating (Type 2), ulcerated (Type 3), and infiltrating or flat (Type 4) [3]. However, variability within classes and sole reliance on morphology result in inconsistency in clinical decision-making [4]. Endoscopic diagnosis also faces challenges like limited camera resolution, visual occlusion, inadequate steerability, and lighting changes, requiring extensive training for accurate detection [5, 6]. To address these limitations, Artificial Intelligence (AI) methods using Machine Learning (ML) have been applied to histopathological images and endoscopic videos for detecting and classifying GC tumors [7–10]. Nevertheless, these approaches are hindered by the limited availability of large and balanced datasets [11]. The American Medical Association highlighted this issue in 2018, recommending policies to mitigate bias in AI/ML-based healthcare applications [12]. Further, Deep Learning (DL) approaches require large amounts of data to generalize well, but collecting patient data is challenging due to time constraints, case availability, and privacy concerns, leading to spectrum bias and overfitting [13]. AGC tumors exhibit significant inter- class morphological variance, complicating the creation of balanced datasets [4]. While transfer learning can partially mitigate data access limitations, class imbalance still is a major challenge affecting the model’s ability to learn features of rare cancer cases. [0238] To address the early diagnosis of AGC, the disclosed Vision-Based Tactile Sensor called HySenSe [14–17] is evaluated based on its performance on both colorectal and gastric cancer tumors through various experiments [18], [19]. Unlike typical colonoscopic images, HySenSe generates high-resolution textural images of the surface in contact even under low interaction forces (< 1 N). However, a significant hurdle with this unique sensor is Attorney Docket No.10046-568WO1 the lack of access to sufficient images for training AI/ML models for reliable and accurate diagnosis of cancer polyps. Particularly in the case of GC polyps, which are typically larger (4 cm × 4 cm) [3] than the sensing area of HySenSe, only partial surface images of polyps can be captured. In the previous example [18], the effectiveness of utilizing synthetically generated images to improve classification performance on colorectal cancer (CRC) polyps was demonstrated. In this work, the use of disclosed generative models is evaluated to not only augment HySenSe textural images of AGC tumors but also focus specifically on mitigating biases during training ML models. In particular, a single class-conditioned latent diffusion model [20] was trained on an imbalanced dataset of HySenSe images, monitoring the performance using metrics such as Frechet inception distance (FID) [21]. The classification model was trained and evaluated by mixing the real textural images with synthetically generated images during training, and thoroughly exploring different methods of the data augmentation process. The classification performance was tested only on real images, focusing on generalizability through cross-validation and degree of overfitting. [0239] Methods [0240] Vision-based Tactile Sensor. In this study, the disclosed Vision-Based Tactile Sensor (VTS), HySenSe (see Fig.30), was used to acquire high-fidelity textural images of AGC tumor phantoms. [0241] Realistic Tumor Phantoms and Data Collection Procedure. The method and types of realistic tumor phantoms are the same as those used in Example 3. Reference is given to Fig.31. [0242] The dataset used in this study was collected using a semi-autonomous robotic data collection procedure utilizing a KUKA LBR Med 14 R820 (KUKA AG). In Example 1, using HySenSe for CRC polyp classification, high-fidelity textural images were captured manually with a setup involving a force gauge on a linear stage [15, 16, 23]. However, such a labor-intensive process limited data collection efficiency. Additionally, AGC tumors are significantly larger than CRC polyps, making it impossible for the VTS to capture the entire textural area of the tumor phantoms. These limitations were overcome by utilizing the robot manipulator,allowing for faster data collection across many different variations. Overall, 50 variations of orientation and contact of the AGC tumor phantoms with the HySenSe in 44 experiments (one for each polyp) were captured, leading to a total of 2200 images in the textural image dataset, with 550 unique images in each class. Further details about the robotic data collection procedure can be found in Example 2. During the actual training and evaluation, towards mimicking real-world distributions of these AGC tumors, the class Attorney Docket No.10046-568WO1 distribution was artificially altered to induce imbalances. This allowed us to effectively study the effects of data bias and explore mitigation strategies through synthetic data augmentation. [0243] Generative Model for Synthetic Data Augmentation. Unlike Example 2 of the present disclosure, dedicated direct diffusion models were used to generate images of each class; in this study, a singular Class-Conditioned Latent Diffusion Model (CC-LDM) was used to generate textural images of all classes. By moving from the pixel space where direct diffusion models operate to the latent space of pre-trained autoencoders, latent diffusion models reduce computational complexity while preserving high-resolution details and improving visual fidelity [20]. Exemplary system diagrams are shown in Figs.35A and 35B, which show the system processes for training and inference. [0244] The CC-LDM consists of (1) A pre-trained variational autoencoder (VAE) model with Kullback–Leibler (KL) loss [24] to encode/decode images to and from the latent space, (2) A UNet2D [25] backbone for predicting noise velocity, and (3) A Denoising Diffusion Implicit Model (DDIM) [26] scheduler to control the denoising process. During inference, the DDIM Scheduler was swapped out with a UniPC Multistep Scheduler [27] for faster sampling. The CC-LDM enables labeled image generation of each possible class simply by passing the desired class as input. For example, the input can be 50 images of “Borrmann Type I”, where 50 different variations of HySenSe textural images of that class will be generated. [0245] Classification Models. In this study, the Dilated ResNet architecture for AGC tumor classification was used. The ResNet architecture benefits from skip connections, which help alleviate the degradation problem, where performance deteriorates as model complexity increases [28]. Unlike conventional Residual Network architectures, the Dilated ResNet incorporates dilated kernels. These dilations are essential for preserving the spatial resolution of feature maps during convolutions while expanding the network’s receptive field to capture more intricate details [29]. This architecture is also effective in mitigating issues such as exploding gradients. [0246] Experiments and Evaluation Metric. For training the CC-LDM, different learning rates (ranging from 0.01 to 0.00001) and noise schedules (linear, scaled linear, and cosine) were considered, while the sampling process was performed as described in previously established work [30]. The performance tracking was achieved through metrics such as Inception Score (IS) and Frechet inception distance (FID). The model was trained for both conditional and unconditional sampling to allow for Classifier-Free-Guidance (CFG) [31] during inference, which allows the user to control the amount of influence the Attorney Docket No.10046-568WO1 conditioning has on the image output. Once the CC-LDM was trained, 500 synthetic images were generated for each class by passing the label as input and using a CFG scale of 7.5. This led to a total of 2000 synthetic images across the four possible Borrmann types. The next step was to train the classifier using 5-fold stratified cross-validation. The real dataset was divided into training and validation splits, and only the train split was augmented using synthetic images using the data-addition process under evaluation. The model was first pre-trained on the ILSVRC subset of the ImageNet [41] database [32], followed by fine-tuning on the augmented train split comprising both real and synthetic images. The amount of synthetic data was varied from 0% to 50% with respect to the total composition of the train split while keeping the amount of real data in the split constant. To thoroughly analyze the effect of synthetic data augmentation, different methods were considered: (1) random-add, where synthetic data was randomly sampled and added to the train split utilizing no prior knowledge, (2) equal-add, where equal amounts of synthetic data were added to each class, and (3) scale-by-inverse, where synthetic data was added in amounts inversely proportional to the class distribution of the real dataset. To quantify the effectiveness of these approaches, accuracy spread across different folds, as well as the mean overfit for each configuration, was logged. [0247] Results and Discussion [0248] Training Diffusion Models. The IS and FID were used to evaluate the models’ performances, as shown in Fig.36. A minimum FID of 67.4 was obtained after 200 epochs of training across the different learning rates and noise schedules considered. Note that since FID was calculated using an Inception v3 network [33], which is trained on images of everyday objects from ImageNet [32] that are different from our textural dataset, FID was used only as a comparative metric. The trained model was then further used for downstream tasks. [0249] Synthetic Images. Representative images generated for each class are presented in Figs.37A-37B. Upon visual inspection in comparison with the real experimental images, the synthetic images generated have high levels of detail with distinct features for each class. Furthermore, the LED color distribution generated in the synthetic images perfectly replicates the real images collected using the robotic system. [0250] Classification Performance and Bias. In a comparative study, each of the methods of data addition was trained and cross-validated (see Fig.38A). In the study, synthetic data was added to real training data, and the amount of synthetic data in the data composition of the augmented train was set from 0 to 50% (keeping the real data amount Attorney Docket No.10046-568WO1 constant ~ 7000 images). A 5-fold cross-validation for each configuration was carried out where synthetic data was added to each fold dynamically. The study tested on only real images. [0251] Figure 38B shows the results of a cross-validation test for the random addition of synthetic data. As shown, as the amount of synthetic data is increased, the training accuracy only slightly declines (from 100% to 99%), and the test accuracy markedly increases from 85% to 96%, which suggests a significant reduction in overfitting. However, it was also observed that Random addition of data does not make the model more generalized since the accuracy spread across folds was not significantly affected. At 0% synthetic data, the test accuracy ranges from 83% to 86%, and upon increasing this amount, this range fluctuates. At 25% synthetic data, the test accuracy spread is from 87% to 94%, and at 50% synthetic data, the spread is from 94% to 97%. [0252] Results of adding synthetic data by the scale-by-inverse method are shown in Fig.38C. Training accuracy shows a decreasing trend as synthetic data is added, while validation accuracy increases, which signals that the scale-by-inverse model is overfitting less. Validation accuracy spread also reduces, indicating the scale-by-inverse method can help in reducing bias. [0253] In conclusion, our study demonstrates the application of generative models to create realistic textural images of AGC polyps, addressing the critical need for diverse and balanced datasets needed for diagnosis of cancerous polyps using machine learning. Our approach, involving the training of generative models on existing medical data and subsequent generation of synthetic samples, demonstrates successes in enriching both dataset quality and size. Example 5: On the Effects of Lighting Variations of Vision-Based Tactile Sensors On AI- Based Diagnosis of Advanced Gastric Cancer [0254] In this study, the importance of color for the classification of AGC textural tumor images is demonstrated by testing the generalization ability of the classification models on textural images synthetically altered using hue, brightness, and contrast adjustments. Through these color-centric data augmentations, a classifier agnostic to the camera conditions and internal lightning of the vision-tactile sensor is presented. [0255] Materials and Methods [0256] Dataset. The dataset used in this work was originally presented in Example 3. The dataset consists of high-resolution textural images of Advanced Gastric Cancer (AGC) Attorney Docket No.10046-568WO1 tumor phantoms, which were captured using a vision-based tactile sensor (VTS), i.e., HySenSe sensor, mounted on a semi-autonomous robotic system (see Fig.30). The following subsections detail each component of the dataset creation procedure. [0257] Vision-based Tactile Sensor. In this study, the disclosed Vision-Based Tactile Sensor (VTS), HySenSe (see Fig.30), was used to acquire high-fidelity textural images of AGC tumor phantoms. [0258] Realistic Tumor Phantoms and Data Collection Procedure. The method and types of realistic tumor phantoms are the same as those used in Example 3. Reference is given to Fig.31. [0259] Data Splitting and Sensor Simulation. The data was split into training and testing datasets using a stratified splitting technique to ensure both the training and testing subsets accurately reflect the overall distribution of categories in the full dataset with 80% of the data allocated for training and 20% for testing. Notably, this splitting was performed at the tumor level (and not the individual image level) to ensure that the same tumor was not encountered during both training and testing. In order to simulate variation VTS sensors, the color space a replicated version of the original test set was modified by applying a random level of hue, saturation, brightness, and contrast and is called simulated test set hereafter. [0260] Deep Learning Model. The ResNet18 architecture was chosen for AGC tumor classification based on the findings of Example 3, which was done solely on real images and demonstrated to provide the best classification results. ResNet models are particularly known for their skip connections, which mitigate exploding gradients and performance degradation issues; a common problem where increasing complexity leads to diminishing model effectiveness [24]. [0261] Color Centric Data Augmentation. Apart from standard data augmentation techniques such as random rotations, vertical and horizontal flips, etc., a variety of color- centric data augmentation techniques were employed to explore their effects on the performance of the models. These techniques involve modifying the color space of images in the training and validation datasets to enhance the model’s robustness and generalizability. Here is the outline of the augmentation scenarios implemented: [0262] (1) Baseline/No Aug.: No augmentations were applied to either the training, validation, or test datasets. This scenario serves as a control to assess the impact of color augmentations on model performance. Attorney Docket No.10046-568WO1 [0263] (2) Hue Adjustment: The hue of the images in the training and validation datasets was randomly altered. This test assesses how well the model can handle variations in lighting color. [0264] (3) Saturation Adjustment: The saturation level of the images in the training and validation datasets was randomly changed by 50%. [0265] (4) Brightness Adjustment: The brightness of images in the training and validation datasets was randomly modified by 50%. [0266] (5) Contrast Adjustment: The contrast of the images in the training and validation datasets was randomly varied by 50%. [0267] (6) Combination of Augmentations (All): This scenario combined hue, saturation, brightness, and contrast adjustments, applying all these transformations to each image in the training and validation datasets. This comprehensive approach tests the model’s robustness across multiple types of visual modifications. [0268] One leaves out augmentations: All augmentations except hue, saturation, brightness, and contrast were applied to identify the ones that cause the most drop in performance metrics named (7) W/O Hue, (8) W/O Saturation, (9) W/O Brightness and (10) W/O Contrast. [0269] (11) Grayscale: All images in the training, validation, and test datasets were converted to grayscale. This involves reducing the three color channels (red, green, and blue) to a single gray channel but retaining a three-channel output to accommodate the model architecture. This tests the model’s ability to learn from intensity variations alone. [0270] (12) All + Grayscale: All augmentations were applied to the training and validation datasets, and then all datasets, including the test dataset were converted to grayscale images. [0271] Figure 39 depicts the aforementioned scenarios for the four tumor types. [0272] Evaluation. In order to evaluate the performance of the model in different scenarios, we employed the following set of metrics: Accuracy (Acc) to gauge the overall correctness of the model across all predictions. Precision (Prec) and Recall (Rec) to measure the model’s ability to identify only relevant instances and its success rate in actually detecting these instances and F1 Score (F 1) as a single harmonized metric of precision and recall. Lastly, the Area Under the Curve (AUC) to measure the model’s capability to discriminate between classes at various threshold settings, providing a robust indicator of its predictive power. [0273] Results Attorney Docket No.10046-568WO1 [0274] Table 10 and Fig.40 show the impact of various augmentation scenarios on the model’s performance. The application of individual augmentation scenarios variably affected model performance across different metrics. The model trained without any augmentation (Baseline) consistently performed lower across all metrics. Table 10. Models performance metrics on the simulated test set under different image augmentation scenarios.

Attorney Docket No.10046-568WO1 [0275] Among the single augmentation scenarios, hue and contrast augmentations significantly enhanced the model’s performance metrics, suggesting that color augmentation could be critical in contexts with different VTS LED colors and sensor properties. Similarly, brightness adjustment yielded the highest single augmentation impact on all performance metrics, indicating its potential utility in scenarios involving varied lighting conditions. [0276] The most significant improvements in model performance metrics were observed when all augmentation techniques were applied simultaneously (All). This scenario showed a substantial increase across all metrics, with the accuracy, precision, recall, and F1 score rising to 0.86, and the AUC to 0.97. These results underscore the effectiveness of a comprehensive augmentation strategy, demonstrating that the synergistic application of multiple techniques can substantially enhance model robustness and generalization capability across varied imaging conditions. [0277] With the omission of scenarios, removing hue augmentation (W/O Hue) resulted in the most significant drops across performance metrics, demonstrating the importance of this augmentation. It suggests that models trained on data from one VTS sensor with a specific lighting color and pattern would not be applicable for classifying data from other VTS sensors without hue augmentation. Omitting brightness augmentation (W/O Brightness) had a significant effect, but not as large as W/O Hue, underscoring its importance for the generalizability of the model. Removing saturation and contrast augmentations (W/O Saturation and W/O Contrast) had a small effect on the performance metrics. Table 11. Models performance metrics on the simulated test set under different image augmentation scenarios.

Attorney Docket No.10046-568WO1

[0278] Applying grayscale adjustment (Grayscale) resulted in the lowest performance metrics among all scenarios. Although adding all other augmentations before applying grayscale improved performance across all metrics by approximately 0.31, the model performance was still significantly lower than in the All Augs. scenario; for example, accuracy and precision were 0.65 and 0.67 versus 0.86 and 0.86. This emphasizes the importance of colored lighting in the classification of tumor types. [0279] All augmentation scenarios were applied to the original test dataset to assess how augmentation could degrade model performance (Fig.41 and Table 11). The Baseline scenario with no augmentation showed the highest accuracy, precision, and F1 score of 0.99. All augmentation scenarios demonstrated some level of performance degradation but still maintained satisfactory performance, with the lowest values (Acc, Prec, Rec, F 1 > 0.81) observed in the (All Augs. + Grayscale). Interestingly, Grayscale showed similar performance to the Baseline scenario, suggesting that when a model is trained on images from a single VTS sensor, color channels can be omitted without sacrificing model performance. [0280] Among different gastric cancer polyp types, fungating and ulcerated had the highest missing rates by the models in most augmentation scenarios, which can be attributed to the morphological similarity between these two types of polyps. Attorney Docket No.10046-568WO1 [0281] The impact of various augmentation techniques on the models’ performance trained to classify gastric cancer polyp types using VTS sensor data was demonstrated. The application of color-related augmentations such as hue, brightness, and contrast adjustments substantially enhanced the model’s ability to generalize across different imaging conditions. Specifically, using a comprehensive augmentation strategy by combining multiple techniques was shown to be the most effective. Moreover, the removal of hue and brightness adjustments led to notable declines in performance, under-scoring their importance in handling variability in lighting conditions and sensor characteristics. These insights emphasize the necessity of a tailored augmentation approach, particularly when dealing with high variability in medical imaging data, to enhance the robustness and applicability of machine learning models in clinical settings. Example 6: Example Computing Device [0282] It should be appreciated that the logical operations described above can be implemented in some embodiments (1) as a sequence of computer-implemented acts or program modules running on a computing system and/or (2) as interconnected machine logic circuits or circuit modules within the computing system. The implementation may be a matter of choice dependent on the performance and other requirements of the computing system. Accordingly, the logical operations described herein are referred to variously as state operations, acts, or modules. These operations, acts, and/or modules can be implemented in software, in firmware, in special purpose digital logic, in hardware, and any combination thereof. It should also be appreciated that more or fewer operations can be performed than shown in the figures and described herein. These operations can also be performed in a different order than those described herein. [0283] In addition to the various systems discussed herein, Fig.16 shows an illustrative computer architecture for a computing device 1600 capable of executing the software components that can use the output of the exemplary method described herein. The computer architecture shown in Fig.16 illustrates an example computer system configuration, and the computing device 1600 can be utilized to execute any aspects of the components and/or modules presented herein described as executing on the analysis system or any components in communication therewith, including providing support of TEE as described herein as well as trusted Time, GPS, and Monotonic Counter as noted above. [0284] In an embodiment, the computing device 1600 may comprise two or more computers in communication with each other that collaborate to perform a task. For example, Attorney Docket No.10046-568WO1 but not by way of limitation, an application may be partitioned in such a way as to permit concurrent and/or parallel processing of the instructions of the application. Alternatively, the data processed by the application may be partitioned in such a way as to permit concurrent and/or parallel processing of different portions of a data set by the two or more computers. In an embodiment, virtualization software may be employed by the computing device 1600 to provide the functionality of a number of servers that is not directly bound to the number of computers in the computing device 1600. For example, virtualization software may provide twenty virtual servers on four physical computers. In an embodiment, the functionality disclosed above may be provided by executing the application and/or applications in a cloud computing environment. Cloud computing may comprise providing computing services via a network connection using dynamically scalable computing resources. Cloud computing may be supported, at least in part, by virtualization software. A cloud computing environment may be established by an enterprise and/or may be hired on an as-needed basis from a third- party provider. Some cloud computing environments may comprise cloud computing resources owned and operated by the enterprise as well as cloud computing resources hired and/or leased from a third-party provider. [0285] In its most basic configuration, computing device 1600 typically includes at least one processing unit 1620 and system memory 1630. Depending on the exact configuration and type of computing device, system memory 1630 may be volatile (such as random-access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. [0286] This most basic configuration is illustrated in Fig.16 by dashed line 1610. The processing unit 1620 may be a programmable processor that performs arithmetic and logic operations necessary for the operation of the computing device 1600. While only one processing unit 1620 is shown, multiple processors may be present. As used herein, processing unit and processor refers to a physical hardware device that executes encoded instructions for performing functions on inputs and creating outputs, including, for example, but not limited to, microprocessors (MCUs), microcontrollers, graphical processing units (GPUs), and application-specific circuits (ASICs). Thus, while instructions may be discussed as executed by a processor, the instructions may be executed simultaneously, serially, or otherwise executed by one or multiple processors. The computing device 1600 may also include a bus or other communication mechanism for communicating information among various components of the computing device 1600. Attorney Docket No.10046-568WO1 [0287] Computing device 1600 may have additional features/functionality. For example, computing device 1600 may include additional storage such as removable storage 1640 and non-removable storage 1650 including, but not limited to, magnetic or optical disks or tapes. Computing device 1600 may also contain network connection(s) 1680 that allow the device to communicate with other devices such as over the communication pathways described herein. The network connection(s) 1680 may take the form of modems, modem banks, Ethernet cards, universal serial bus (USB) interface cards, serial interfaces, token ring cards, fiber distributed data interface (FDDI) cards, wireless local area network (WLAN) cards, radio transceiver cards such as code division multiple access (CDMA), global system for mobile communications (GSM), long-term evolution (LTE), worldwide interoperability for microwave access (WiMAX), and/or other air interface protocol radio transceiver cards, and other well-known network devices. Computing device 1600 may also have input device(s) 1670 such as keyboards, keypads, switches, dials, mice, trackballs, touch screens, voice recognizers, card readers, paper tape readers, or other well-known input devices. Output device(s) 1660 such as printers, video monitors, liquid crystal displays (LCDs), touch screen displays, displays, speakers, etc. may also be included. The additional devices may be connected to the bus in order to facilitate the communication of data among the components of the computing device 1600. All these devices are well known in the art and need not be discussed at length here. [0288] The processing unit 1620 may be configured to execute program code encoded in tangible, computer-readable media. Tangible, computer-readable media refers to any media that is capable of providing data that causes the computing device 1600 (i.e., a machine) to operate in a particular fashion. Various computer-readable media may be utilized to provide instructions to the processing unit 1620 for execution. Example tangible, computer-readable media may include, but is not limited to, volatile media, non-volatile media, removable media, and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. System memory 1630, removable storage 1640, and non- removable storage 1650 are all examples of tangible, computer storage media. Example tangible, computer-readable recording media include, but are not limited to, an integrated circuit (e.g., field-programmable gate array or application-specific IC), a hard disk, an optical disk, a magneto-optical disk, a floppy disk, a magnetic tape, a holographic storage medium, a solid-state device, RAM, ROM, electrically erasable program read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other Attorney Docket No.10046-568WO1 optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. [0289] In light of the above, it should be appreciated that many types of physical transformations take place in the computer architecture 1600 in order to store and execute the software components presented herein. It also should be appreciated that the computer architecture 1600 may include other types of computing devices, including hand-held computers, embedded computer systems, personal digital assistants, and other types of computing devices known to those skilled in the art. It is also contemplated that the computer architecture may not include all of the components shown in Fig.16, may include other components that are not explicitly shown in Fig.16, or may utilize an architecture different than that shown in FIG.16. [0290] In an example implementation, the processing unit 1620 may execute program code stored in the system memory 1630. For example, the bus may carry data to the system memory 1630, from which the processing unit 1620 receives and executes instructions. The data received by the system memory 1630 may optionally be stored on the removable storage 1640 or the non-removable storage 1650 before or after execution by the processing unit 1620. [0291] It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination thereof. Thus, the methods and apparatuses of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium wherein, when the program code is loaded into and executed by a machine, such as a computing device, the machine becomes an apparatus for practicing the presently disclosed subject matter. In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application programming interface (API), reusable controls, or the like. Such programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. Attorney Docket No.10046-568WO1 In any case, the language may be a compiled or interpreted language, and it may be combined with hardware implementations. [0292] In some examples, the methods and systems described herein form a software suite that may be utilized as software as a service. The software may be written to a non- transitory computer readable medium, stored on a processor of a local computing device or on a cloud computing system to be accessed remotely. [0293] Moreover, the various components may be in communication via wireless and/or hardwired or other desirable and available communication means, systems and hardware. Moreover, various components and modules may be substituted with other modules or components that provide similar functions. [0294] Some references, which may include various patents, patent applications, and publications, are cited in a reference list and discussed in the disclosure provided herein. The citation and/or discussion of such references is provided merely to clarify the description of the present disclosure and is not an admission that any such reference is “prior art” to any aspects of the present disclosure described herein. In terms of notation, “[n]” corresponds to the nth 10 reference in the list. All references cited and discussed in this specification are incorporated herein by reference in their entireties and to the same extent as if each reference was individually incorporated by reference. [0295] Although example embodiments of the present disclosure are explained in some instances in detail herein, it is to be understood that other embodiments are contemplated. Accordingly, it is not intended that the present disclosure be limited in its scope to the details of construction and arrangement of components set forth in the following description or illustrated in the drawings. The present disclosure is capable of other embodiments and of being practiced or carried out in various ways. [0296] It must also be noted that, as used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Ranges may be expressed herein as from “about” or “5 approximately” one particular value and/or to “about” or “approximately” another particular value. When such a range is expressed, other exemplary embodiments include from the one particular value and/or to the other particular value. [0297] By “comprising” or “containing” or “including” is meant that at least the name compound, element, particle, or method step is present in the composition or article or method, but does not exclude the presence of other compounds, materials, particles, method Attorney Docket No.10046-568WO1 steps, even if the other such compounds, material, particles, method steps have the same function as what is named. [0298] In describing example embodiments, terminology will be resorted to for the sake of clarity. It is intended that each term contemplates its broadest meaning as understood by those skilled in the art and includes all technical equivalents that operate in a similar manner to accomplish a similar purpose. It is also to be understood that the mention of one or more steps of a method does not preclude the presence of additional method steps or intervening method steps between those steps expressly identified. Steps of a method may be performed in a different order than those described herein without departing from the scope of the present disclosure. Similarly, it is also to be understood that the mention of one or more components in a device or system does not preclude the presence of additional components or intervening components between those components expressly identified. [0299] The term “about,” as used herein, means approximately, in the region of, roughly, or around. When the term “about” is used in conjunction with a numerical range, it modifies that range by extending the boundaries above and below the numerical values set forth. In general, the term “about” is used herein to modify a numerical value above and below the stated value by a variance of 10%. In one aspect, the term “about” means plus or minus 10% of the numerical value of the number with which it is being used. Therefore, about 50% means in the range of 45%-55%. Numerical ranges recited herein by endpoints include all numbers and fractions subsumed within that range (e.g., 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.90, 4, 4.24, and 5). [0300] Similarly, numerical ranges recited herein by endpoints include subranges subsumed within that range (e.g., 1 to 5 includes 1-1.5, 1.5-2, 2-2.75, 2.75-3, 3-3.90, 3.90-4, 4-4.24, 4.24-5, 2-5, 3-5, 1-4, and 2-4). It is also to be understood that all numbers and fractions thereof are presumed to be modified by the term “about.” Exemplary Aspects Exemplary aspect 1. A method for post-processing a machine learning classification, the method comprising: classifying a dataset as one or more classes using a trained machine learning classifier and calculating an associated probability score for each of the one or more classes; calibrating the probability score of the classification of the trained machine learning classifier using a regression operator (e.g. using the Cascade Reliability Framework (CRF));and displaying as a report, the calibrated probability score for the classification of the machine learning classifier. Attorney Docket No.10046-568WO1 Exemplary aspect 2. The method of exemplary aspect 1 further comprising ranking the one or more classes of the classification using (e.g., the conformal prediction module) a Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm. Exemplary aspect 3. The method of exemplary aspect 2, the method further comprising receiving an error rate variable from a user using an input means. Exemplary aspect 4. The method of any one of exemplary aspects 1-3, wherein calibrating (e.g., classification calibration module) comprises temperature scaling, variational temperature scaling, or a combination thereof. Exemplary aspect 5. The method of exemplary aspect 4, wherein the calibrated probability of the ranked classes is output up to one minus the error rate variable. Exemplary aspect 6. The method of any one of exemplary aspects 1-5, wherein a penultimate layer of the machine learning classifier comprises a Softmax function. Exemplary aspect 7. The method of any one of exemplary aspects 1-6, wherein the report or display comprises the probability score and associated class. Exemplary aspect 8. The method of any one of exemplary aspects 1-7, wherein the report or display comprises the probability score, associated class, and a visual representation of the dataset. Exemplary aspect 9. The method of any one of exemplary aspects 1-8, wherein the dataset comprises a medical image. Exemplary aspect 10. The method of any one of exemplary aspects 1-9, wherein the dataset comprises an engineering image. Exemplary aspect 11. The method of any one of exemplary aspects 1-10, wherein the dataset comprises a synthetically produced image. Exemplary aspect 12. The method of exemplary aspect 11, wherein the machine learning classifier is trained on a plurality of synthetically produced images. Exemplary aspect 13. The method of exemplary aspects 11 or 12, wherein a generative model is used to produce the synthetic image or plurality of synthetic images. Exemplary aspect 14. The method of any one of exemplary aspects 1-13, wherein the machine learning classifier is trained on a plurality of augmented images. Exemplary aspect 15. The method of exemplary aspect 14, wherein augmentation of the plurality of augmented images comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. Exemplary aspect 16. The method of any one of exemplary aspects 1-15, wherein the machine learning classifier comprises a support vector machine (SVM), a neural network, a Attorney Docket No.10046-568WO1 convoluted neural network (CNN), a densely connected CNN, a residual network, type classifier. Exemplary aspect 17. The method of any one of exemplary aspects 1-16, the method further comprises calculating a heat map of the dataset (e.g., vision-transformer-based classifier). Exemplary aspect 18. The method of exemplary aspect 17, the method further comprising adding descriptive text to the report (e.g., text-transformer module). Exemplary aspect 19. The method of exemplary aspect 18, wherein the descriptive text may be generated by a natural language model (e.g., LLM). Exemplary aspect 20. A system for post-processing a machine learning classification, the system comprising: one or more processors; an output device; and a memory, the memory storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method comprising: classifying a dataset as one or more classes using a machine learning classifier and calculating an associated probability score for each of the one or more classes; calibrating the probability score of the classification of the machine learning classifier using a regression operator (e.g. using the Cascade Reliability Framework (CRF)); and displaying as a report or display, on the output device, the calibrated probability score for the classification of the machine learning classifier. Exemplary aspect 21. The system of exemplary aspect 20, wherein the one or more processors are connected by any communication means (e.g., directly, wirelessly, etc.). Exemplary aspect 22. The system of exemplary aspects 20 or 21, wherein the trained machine learning classifier is executed on a first processor of the one or more processors. Exemplary aspect 23. The system of anyone of exemplary aspects 20-22, wherein the method further comprises ranking the one or more classes of the classification using (e.g. the conformal prediction module) a Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm. Exemplary aspect 24. The system of anyone of exemplary aspects 20-23, the method further comprising receiving an error rate variable from a user using an input means. Exemplary aspect 25. The system of anyone of exemplary aspects 20-24, wherein calibrating (e.g. classification calibration module) comprises temperature scaling, variational temperature scaling, or a combination thereof. Exemplary aspect 26. The system of anyone of exemplary aspects 20-25, wherein the calibrated probability of the ranked classes is output up to one minus the error rate variable. Attorney Docket No.10046-568WO1 Exemplary aspect 27. The system of anyone of exemplary aspects 20-26, wherein a penultimate layer of the machine learning classifier comprises a Softmax function. Exemplary aspect 28. The system of anyone of exemplary aspects 20-27, wherein the report or display comprises the probability score and associated class. Exemplary aspect 29. The system of anyone of exemplary aspects 20-28, wherein the report comprises the probability score, associated class, and a visual representation of the dataset. Exemplary aspect 30. The system of anyone of exemplary aspects 20-29, wherein the dataset comprises a medical image. Exemplary aspect 31. The system of anyone of exemplary aspects 20-30, wherein the dataset comprises an engineering image. Exemplary aspect 32. The system for post-processing classifications of anyone of exemplary aspects 20-31, wherein the dataset comprises a synthetically produced image. Exemplary aspect 33. The system of anyone of exemplary aspects 20-32, wherein the machine learning classifier is trained on a plurality of synthetically produced images. Exemplary aspect 34. The system of anyone of exemplary aspects 32-33, wherein a generative model is used to produce the synthetic image or plurality of synthetic images. Exemplary aspect 35. The system of any one of exemplary aspects 20-34, wherein the machine learning classifier is trained on a plurality of augmented images. Exemplary aspect 36. The system of exemplary aspect 35, wherein augmentation of the plurality of augmented images comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. Exemplary aspect 37. The system of any one of exemplary aspects 20-36, wherein the machine learning classifier comprises, a support vector machine (SVM), a neural network, a convoluted neural network (CNN), a densely connected CNN, a residual network, or any other suitable machine learning model. Exemplary aspect 38. The system of any one of exemplary aspects 20-37, the method further comprises calculating a heat map of the dataset (e.g. vision-transformer based classifier). Exemplary aspect 39. The system of exemplary aspect 38, the method further comprising adding descriptive text to the output display (e.g., explainer module). Exemplary aspect 40. The system of exemplary aspect 39, wherein the descriptive text may be generated by a natural language model (e.g. LLM). Exemplary aspect 41. The system of any one of exemplary aspects 20-23, wherein the dataset is a plurality of medical images. Attorney Docket No.10046-568WO1 Exemplary aspect 42. An interactive artificial intelligence system, the system comprising: one or more processors; an output device; an input device; and two or more data storage devices, a first data storage device storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method comprising: classifying an image and generating an associated probability score using a trained machine learning classifier ; receiving, from a user by an input device, an error rate variable value; calibrating the probability score of the image classification of the machine learning classifier using a regression operator (e.g. a Cascade Reliability Framework (CRF)) wherein the confidence of the calibrated probability score is related to the user's error rate variable value; calculating an attention map of the image (e.g. vision-transformer based classifier); adding descriptive text to the attention map; displaying, on the output device, the calibrated probability score for the image classification of the machine learning classifier, the attention map and descriptive text . Exemplary aspect 43. The interactive artificial intelligence system of exemplary aspect 42, wherein the method comprises receiving an audio description of a medical image, converting the audio description to text using a natural language model, and generating a synthetic medical image based on the converted text description. Exemplary aspect 44. The interactive artificial intelligence system of exemplary aspects 42 or 43, wherein the descriptive text is generated from a natural language model (e.g., LLM). Exemplary aspect 45. The interactive artificial intelligence system of any one of exemplary aspects 42-44, wherein the trained machine learning classifier is trained on a plurality of synthetic medical images. Exemplary aspect 46. The interactive artificial intelligence system of exemplary aspect 45, wherein the plurality of synthetic medical images is augmented before training the machine learning classifier. Exemplary aspect 47. The interactive artificial intelligence system of exemplary aspect 46, wherein augmented comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. [0301] The following patents, applications, and publications, as listed below and throughout this document, are hereby incorporated by reference in their entirety herein. References [0302] The references are numbered according to the Examples 1, 2, 3, and 4 discussed above. Attorney Docket No.10046-568WO1 Example 1 [1] H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, and F. Bray, “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for Clinicians, vol. 71, pp. 209 – 249, 2021. [2] A. Jemal, A. Thomas, T. Murray, and M. J. Thun, “Cancer statistics, 2002,” CA: A Cancer Journal for Clinicians, vol.52, 2002. [3] C. Li, S. J. Oh, S. Kim, W. J. Hyung, M. Yan, Z. Zhu, and S. H. Noh, “Macroscopic borrmann type as a simple prognostic indicator in patients with advanced gastric cancer,” Oncology, vol.77, pp.197 – 204, 2009. [4] A. T. R. Axon, M. D. Diebold, M. A. Fujino, R. Fujita, R. M. Genta, J. J. Gonvers, M. B. Guelrud, H. Inoue, M. E. Jung, H. Kashida, S. ei Kudo, R. Lambert, C. J. Lightdale, T. Nakamura, H. Neuhaus, H. Niwa, K. Ogoshi, J. F. Rey, R. H. Riddell, M. Sasako, T. Shimoda, H. Suzuki, G. N. J. Tytgat, K. K. Wang, H. Watanabe, T. Yamakawa, and S. Yoshida, “Update on the paris classification of superficial neoplastic lesions in the digestive tract.” Endoscopy, vol.376, pp.570–8, 2005. [5] S. ei Kudo, S. Hirota, T. Nakajima, S. Hosobe, H. Kusaka, T. Kobayashi, M. Himori, and A. Yagyuu, “Colorectal tumours and pit pattern.” Journal of Clinical Pathology, vol.47, pp.880 – 885, 1994. [6] G. chun Lou, J. min Yang, Q. shun Xu, W. Huang, and S. Shi, “A retrospective study on endoscopic missing diagnosis of colorectal polyp and its related factors.” The Turkish journal of gastroenterology : the official journal of Turkish Society of Gastroenterology, vol.25 Suppl 1, pp.182–6, 2014. [7] Y. Mori, S. ei Kudo, T. M. Berzin, M. Misawa, and K. Takeda, “Computer-aided diagnosis for colonoscopy,” Endoscopy, vol.49, pp.813 – 819, 2017. [8] C. W. Ko and J. A. Dominitz, “Complications of colonoscopy: magnitude and management.” Gastrointestinal endoscopy clinics of North America, vol.204, pp.659–71, 2010. [9] K. Patel, K. Li, K. Tao, Q. Wang, A. Bansal, A. Rastogi, and G. Wang, “A comparative study on polyp classification using convolutional neural networks,” PloS one, vol.15, no.7, p. e0236452, 2020. [10] S.-B. Zhao, S. Wang, P. Pan, T. Xia, X. Chang, X. Yang, L. Guo, Q.-Q. Meng, F. Yang, W. Qian, Z. Xu, Y. Wang, Z. Wang, L. Gu, R. Wang, F. Jia, J. Yao, Z. Li, and Y. Bai, “Magnitude, risk factors, and factors associated with adenoma miss rate of tandem Attorney Docket No.10046-568WO1 colonoscopy: A systematic review and meta-analysis.” Gastroenterology, vol. 1566, pp. 1661–1674.e11, 2019. [11] E. F. Ribeiro, A. Uhl, G. Wimmer, and M. Ha¨fner, “Exploring deep learning and transfer learning for colonic polyp classification,” Computational and Mathematical Methods in Medicine, vol.2016, 2016. [12] S. Patino-Barrientos, D. Sierra-Sosa, B. Garcia-Zapirain, C. Castillo- Olea, and A. S. Elmaghraby, “Kudo’s classification for colon polyps assessment using a deep learning approach,” Applied Sciences, vol.10, p.501, 2020. [13] Y. Wang, Z. Feng, L. Song, X. Liu, and S. Liu, “Multiclassification of endoscopic colonoscopy images based on deep transfer learning,” Computational and Mathematical Methods in Medicine, vol.2021, 2021. [14] L. F. Sa´nchez-Peralta, L. Bote-Curiel, A. Pico´n, F. M. Sa´nchez- Margallo, and J. B. Pagador, “Deep learning to find colorectal polyps in colonoscopy: A systematic literature review,” Artificial intelligence in medicine, vol.108, p.101923, 2020. [15] O. C. Kara, N. Venkatayogi, N. Ikoma, and F. Alambeigi, “A reliable ultaneous type and stage detection of colorectal cancer polyps,” Annals of Biomedical Engineering, 2023. [16] [16] A. Wang, J. Mo, C. Zhong, S. Wu, S. Wei, B. Tu, C. Liu, D. Chen, Q. Xu, M. Cai, Z. Li, W. Xie, M. Xie, M. Kato, X. Xi, and B. Zhang, “Artificial intelligence-assisted detection and classification of colorectal polyps under colonoscopy: a systematic review and meta-analysis,” Annals of Translational Medicine, vol. 9, 2021. [17] Y. Shin and I. Balasingham, “Comparison of hand-craft feature based svm and cnn based deep learning framework for automatic polyp classification,” 201739th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp.3277–3280, 2017. [18] M. Viscaino, J. T. Bustos, P. Mun˜oz, C. A. Cheein, and F. A. A. Chee´ın, “Artificial intelligence for the early detection of colorectal cancer: A comprehensive review of its advantages and misconceptions,” World Journal of Gastroenterology, vol.27, pp.6399 – 6414, 2021. [19] P. Wang, T. M. Berzin, J. R. G. Brown, S. S. Bharadwaj, A. Becq, X. Xiao, P. Liu, L. Li, Y. Song, D. Zhang, Y. Li, G. Xu, M. Tu, and X. Liu, “Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study,” Gut, vol. 68, pp. 1813 – 1819, 2019. Attorney Docket No.10046-568WO1 [20] M. F. Byrne, N. Chapados, F. Soudan, C. Oertel, M. L. Pe´rez, R. Kelly, N. Iqbal, F. Chandelier, and D. K. Rex, “Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model,” Gut, vol. 68, pp. 94 – 100, 2017. [21] P. Wang, X. Liu, T. M. Berzin, J. R. G. Brown, P. Liu, C. Zhou, L. Lei, L. Li, Z. Guo, S. Lei, F. Xiong, H. Wang, Y. Song, Y. Pan, and G. Zhou, “Effect of a deep-learning computer-aided detection system on adenoma detection during colonoscopy (cade-db trial): a double-blind randomised study.” The lancet. Gastroenterology & hepatology, 2020. [22] W. Wang, J. Tian, C. Zhang, Y. Luo, X. Wang, and J. Li, “An improved deep learning approach and its applications on colonic polyp images detection,” BMC Medical Imaging, vol. 20, 2020. [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. [24] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2016. [25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, pp. 84 – 90, 2012. [26] K. Huang, Y. Wang, M. Tao, and T. Zhao, “Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective,” Advances in neural information processing systems, vol. 33, pp. 2698–2709, 2020. [27] S. Zhou, C. Chen, G. Han, and X. Hou, “Deep convolutional neural evolution using small size dataset,” 2019 Chinese ntrol Conference (CCC), pp.8568–8572, 2019. [28] S. Poudel, Y. J. Kim, D. M. Vo, and S.-W. Lee, “Colorectal disease classification using efficiently scaled dilation in convolutional neural network,” IEEE Access, vol. 8, pp. 99227–99238, 2020. [29] E. F. Ribeiro, M. Ha¨fner, G. Wimmer, T. Tamaki, J. J. W. Tischendorf, S. Yoshida, S. Tanaka, and A. Uhl, “Exploring texture transfer learning for colonic polyp classification via convolutional neural networks,” 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp.1044–1048, 2017. [30] R. Zhang, Y. Zheng, T. W. C. Mak, R. Yu, S. H. Wong, J. Y. W. Lau, and C. C. Y. Poon, “Automatic detection and classification of colorectal polyps by transferring low-level Attorney Docket No.10046-568WO1 cnn features from nonmedical domain,” IEEE Journal of Biomedical and Health Informatics, vol.21, pp.41–47, 2017. [31] Y. Gal, “Uncertainty in deep learning,” 2016. [32] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” ArXiv, vol. abs/1706.04599, 2017. [33] G. Liang, Y. Zhang, X. Wang, and N. Jacobs, “Improved trainable calibration method for neural networks on medical imaging classification,” ArXiv, vol. abs/2009.04057, 2020. [34] K. R. M. Fernando and C. P. Tsokos, “Dynamically weighted balanced loss: Class imbalanced learning and confidence calibration of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, pp. 2940–2951, 2022. [35] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” ArXiv, vol. abs/1506.02142, 2015. [36] G. Carneiro, L. Z. C. T. Pu, R. Singh, and A. D. Burt, “Deep learning uncertainty and confidence calibration for the five-class polyp classification m colonoscopy,” Medical image analysis, vol.62, p.101653, 2020. [37] K. C. Kusters, T. Scheeve, N. Dehghani, Q. E. W. van der Zander, R.M. Schreuder, A. A. M. Masclee, E. J. Schoon, F. van der Sommen, and P. H. N. D. with, “Colorectal polyp classification using confidence- calibrated convolutional neural networks,” in Medical Imaging, 2022. [38] S. Kapuria, T. G. Mohanraj, N. Venkatayogi, O. C. Kara, Y. Hirata, P. Minot, A. Kapusta, N. Ikoma, and F. Alambeigi, “Towards reliable colorectal cancer polyps classification via vision based tactile sensing and confidence-calibrated neural networks,” 2023. [39] V. Vovk, A. Gammerman, and G. Shafer, Algorithmic Learning in a Random World, 012005. [40] A. N. Angelopoulos and S. Bates, “A gentle introduction to conformal prediction and distribution-free uncertainty quantification,” ArXiv, vol. abs/2107.07511, 2021. [41] J. Vazquez and J. C. Facelli, “Conformal prediction in clinical medical sciences,” Journal of Healthcare Informatics Research, vol. 6, pp. 241 – 252, 2022. [42] H. Papadopoulos, A. Gammerman, and V. Vovk, “Confidence predictions for the diagnosis of acute abdominal pain,” in Artificial Intelligence Applications and Innovations, 2009. Attorney Docket No.10046-568WO1 [43] Y. Luo, A. A. R. Bsoul, and K. Najarian, “Confidence-based classification with dynamic conformal prediction and its applications in biomedicine,” 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 353–356, 2011. [44] T. Bellotti, Z. Luo, and A. Gammerman, “Reliable classification of childhood acute Leukaemia from gene expression data using confidence machines,” 2006 IEEE International Conference on Granular Computing, pp. 148–153, 2006. [45] L. M. Alnemer, L. Rajab, and I. Aljarah, “Conformal prediction technique to predict breast cancer survivability,” International journal of advanced science and technology, vol. 96, pp. 1–10, 2016. [46] H. Papadopoulos, E. C. Kyriacou, and A. N. Nicolaides, “Unbiased confidence measures for stroke risk estimation based on ultrasound carotid image analysis,” Neural Computing and Applications, vol. 28, pp.1209–1223, 2017. [47] O. C. Kara, N. Ikoma, and F. Alambeigi, “Hysense: A hyper- sensitive and high-fidelity vision-based tactile sensor,” ArXiv, vol. abs/2211.04571, 2022. [48] F. Ku¨ppers, J. Kronenberger, J. Schneider, and A. Haselhoff, “Bayesian confidence calibration for epistemic uncertainty modelling,” 2021 IEEE Intelligent Vehicles Symposium (IV), pp.466–472, 2021. [49] A. N. Angelopoulos, S. Bates, J. Malik, and M. I. Jordan, “Uncertainty sets for image classifiers using conformal prediction,” ArXiv, vol. abs/2009.14193, 2020. [50] J. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” 1999. [51] M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, pp. 183–233, 1999. [52] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley, “Stochastic variational inference,” ArXiv, vol. abs/1206.7051, 2012. [53] Y. Romano, M. Sesia, and E. J. Cande`s, “Classification with valid and adaptive coverage,” arXiv: Methodology, 2020. [54] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” CoRR, vol. abs/1511.07122, 2015. [55] F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 636–644, 2017. Attorney Docket No.10046-568WO1 [56] N. Venkatayogi, Q. Hu, O. C. Kara, T. G. Mohanraj, S. F. Atashzar, and F. Alambeigi, “Pit-pattern classification of colorectal cancer polyps using a hyper sensitive vision-based tactile sensor and dilated residual networks,” ArXiv, vol. abs/2211.06814, 2022. [57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large- scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. [58] N. Venkatayogi, O. C. Kara, J. Bonyun, N. Ikoma, and F. Alambeigi, “Classification of colorectal cancer polyps via transfer learning and vision-based tactile sensing,” ArXiv, vol. abs/2211.04573, 2022. Example 2 [1] Siegel, R. L., Miller, K. D., Wagle, N. S., and Jemal, A., “Cancer statistics, 2023,” CA: A Cancer Journal for Clinicians 73, 17 – 48 (2023). [2] Jemal, A., Thomas, A., Murray, T., and Thun, M. J., “Cancer statistics, 2002,” CA: A Cancer Journal for Clinicians 52 (2002). [3] Li, C., Oh, S. J., Kim, S., Hyung, W. J., Yan, M., Zhu, Z., and Noh, S. H., “Macroscopic borrmann type as a simple prognostic indicator in patients with advanced gastric cancer,” Oncology 77, 197 – 204 (2009). [4] Axon, A. T. R., Diebold, M. D., Fujino, M. A., Fujita, R., Genta, R. M., Gonvers, J. J., Guelrud, M. B., Inoue, H., Jung, M. E., Kashida, H., ei Kudo, S., Lambert, R., Lightdale, C. J., Nakamura, T., Neuhaus, H., Niwa, H., Ogoshi, K., Rey, J. F., Riddell, R. H., Sasako, M., Shimoda, T., Suzuki, H., Tytgat, G. N. J., Wang, K. K., Watanabe, H., Yamakawa, T., and Yoshida, S., “Update on the paris classification of superficial neoplastic lesions in the digestive tract.,” Endoscopy 376, 570–8 (2005). [5] ei Kudo, S., Hirota, S., Nakajima, T., Hosobe, S., Kusaka, H., Kobayashi, T., Himori, M., and Yagyuu, A., “Colorectal tumours and pit pattern.,” Journal of Clinical Pathology 47, 880 – 885 (1994). [6] chun Lou, G., min Yang, J., shun Xu, Q., Huang, W., and Shi, S., “A retrospective study on endoscopic missing diagnosis of colorectal polyp and its related factors.,” The Turkish journal of gastroenterology : the official journal of Turkish Society of Gastroenterology 25 Suppl 1, 182–6 (2014). [7] Mori, Y., ei Kudo, S., Berzin, T. M., Misawa, M., and Takeda, K., “Computer-aided diagnosis for colonoscopy,” Endoscopy 49, 813 – 819 (2017). Attorney Docket No.10046-568WO1 [8] Zhao, S.-B., Wang, S., Pan, P., Xia, T., Chang, X., Yang, X., Guo, L., Meng, Q.-Q., Yang, F., Qian, W., Xu, Z., Wang, Y., Wang, Z., Gu, L., Wang, R., Jia, F., Yao, J., Li, Z., and Bai, Y., “Magnitude, risk factors, and factors associated with adenoma miss rate of tandem colonoscopy: A systematic review and meta-analysis.,” Gastroenterology 1566, 1661– 1674.e11 (2019). [9] Kara, O. C., Venkatayogi, N., Ikoma, N., and Alambeigi, F., “A reliable and sensitive framework for simultaneous type and stage detection of colorectal cancer polyps,” Annals of Biomedical Engineering (2023). [10] Wang, A., Mo, J., Zhong, C., Wu, S., Wei, S., Tu, B., Liu, C., Chen, D., Xu, Q., Cai, M., Li, Z., Xie, W., Xie, M., Kato, M., Xi, X., and Zhang, B., “Artificial intelligence- assisted detection and classification of colorectal polyps under colonoscopy: a systematic review and meta-analysis,” Annals of Translational Medicine 9 (2021). [11] Shin, Y. and Balasingham, I., “Comparison of hand-craft feature based svm and cnn based deep learning framework for automatic polyp classification,” 201739th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) , 3277–3280 (2017). [12] Viscaino, M., Bustos, J. T., Mun˜oz, P., Cheein, C. A., and Chee´ın, F. A. A., “Artificial intelligence for the early detection of colorectal cancer: A comprehensive review of its advantages and misconceptions,” World Journal of Gastroenterology 27, 6399 – 6414 (2021). [13] Ribeiro, E. F., Uhl, A., Wimmer, G., and H¨afner, M., “Exploring deep learning and transfer learning for colonic polyp classification,” Computational and Mathematical Methods in Medicine 2016 (2016). [14] S´anchez-Peralta, L. F., Bote-Curiel, L., Pic´on, A., S´anchez-Margallo, F. M., and Pagador, J. B., “Deep learning to find colorectal polyps in colonoscopy: A systematic literature review,” Artificial intelligence in medicine 108, 101923 (2020). [15] Wang, W., Tian, J., Zhang, C., Luo, Y., Wang, X., and Li, J., “An improved deep learning approach and its applications on colonic polyp images detection,” BMC Medical Imaging 20 (2020). [16] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 770–778 (2016). Attorney Docket No.10046-568WO1 [17] Huang, G., Liu, Z., and Weinberger, K. Q., “Densely connected convolutional networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2261–2269 (2016). [18] Krizhevsky, A., Sutskever, I., and Hinton, G. E., “Imagenet classification with deep convolutional neural networks,” Communications of the ACM 60, 84 – 90 (2012). [19] Zhou, S., Chen, C., Han, G., and Hou, X., “Deep convolutional neural network with dilated convolution using small size dataset,” 2019 Chinese Control Conference (CCC) , 8568–8572 (2019). [20] Poudel, S., Kim, Y. J., Vo, D. M., and Lee, S.-W., “Colorectal disease classification using efficiently scaled dilation in convolutional neural network,” IEEE Access 8, 99227–99238 (2020). [21] Krause, J., Grabsch, H. I., Kloor, M., Jendrusch, M., Echle, A., Bu¨low, R. D., Boor, P., Luedde, T., Brinker, T. J., Trautwein, C., Pearson, A. T., Quirke, P., Jenniskens, J. C. A., Offermans, K., van den Brandt, P. A., and Kather, J. N., “Deep learning detects genetic alterations in cancer histology generated by adversarial networks,” The Journal of Pathology 254 (2021). [22] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. C., and Bengio, Y., “Generative adversarial networks,” Commun. ACM 63, 139–144 (2014). [23] Ho, J., Jain, A., and Abbeel, P., “Denoising diffusion probabilistic models,” ArXiv abs/2006.11239 (2020). [24] Rezende, D. J., Mohamed, S., and Wierstra, D., “Stochastic backpropagation and approximate inference in deep generative models,” in [International Conference on Machine Learning ], (2014). [25] Dinh, L., Sohl-Dickstein, J. N., and Bengio, S., “Density estimation using real nvp,” ArXiv abs/1605.08803 (2016). [26] Price, W. N. and Cohen, I. G., “Privacy in the age of medical big data,” Nature Medicine 25, 37–43 (2019). [27] Deshpande, S., Minhas, F. A., and Rajpoot, N. M., “Train small, generate big: Synthesis of colorectal cancer histology images,” in [SASHIMI@MICCAI ], (2020). [28] Deshpande, S., Minhas, F. A., Graham, S., and Rajpoot, N. M., “Safron: Stitching across the frontier for generating colorectal cancer histology images,” ArXiv abs/2008.04526 (2020). Attorney Docket No.10046-568WO1 [29] Venkatayogi, N., Hu, Q., Kara, O. C., Mohanraj, T. G., Atashzar, S. F., and Alambeigi, F., “Pit-pattern classification of colorectal cancer polyps using a hyper sensitive vision-based tactile sensor and dilated residual networks,” ArXiv abs/2211.06814 (2022). [30] Venkatayogi, N., Kara, O. C., Bonyun, J., Ikoma, N., and Alambeigi, F., “Classification of colorectal cancer polyps via transfer learning and vision-based tactile sensing,” ArXiv abs/2211.04573 (2022). [31] Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” ArXiv abs/1505.04597 (2015). [32] Kara, O. C., Ikoma, N., and Alambeigi, F., “Hysense: A hyper-sensitive and high-fidelity vision-based tactile sensor,” ArXiv abs/2211.04571 (2022). [33] Kara, O. C., Kim, H. S., Xue, J., Mohanraj, T. G., Hirata, Y., Ikoma, N., and Alambeigi, F., “Design and development of a novel soft and inflatable tactile sensing balloon for early diagnosis of colorectal cancer polyps,” ArXiv (2023). [34] Kim, H., Kara, O. C., and Alambeigi, F., “A soft and inflatable vision-based tactile sensor for inspection of constrained and confined spaces,” IEEE Sensors Journal (2023). [35] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., and Brew, J., “Huggingface’s transformers: State-of- the-art natural language processing,” ArXiv abs/1910.03771 (2019). [36] Arora, A. and Arora, A., “Generative adversarial networks and synthetic patient data: current challenges and future perspectives,” Future Healthcare Journal 9, 190 – 193 (2022). [37] Huang, K., Wang, Y., Tao, M., and Zhao, T., “Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective,” Advances in neural information processing systems 33, 2698–2709 (2020). [38] Wang, Y., Feng, Z., Song, L., Liu, X., and Liu, S., “Multiclassification of endoscopic colonoscopy images based on deep transfer learning,” Computational and Mathematical Methods in Medicine 2021 (2021). [39] Yu, F., Koltun, V., and Funkhouser, T. A., “Dilated residual networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 636–644 (2017). [40] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition,” CoRR abs/1409.1556 (2014). Attorney Docket No.10046-568WO1 [41] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., “Imagenet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition , 248–255 (2009). [42] Kapuria, S., Mohanraj, T. G., Venkatayogi, N., Kara, O. C., Hirata, Y., Minot, P., Kapusta, A., Ikoma, N., and Alambeigi, F., “Towards reliable colorectal cancer polyps classification via vision based tactile sensing and confidence-calibrated neural networks,” (2023). [43] Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” CoRR abs/1412.6980 (2014). [44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 1–9 (2014). Example 3 [1] H. Sung, et al., “Global cancer statistics 2020: Globocan estimates of incidence and mortality worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for Clinicians, vol.71, pp.209 – 249, 2021. [2] National Cancer Institute, “Seer cancer stat facts: Stomach cancer,” 2021. [3] R. M. Gore, et al., “Stomach malignant tumors,” 2013. [4] M. E. Lockhart and C. L. Canon, “Epidemiology of gastric cancer,” 2009. [5] Q. Zhang, et al., “Training in early gastric cancer diagnosis improves the detection rate of early gastric cancer,” Medicine, vol.94, 2015. [6] L. Ginzburg, D. A. Greenwald, and J. Cohen, “Complications of endoscopy.” Gastrointestinal endoscopy clinics of North America, vol.172, pp.405–32, 2007. [7] Y. Li, X. Li, X. Xie, and L. Shen, “Deep learning based gastric cancer identification,” 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp.182–185, 2018. [8] B. Huang, et al., “Accurate diagnosis and prognosis prediction of gastric cancer using deep learning on digital pathological images: A retrospective multicentre study,” EBioMedicine, vol.73, 2021. [9] T. Hirasawa, et al., “Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images,” Gastric Cancer, vol.21, pp.653– 660, 2018. Attorney Docket No.10046-568WO1 [10] J. Xia, et al., “Use of artificial intelligence for detection of gastric lesions by magnetically controlled capsule endoscopy.” Gastrointestinal endoscopy, 2020. [11] S. Fotouhi, S. Asadi, and M. W. Kattan, “A comprehensive data level analysis for cancer diagnosis on imbalanced data,” Journal of biomedical informatics, vol.90, p.103089, 2019. [12] American Medical Association, “Ama passes first policy recommendations on augmented intelligence,” 2023. [13] C. D. Christou and G. Tsoulfas, “Challenges and opportunities in the application of artificial intelligence in gastroenterology and hepatology,” World Journal of Gastroenterology, vol.27, pp.6191 – 6223, 2021. [14] P. Jin, et al., “Artificial intelligence in gastric cancer: a systematic review,” Journal of Cancer Research and Clinical Oncology, pp.1– 12, 2020. [15] O. C. Kara, et al., “A smart handheld edge device for on-site diagnosis and classification of texture and stiffness of excised colorectal cancer polyps,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.4662–4668, 2023. [16] N. Venkatayogi, et al., “On the potentials of surface tactile imaging and dilated residual networks for early detection of colorectal cancer polyps,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.4655–4661, 2023. [17] O. C. Kara, N. Venkatayogi, N. Ikoma, and F. Alambeigi, “A reliable and sensitive framework for simultaneous type and stage detection of colorectal cancer polyps,” Annals of Biomedical Engineering, 2023. [18] S. ei Kudo, et al., “Colorectal tumours and pit pattern.” Journal of Clinical Pathology, vol.47, pp.880 – 885, 1994. [19] L. Wang, et al., “Comparing single oral contrast-enhanced ultrasonography and double contrast-enhanced ultrasonography in the preoperative borrmann classification of advanced gastric cancer,” Oncotarget, vol.9, pp.8716 – 8724, 2015. [20] K. Hosoda, M. Watanabe, and K. Yamashita, “Re-emerging role of macroscopic appearance in treatment strategy for gastric cancer,” Annals of Gastroenterological Surgery, vol.3, pp.122 – 129, 2018. [21] O. C. Kara, N. Ikoma, and F. Alambeigi, “Hysense: A hyper-sensitive and high- fidelity vision-based tactile sensor,” ArXiv, vol. abs/2211.04571, 2022. [22] C. Hennersperger, et al., “Towards mri-based autonomous robotic us acquisitions: a first feasibility study,” IEEE transactions on medical imaging, vol.36, no.2, pp.538–548, 2017. Attorney Docket No.10046-568WO1 [23] R. Hartley and A. Zisserman, “Multiple view geometry in computer vision,” null, 2000. [24] D. Brown, “Decentering distortion of lenses,” null, 1966. [25] Z. Zhang and Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. [26] R. Horaud and F. Dornaika, “Hand-eye calibration,” The International Journal of Robotics Research, 1995. [27] S. Kapuria, et al., “Towards reliable colorectal cancer polyps classification via vision based tactile sensing and confidence-calibrated neural networks,” 2023 International Symposium on Medical Robotics (ISMR), pp.1–7, 2023. [28] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [29] F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.636–644, 2017. [30] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770–778, 2016. [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol.60, pp.84 – 90, 2012. [32] L. N. Smith and N. Topin, “Super-convergence: very fast training of neural networks using large learning rates,” in Defense + Commercial Sensing, 2018. [33] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” arXiv: Learning, 2016. [34] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [35] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” ArXiv, vol. abs/1902.09843, 2019. [36] L. Biewald, “Experiment tracking with weights and biases,” 2020, software available from wandb.com Example 4 [1] Sung, H., Ferlay, J., Siegel, R. L., Laversanne, M., Soerjomataram, I., Jemal, A., and Bray, F., “Global cancer statistics 2020: Globocan estimates of incidence and mortality Attorney Docket No.10046-568WO1 worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for Clinicians 71, 209 – 249 (2021). [2] National Cancer Institute, “Seer cancer stat facts: Stomach cancer,” (2021). [3] Gore, R. M., Thakrar, K. H., Newmark, G. M., Wenzke, D. R., Mehta, U. K., and Berlin, J. W., “Stomach malignant tumors,” (2013). [4] Lockhart, M. E. and Canon, C. L., “Epidemiology of gastric cancer,” (2009). [5] Zhang, Q., yu Chen, Z., di Chen, C., Liu, T., Tang, X., Ren, Y., lin Huang, S., Cui, X., li An, S., Xiao, B., Bai, Y., Liu, S.-D., Jiang, B., chao Zhi, F., Gong, W., and Nzeako, U. C., “Training in early gastric cancer diagnosis improves the detection rate of early gastric cancer,” Medicine 94 (2015). [6] Ginzburg, L., Greenwald, D. A., and Cohen, J., “Complications of endoscopy.,” Gastrointestinal endoscopy clinics of North America 172, 405–32 (2007). [7] Li, Y., Li, X., Xie, X., and Shen, L., “Deep learning based gastric cancer identification,” 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) , 182–185 (2018). [8] Huang, B., Tian, S., Zhan, N., Ma, J., Huang, Z., Zhang, C., Zhang, H., Ming, F., Liao, F., Ji, M., Zhang, J., Liu, Y., He, P., Deng, B.-B., yuan Hu, J., and Dong, W., “Accurate diagnosis and prognosis prediction of gastric cancer using deep learning on digital pathological images: A retrospective multicentre study,” EBioMedicine 73 (2021). [9] Hirasawa, T., Aoyama, K., Tanimoto, T., Ishihara, S., Shichijo, S., Ozawa, T., Ohnishi, T., Fujishiro, M., Matsuo, K., Fujisaki, J., and Tada, T., “Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images,” Gastric Cancer 21, 653–660 (2018). [10] Xia, J., Xia, T., Pan, J., Gao, F., Wang, S., Qian, Y., Wang, H., Zhao, J., Jiang, X., Zou, W., Wang, Y.-C., Zhou, W., Li, Z., and Liao, Z., “Use of artificial intelligence for detection of gastric lesions by magnetically controlled capsule endoscopy.,” Gastrointestinal endoscopy (2020). [11] Fotouhi, S., Asadi, S., and Kattan, M. W., “A comprehensive data level analysis for cancer diagnosis on imbalanced data,” Journal of biomedical informatics 90, 103089 (2019). [12] American Medical Association, “Ama passes first policy recommendations on augmented intelligence,” (2023). [13] Christou, C. D. and Tsoulfas, G., “Challenges and opportunities in the application of artificial intelligence in gastroenterology and hepatology,” World Journal of Gastroenterology 27, 6191 – 6223 (2021). Attorney Docket No.10046-568WO1 [14] Kara, O. C., Xue, J., Venkatayogi, N., Mohanraj, T. G., Hirata, Y., Ikoma, N., Atashzar, S. F., and Alambeigi, F., “A smart handheld edge device for on-site diagnosis and classification of texture and stiffness of excised colorectal cancer polyps,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 4662–4668 (2023). [15] Venkatayogi, N., Hu, Q., Kara, O. C., Mohanraj, T. G., Atashzar, S. F., and Alambeigi, F., “On the potentials of surface tactile imaging and dilated residual networks for early detection of colorectal cancer polyps,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , 4655–4661 (2023). [16] Kara, O. C., Venkatayogi, N., Ikoma, N., and Alambeigi, F., “A reliable and sensitive framework for simultaneous type and stage detection of colorectal cancer polyps,” Annals of Biomedical Engineering (2023). [17] Kim, H., Kara, O. C., and Alambeigi, F., “A soft and inflatable vision-based tactile sensor for inspection of constrained and confined spaces,” IEEE Sensors Journal (2023). [18] Kapuria, S., Ikoma, N., Chinchali, S., and Alambeigi, F., “Enhancing colorectal cancer diagnosis through generative models and vision-based tactile sensing: a sim2real study,” in [Medical Imaging ], (2024). [19] Kapuria, S., Bonyun, J., Kulkarni, Y., Ikoma, N., Chinchali, S., and Alambeigi, F., “Robot- enabled machine learning-based diagnosis of gastric cancer polyps using partial surface tactile imaging,” ArXiv abs/2408.01554 (2022). [20] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B., “High- resolution image synthesis with latent diffusion models,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 10674–10685 (2021). [21] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S., “Gans trained by a two time- scale update rule converge to a local nash equilibrium,” in [NIPS ], (2017). [22] Kara, O. C., Ikoma, N., and Alambeigi, F., “Hysense: A hyper-sensitive and high- fidelity vision-based tactile sensor,” ArXiv abs/2211.04571 (2022). [23] Kapuria, S., Mohanraj, T. G., Venkatayogi, N., Kara, O. C., Hirata, Y., Minot, P. R., Kapusta, A., Ikoma, N., and Alambeigi, F., “Towards reliable colorectal cancer polyps classification via vision based tactile sensing and confidence-calibrated neural networks,” 2023 International Symposium on Medical Robotics (ISMR) , 1–7 (2023). [24] Lopez, R., Boyeau, P., Yosef, N., Jordan, M. I., and Regier, J., “Auto-encoding variational bayes,” (2020). Attorney Docket No.10046-568WO1 [25] Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” ArXiv abs/1505.04597 (2015). [26] Song, J., Meng, C., and Ermon, S., “Denoising diffusion implicit models,” ArXiv abs/2010.02502 (2020). [27] Zhao, W., Bai, L., Rao, Y., Zhou, J., and Lu, J., “Unipc: A unified predictor- corrector framework for fast sampling of diffusion models,” ArXiv abs/2302.04867 (2023). [28] He, K., Zhang, X., Ren, S., and Sun, J., “Deep residual learning for image recognition,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 770–778 (2016). [29] Yu, F., Koltun, V., and Funkhouser, T. A., “Dilated residual networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 636–644 (2017). [30] Lin, S., Liu, B., Li, J., and Yang, X., “Common diffusion noise schedules and sample steps are flawed,” 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , 5392–5399 (2023). [31] Ho, J., “Classifier-free diffusion guidance,” ArXiv abs/2207.12598 (2022). [32] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L., “Imagenet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition , 248–255 (2009). [33] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 1–9 (2014).

Claims

Attorney Docket No.10046-568WO1 Claims What is claimed is: 1. A method for post-processing a machine learning classification, the method comprising: classifying a dataset as one or more classes using a trained machine learning classifier and calculating an associated probability score for each of the one or more classes; calibrating the probability score of the classification of the trained machine learning classifier using a regression operator; and displaying, as a report, the calibrated probability score for the classification of the machine learning classifier. 2. The method of claim 1, further comprising ranking the one or more classes of the classification using a Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm. 3. The method of claim 2, the method further comprising receiving an error rate variable from a user using an input means. 4. The method of any one of claims 1-3, wherein calibrating temperature scaling, variational temperature scaling, or a combination thereof. 5. The method of claim 4, wherein the calibrated probability of the ranked classes is output up to one minus the error rate variable. 6. The method of any one of claims 1-5, wherein a penultimate layer of the machine learning classifier comprises a Softmax function. 7. The method of any one of claims 1-6, wherein the report or display comprises the probability score and associated class. 8. The method of any one of claims 1-7, wherein the report or display comprises the probability score, associated class, and a visual representation of the dataset. 9. The method of any one of claims 1-8, wherein the dataset comprises a medical image. Attorney Docket No.10046-568WO1 10. The method of any one of claims 1-9, wherein the dataset comprises an engineering image. 11. The method of any one of claims 1-10, wherein the dataset comprises a synthetically produced image. 12. The method of claim 11, wherein the machine learning classifier is trained on a plurality of synthetically produced images. 13. The method of claims 11 or 12, wherein a generative model is used to produce the synthetic image or plurality of synthetic images. 14. The method of any one of claims 1-13, wherein the machine learning classifier is trained on a plurality of augmented images. 15. The method of claim 14, wherein augmentation of the plurality of augmented images comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. 16. The method of any one of claims 1-15, wherein the machine learning classifier comprises a support vector machine (SVM), a neural network, a convoluted neural network (CNN), a densely connected CNN, a residual network, or a combination thereof. 17. The method of any one of claims 12-16, wherein a class distribution is determined using cross-validation of the trained machine learning classifier. 18. The method of claim 17, wherein synthetic data are produced based on the class distribution of the machine learning classifier, and wherein the machine learning classifier is retrained using the same. Attorney Docket No.10046-568WO1 19. The method of any one of claims 1-18, the method further comprises calculating a heat map of the dataset. 20. The method of claim 19, wherein the heat map of the dataset is trained using color matching. 21. The method of claim 20, the method further comprising adding descriptive text to the report. 22. The method of claim 21, wherein the descriptive text may be generated by a natural language model. 23. A system for post-processing a machine learning classification, the system comprising: one or more processors; an output device; and a memory, the memory storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method comprising: classifying a dataset as one or more classes using a machine learning classifier and calculating an associated probability score for each of the one or more classes to generate a trained machine learning classifier; calibrating the probability score of the classification of the machine learning classifier using a regression operator; and displaying as a report or display, on the output device, the calibrated probability score for the classification of the machine learning classifier. 24. The system of claim 23, wherein the one or more processors are connected by any communication means. 25. The system of claims 23 or 24, wherein the trained machine learning classifier is executed on a first processor of the one or more processors. Attorney Docket No.10046-568WO1 26. The system of any one of claims 23-25, the system being configured to rank the one or more classes of the classification using a Naïve Conformal prediction algorithm or regularized adaptive prediction sets algorithm. 27. The system of any one of claims 23-36, the system being configured to receive an error rate variable from a user using an input means. 28. The system of any one of claims 23-27, wherein calibrating comprises temperature scaling, variational temperature scaling, or a combination thereof. 29. The system of any one of claims 23-28, wherein the calibrated probability of the ranked classes is output up to one minus the error rate variable. 30. The system of any one of claims 23-29, wherein a penultimate layer of the machine learning classifier comprises a Softmax function. 31. The system of any one of claims 23-30, wherein the report or display comprises the probability score and associated class. 32. The system of any one of claims 23-31, wherein the report comprises the probability score, associated class, and a visual representation of the dataset. 33. The system of any one of claims 23-32, wherein the dataset comprises a medical image. 34. The system of any one of claims 23-33, wherein the dataset comprises an engineering image. 35. The system for post-processing classifications of any one of claims 23-34, wherein the dataset comprises a synthetically produced image. 36. The system of any one of claims 23-35, wherein the machine learning classifier is trained on a plurality of synthetically produced images. Attorney Docket No.10046-568WO1 37. The system of any one of claims 35-36, wherein a generative model is used to produce the synthetic image or plurality of synthetic images. 38. The system of any one of claims 23-37, wherein the machine learning classifier is trained on a plurality of augmented images. 39. The system of claim 38, wherein augmentation of the plurality of augmented images comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. 40. The system of any one of claims 23-39, wherein the machine learning classifier comprises, a support vector machine (SVM), a neural network, a convoluted neural network (CNN), a densely connected CNN, a residual network, or a combination thereof. 41. The system of any one of claims 36-39, wherein a class distribution is determined using cross-validation of the trained machine learning classifier. 42. The system of claim 41, wherein synthetic data are produced based on the class distribution of the machine learning classifier, and wherein the machine learning classifier is retrained using the same. 43. The system of any one of claims 23-42, the method further comprises calculating a heat map of the dataset. 44. The system of claim 43, wherein the heat map of the dataset is trained using color matching. 45. The system of claim 44, the method further comprising adding descriptive text to the output display. Attorney Docket No.10046-568WO1 46. The system of claim 45, wherein the descriptive text may be generated by a natural language model. 47. The system of any one of claims 23-46, wherein the dataset is a plurality of medical images. 48. An interactive artificial intelligence system, the system comprising: one or more processors; an output device; an input device; and two or more data storage devices, a first data storage device storing instructions thereon, that when executed by the one or more processors, causes the one or more processors to perform a method, the method comprising: classifying an image and generating an associated probability score using a trained machine learning classifier; receiving, from a user by an input device, an error rate variable value; calibrating the probability score of the image classification of the machine learning classifier using a regression operator wherein the confidence of the calibrated probability score is related to the user’s error rate variable value; calculating an attention map of the image; adding descriptive text to the attention map; displaying, on the output device, the calibrated probability score for the image classification of the machine learning classifier, the attention map and descriptive text. 49. The interactive artificial intelligence system of claim 48, wherein the method comprises receiving an audio description of a medical image, converting the audio description to text using a natural language model, and generating a synthetic medical image based on the converted text description. 50. The interactive artificial intelligence system of claims 48 or 49, wherein the descriptive text is generated from a natural language model. Attorney Docket No.10046-568WO1 51. The interactive artificial intelligence system of any one of claims 48-50, wherein the trained machine learning classifier is trained on a plurality of synthetic medical images. 52. The interactive artificial intelligence system of claim 51, wherein the plurality of synthetic medical images is augmented before training the machine learning classifier. 53. The interactive artificial intelligence system of claim 52, wherein augmented comprises adding random noise, random blur, random rotations, random cropping, vertical and horizontal flips, or any combination thereof. 54. The interactive artificial intelligence system of any one of claims 51-53, wherein a class distribution is determined using cross-validation of the trained machine learning classifier. 55. The interactive artificial intelligence system of claim 54, wherein synthetic data are produced based on the class distribution of the machine learning classifier, and wherein the machine learning classifier is retrained using the same.