WO2025064547A1 - Machine learning classification of solid tumors based on gene expression - Google Patents
- Publication number
- WO2025064547A1 (PCT/US2024/047289)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cancer
- solid tumor
- tumor
- patient
- genes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
- G16B25/10—Gene or protein expression profiling; Expression-ratio estimation or normalisation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Definitions
- lung nodules are common, often detected in screenings of patients experiencing no symptoms of lung disease. Among subjects having lung nodules, only a fraction are eventually diagnosed with a cancer. Noncancerous causes of lung nodules can include e.g., mycobacterial or fungal infection, autoimmune diseases, air pollutants, and scarring from previous insult. Large lung nodules typically warrant an invasive biopsy or removal by thoracic surgery. The percentage of lung nodules eventually identified as cancerous has been estimated to be as low as 40%. Given the potential harm of biopsy or thoracic surgery, less invasive testing for lung cancer is needed.
- the present disclosure provides systems and methods for assessing a solid tumor of a patient using machine learning and uses thereof. In some embodiments, the methods and systems assess a solid tumor of a patient by classifying gene expression with machine learning.
- PBMCs peripheral blood mononuclear cells
- the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor; (ii) a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer; (iii) an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer; (iv) a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer; (v) a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer; (vi) a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer; (vii) an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer; (viii) a brain tumor, and the one or more
- a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant
- the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, Attorney Docket No.225234-718601/PCT based at least in part on one or more predictors
- the A predictors have top 5 to 200 feature importance values.
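One way to read the statement above is that the model keeps only the A highest-ranked predictors, with A between 5 and 200. The sketch below assumes feature importance values are already available as a mapping from predictor name to score; the function name `top_predictors` and the example values are hypothetical and are not taken from the disclosure.

```python
def top_predictors(importances, a):
    """Return the names of the `a` predictors with the highest importance values."""
    # Assumption: A is read as a count of predictors between 5 and 200.
    if not 5 <= a <= 200:
        raise ValueError("A is expected to lie between 5 and 200")
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:a]]


# Hypothetical importance values for illustration only.
example_importances = {f"gene_{i}": 1.0 / (i + 1) for i in range(300)}
selected = top_predictors(example_importances, a=5)
```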
- the trained machine learning model has: (i) an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) a sensitivity at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at
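The accuracy, sensitivity, and specificity thresholds listed above are computed from true and predicted labels. A minimal sketch, assuming malignant is coded as 1 and benign as 0; the label coding and the toy labels are illustrative assumptions, not data from the disclosure.

```python
def _ratio(numerator, denominator):
    # Return NaN instead of raising when a class is absent from the labels.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = malignant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, len(y_true)),
        "sensitivity": _ratio(tp, tp + fn),  # fraction of malignant tumors detected
        "specificity": _ratio(tn, tn + fp),  # fraction of benign tumors cleared
    }


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```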
- the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
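As an illustration of one algorithm from the list above, the following sketch trains a logistic regression by batch gradient descent on a toy two-gene data set. The learning rate, epoch count, feature values, and function names are assumptions chosen for the example; the disclosure does not specify them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    """Model output between 0 and 1, read here as the probability of malignancy."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy expression values for two hypothetical genes; 1 = malignant, 0 = benign.
X = [[2.0, 0.5], [1.8, 0.7], [2.2, 0.4], [0.3, 1.9], [0.5, 2.1], [0.2, 1.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
```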
- the solid tumor is a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor. In some embodiments, the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the brain tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the kidney tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- the machine-learning model is trained according to any of the methods described herein.
- the method described herein wherein: (i) the patient has cancer; (ii) the patient does not have cancer; (iii) the patient is at an elevated risk of having cancer; or (iv) the patient is asymptomatic for cancer; optionally wherein the cancer is a pancreatic cancer, an ovarian cancer, or a brain cancer.
- the method described herein further comprising administering a treatment based on a solid tumor of a patient being classified as malignant.
- the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- the inference includes a confidence value between 0 and 1 that the solid tumor is malignant.
- the method described herein comprising: (i) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than
- a machine learning model is trained and has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
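The area under the ROC curve equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, with ties counted as one half. A minimal sketch of that pairwise computation follows; the scores and labels are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels (1 = malignant) and model scores for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of 9 positive/negative pairs are ordered correctly
```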
- ROC receiver operating characteristic
- a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
- the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof
- the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
- systems for assessing a solid tumor of a patient comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the
- non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a data set as an input to a machine learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the inference indicating whether a composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
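The obtain, provide, receive, and report steps recited above can be sketched as a small function. Everything named here is hypothetical: the `Report` fields, the 0.5 decision threshold, the gene names, and `toy_model`, which merely stands in for a trained classifier that returns a malignancy score between 0 and 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Report:
    patient_id: str
    malignancy_score: float
    classification: str


def assess_tumor(
    patient_id: str,
    expression: Dict[str, float],
    model: Callable[[Dict[str, float]], float],
    threshold: float = 0.5,  # assumed cut-off, not specified in the disclosure
) -> Report:
    """Feed the patient data set to the model and turn the inference into a report."""
    score = model(expression)
    if not 0.0 <= score <= 1.0:
        raise ValueError("model output must lie between 0 and 1")
    label = "malignant" if score >= threshold else "benign"
    return Report(patient_id, score, label)


def toy_model(expression: Dict[str, float]) -> float:
    # Hypothetical stand-in for a trained model; clamps a linear score to [0, 1].
    raw = 0.5 + 0.1 * (expression["GENE_A"] - expression["GENE_B"])
    return min(1.0, max(0.0, raw))


report = assess_tumor("patient-001", {"GENE_A": 7.2, "GENE_B": 3.1}, toy_model)
```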
- a method for obtaining a gene set capable of classifying whether a patient has cancer comprising: (a) providing a dataset as an input to a machine learning classifier, said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of gene modules form features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer.
- the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm.
- feature selection comprises starting from a featureless model and sequentially adding the next best feature using leave-one-group-in importance (LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved.
- the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001.
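The forward selection described above can be sketched as a greedy loop over gene modules. This is an illustration under stated assumptions, not the disclosed SGFI implementation: a leave-one-out nearest-centroid classifier is swapped in as the learner, the featureless model predicts the majority class, each class needs at least two samples, and the data and module names are invented.

```python
def mmce(X, y, cols):
    """Leave-one-out mean misclassification error of a nearest-centroid classifier."""
    if not cols:
        # Featureless model: always predict the majority class.
        majority = max(set(y), key=y.count)
        return sum(1 for label in y if label != majority) / len(y)
    errors = 0
    for i in range(len(X)):
        dist = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            centroid = [sum(r[c] for r in rows) / len(rows) for c in cols]
            dist[label] = sum((X[i][c] - m) ** 2 for c, m in zip(cols, centroid))
        if min(dist, key=dist.get) != y[i]:
            errors += 1
    return errors / len(X)


def forward_select_modules(X, y, modules, threshold=0.001):
    """Add the best module each round until MMCE improves by no more than `threshold`."""
    selected = []
    best_error = mmce(X, y, [])
    while True:
        candidates = {}
        for name in modules:
            if name not in selected:
                cols = [c for m in selected + [name] for c in modules[m]]
                candidates[name] = mmce(X, y, cols)
        if not candidates:
            break
        best = min(candidates, key=candidates.get)
        if best_error - candidates[best] <= threshold:
            break
        selected.append(best)
        best_error = candidates[best]
    return selected, best_error


# Toy data: column 0 separates the classes, columns 1 and 2 are noise.
X = [[5.0, 1.0, 0.2], [5.2, 0.0, 0.9], [4.8, 1.0, 0.1], [5.1, 0.0, 0.8],
     [1.0, 1.0, 0.3], [1.2, 0.0, 0.7], [0.9, 1.0, 0.2], [1.1, 0.0, 0.9]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
modules = {"module_A": [0], "module_B": [1, 2]}  # module name -> column indices
selected_modules, final_error = forward_select_modules(X, y, modules)
```

Here the threshold of 0.001 is one of the values listed above; the loop stops as soon as the best remaining module fails to lower the error by more than that amount.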
- the dataset is a batch corrected dataset.
- the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurement from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules.
- the M genes are clustered based on protein-protein interaction of proteins encoded by the M genes.
- the M genes are M most variably expressed genes of the initial data set.
- M is 500 to 10000.
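The module construction described above (pick the M most variably expressed genes, then cluster them by protein-protein interaction) can be sketched as follows. The clustering rule is an assumption: the disclosure does not name an algorithm, so connected components of the interaction graph are used here as a simple stand-in. Gene names, expression values, and interactions are invented, and M is set to 3 only to keep the example small.

```python
from collections import defaultdict
from statistics import variance


def most_variable_genes(expression, m):
    """Return the m genes with the largest sample variance across reference samples."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:m]


def cluster_by_ppi(genes, interactions):
    """Group genes into modules as connected components of the PPI graph."""
    gene_set = set(genes)
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a in gene_set and b in gene_set:  # ignore edges that leave the gene set
            neighbors[a].add(b)
            neighbors[b].add(a)
    modules, seen = [], set()
    for gene in genes:
        if gene in seen:
            continue
        seen.add(gene)
        stack, component = [gene], []
        while stack:
            current = stack.pop()
            component.append(current)
            for nb in neighbors[current]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        modules.append(sorted(component))
    return modules


# Toy expression matrix: gene -> values across four reference samples.
expression = {
    "GENE_A": [1.0, 9.0, 2.0, 8.0],
    "GENE_B": [5.0, 5.1, 4.9, 5.0],
    "GENE_C": [3.0, 6.0, 3.5, 5.5],
    "GENE_D": [2.0, 2.0, 2.0, 2.0],
    "GENE_E": [0.0, 4.0, 1.0, 3.0],
}
interactions = [("GENE_A", "GENE_C"), ("GENE_E", "GENE_B"), ("GENE_C", "GENE_D")]
variable_genes = most_variable_genes(expression, m=3)
gene_modules = cluster_by_ppi(variable_genes, interactions)
```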
- the method described herein further comprising analyzing a patient data set comprising or derived from gene expression measurement of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurement is obtained from a biological sample obtained or derived from the patient.
- the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157,
- the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- the method described herein wherein the method: (i) classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%
- analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set.
- the method described herein further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference.
- the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer.
- the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof.
- the method described herein further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer.
- the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
- the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer.
- the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) described herein as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine learning model the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient.
- the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
- the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- the patient data set is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof.
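Two of the simpler derivations named above, log2 expression and Z-score, can be sketched for a single gene measured across samples. The pseudocount of 1 and the use of the sample standard deviation are assumptions made for this example; the disclosure does not fix either choice.

```python
import math
from statistics import mean, stdev


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount avoids log2(0) for unexpressed genes."""
    return [math.log2(v + pseudocount) for v in values]


def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant gene: no variation to scale
    return [(v - mu) / sigma for v in values]


# Toy expression values for one gene across four samples.
counts = [0.0, 3.0, 7.0, 15.0]
log_values = log2_transform(counts)
z_values = z_scores(log_values)
```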
- GSVA gene set variation analysis
- GSEA gene set enrichment analysis
- MEGENA multiscale embedded gene co-expression network analysis
- WGCNA weighted gene co-expression network analysis
- the machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- the method classifies whether the patient has cancer: (i) with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at
- the machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer.
- the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof.
- the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer.
- the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- the method described herein further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer.
- the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
- FIG.1A is a receiver operating characteristic (ROC) plot showing performance of eight machine learning classifiers using a set of 1,178 gene features generated from ribonucleic acid (RNA) sequencing (RNA-Seq) data to distinguish malignant lung nodules versus benign lung nodules.
- RNA ribonucleic acid
- RNA-Seq ribonucleic acid sequencing
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.1B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 1,178 gene features to distinguish malignant lung nodules versus benign lung nodules.
- FIG.2A is a ROC plot for an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules based on an analysis of RNA-Seq data.
- the six machine learning classifiers include LOG, GLM, kNN, RF, SVM, and GBM.
- FIG.2B shows results of exemplary trained machine learning classifier algorithms in the FIG.
- FIG.3A is a ROC plot showing performance of eight machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.3B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 182 gene features to distinguish malignant lung nodules versus benign lung nodules.
- FIG.4A is a ROC plot showing performance of machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.4B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.4A.
- FIG.5A is a ROC plot showing performance of eight machine learning classifiers using a set of 175 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.5B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.5A.
- Attorney Docket No.225234-718601/PCT [0076]
- FIG.6A is a ROC plot showing performance of machine learning classifiers using a set of 62 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.6B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.6A.
- FIG.7A is a ROC plot showing performance of machine learning classifiers using a set of 295 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.7B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.7A.
- FIG.8A is a ROC plot showing performance of machine learning classifiers using a set of 175 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.8B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.8A.
- FIG.9A is a cumulative fraction of lung nodules predicted by a logistic regression classifier using a set of 175 gene features.
- FIG.9B is a cumulative fraction of lung nodules predicted by a gradient boosting classifier using a set of 175 gene features.
- FIG.10 illustrates an overview of an example method 1000 for assessing a solid tumor of a subject.
- FIG.11 shows a computer system 1101 that is programmed or otherwise configured to implement methods provided herein.
- FIG.12 shows the correlation plot of the 8 clinical characteristics features listed in Table 6.
- FIG.13A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features listed in Table 6, to distinguish malignant lung nodules versus benign lung nodules (in 152 patients).
- FIG.13B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features (Table 6), to distinguish malignant lung nodules versus benign lung nodules.
- FIG.13C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.13A.
- FIG.13D shows feature importance of the 8 clinical characteristics features (Table 6) for the 9 machine learning classifiers.
- FIG.13E shows feature importance of the 8 clinical characteristics features for all the 9 classifiers.
- the 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
- FIG.14A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of 4 clinical characteristics features, NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (nodule spiculated), to distinguish malignant lung nodules versus benign lung nodules.
- FIG.14B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.14C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.14A.
- FIG.14D shows feature importance of the 4 clinical characteristics features for the 9 machine learning classifiers.
- FIG.14E shows feature importance of the 4 clinical characteristics features for all the 9 classifiers.
- the 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
- FIG.15A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of 9 clinical characteristics features (8 features in Table 6 and cancer history) to distinguish malignant lung nodules versus benign lung nodules.
- FIG.15B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.15C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.15A.
- FIG.15D shows feature importance of the 9 clinical characteristics features for the 9 machine learning classifiers.
- FIG.15E shows feature importance of the 9 clinical characteristics features for all the 9 classifiers.
- the 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
- FIG.16A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 142 gene features (Table 5), and clinical characteristics data of 3 clinical features (NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE), to distinguish malignant lung nodules versus benign lung nodules.
- FIG.16B shows Precision/Recall curve of the 9 machine learning classifiers using gene expression data of the 142 gene features, and clinical characteristics data of 3 clinical features, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.16C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A.
- FIG.16D shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules).
- the 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
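The oversampling correction mentioned for FIG.16D can be pictured with a minimal sketch. The source does not say which oversampling technique was applied; the example below assumes plain random oversampling with replacement. The function name `random_oversample`, the patient identifiers, and the starting class counts are hypothetical and chosen only so that the balanced result has 80 samples per class.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every class
    matches the size of the largest class.

    Assumption: plain random oversampling with replacement; the source does
    not name the technique actually used.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy cohort (illustrative counts only): 80 malignant, 50 benign.
samples = [f"patient_{i}" for i in range(130)]
labels = ["malignant"] * 80 + ["benign"] * 50
balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
```

In practice, oversampling of this kind is usually applied to the training split only, so that duplicated samples do not leak into the evaluation data.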
- FIG.17A shows ROC plots showing performance of the 9 machine learning classifiers using measurement data of the 34 predictors (Table 7), to distinguish malignant lung nodules versus benign lung nodules.
- FIG.17B shows Precision/Recall curve of the 9 machine learning classifiers using measurement data of the 34 predictors, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.17C shows the tabulated results of the machine learning classifiers LOG and RF corresponding to FIG.17A.
- FIG.17D shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.17A, with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules).
- FIG.17E shows feature importance of the 34 clinical characteristics features for all the 9 classifiers.
- FIG.18A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 175 gene features (Table 2), and clinical characteristics data of 4 clinical features (NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (nodule spiculated)), to distinguish malignant lung nodules versus benign lung nodules.
- NCNSZE nodule size
- NCNUPYN nodule in the upper lobe
- AGE AGE
- NCNMYN Nodule Spiculated
- FIG.18B shows Precision/Recall curve of the 9 machine learning classifiers using gene expression data of the 175 gene features, and clinical characteristics data of 4 clinical features, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.18C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.18A.
- the 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
- the subject may be a person (e.g., a patient) with a cancer, a benign solid tumor, or a malignant solid tumor; or a person that has been treated for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is being monitored for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is suspected of having a cancer, a benign solid tumor, or a malignant solid tumor; or a person that does not have or is not suspected of having a cancer, a benign solid tumor, or a malignant solid tumor.
- the term “patient,” as used herein, generally refers to a human patient.
- the patient may be a person with a cancer, a benign solid tumor, or a malignant solid tumor; or a person that has been treated for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is being monitored for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is suspected of having a cancer, a benign solid tumor, or a malignant solid tumor; or a person that does not have or is not suspected of having a cancer, a benign solid tumor, or a malignant solid tumor.
- the cancer can be pancreatic cancer, ovarian cancer, and brain cancer
- the solid tumor can be a pancreatic tumor, ovarian tumor, and brain tumor respectively.
- the term “composite data set” or “composite” refers to or is associated with a data set comprising different data sets.
- the composite data set comprises one or more data sets, each from different sources.
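As a rough illustration of a composite data set, the sketch below joins a gene expression data set and a clinical characteristics data set on a shared patient identifier. The function name `build_composite`, the `gene:`/`clin:` key prefixes, the patient identifiers, and all numeric values are hypothetical; only the gene symbols (BCAT1, TDRD9) and clinical feature codes (AGE, NCNSZE) are taken from the text.

```python
def build_composite(gene_expression, clinical):
    """Join two per-patient data sets on patient ID.

    Only patients present in both sources are kept, and each feature name is
    prefixed with its source so the origin stays visible in the composite row.
    """
    composite = {}
    for patient_id in sorted(gene_expression.keys() & clinical.keys()):
        row = {}
        row.update({f"gene:{g}": v for g, v in gene_expression[patient_id].items()})
        row.update({f"clin:{c}": v for c, v in clinical[patient_id].items()})
        composite[patient_id] = row
    return composite

# Hypothetical values for illustration only.
gene_expression = {
    "P1": {"BCAT1": 5.2, "TDRD9": 1.1},
    "P2": {"BCAT1": 3.9, "TDRD9": 2.4},
}
clinical = {
    "P1": {"AGE": 67, "NCNSZE": 14.0},
    "P3": {"AGE": 59, "NCNSZE": 8.0},
}
composite = build_composite(gene_expression, clinical)
```

Here only `P1` appears in both sources, so the composite contains a single row combining its expression and clinical values.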
- Step a can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- Step b can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor.
- Step c can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor.
- Step d can include electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
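Steps a through d can be sketched end to end as follows. This is a minimal illustration, not the trained model of the disclosure: the logistic form, the coefficients in `WEIGHTS` and `INTERCEPT`, the 0.5 decision threshold, the function names, and the input values are all hypothetical stand-ins; only the gene symbols and clinical feature codes come from the text.

```python
import math

# Hypothetical, hand-picked coefficients standing in for a trained model.
WEIGHTS = {"BCAT1": 0.8, "TDRD9": -0.5, "AGE": 0.03, "NCNSZE": 0.1}
INTERCEPT = -6.0

def obtain_data_set(gene_expression, clinical=None):
    """Step a: gene expression of at least two genes plus optional clinical data."""
    if len(gene_expression) < 2:
        raise ValueError("expected expression measurements for at least two genes")
    data_set = dict(gene_expression)
    data_set.update(clinical or {})
    return data_set

def infer(data_set):
    """Steps b and c: provide the data set to the model and receive its inference."""
    z = INTERCEPT + sum(WEIGHTS.get(name, 0.0) * value for name, value in data_set.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    label = "malignant" if probability >= 0.5 else "benign"
    return {"label": label, "probability": probability}

def report(patient_id, inference):
    """Step d: build the text of an electronically output report."""
    return (
        f"Patient {patient_id}: solid tumor classified as "
        f"{inference['label']} (score {inference['probability']:.2f})"
    )

# Hypothetical input values for illustration only.
data_set = obtain_data_set({"BCAT1": 5.2, "TDRD9": 1.1}, {"AGE": 67, "NCNSZE": 14.0})
inference = infer(data_set)
print(report("P1", inference))
```

Any of the classifier families named in the figures could replace the logistic stand-in in `infer`; the surrounding steps would stay the same.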
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, and the like.
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor (e.g., a pancreatic, ovarian, kidney, or brain tumor) as benign or malignant can be obtained or determined according to the methods (e.g., the method of steps a’, b’, c’ and/or d’) described herein.
- the solid tumor is a lung tumor
- the data set of step a contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 4.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 1.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 2.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 3.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 5.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 7.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 8.
- the at least two lung disease-associated genes, e.g., of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range therebetween, genes selected from the group of genes listed in Table 1.
- the at least two lung disease-associated genes, e.g., of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range therebetween, genes selected from the group of genes listed in Table 2.
- the at least two lung disease-associated genes, e.g., of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range therebetween, genes selected from the group of genes listed in Table 3.
- the at least two lung disease-associated genes, e.g., of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range therebetween, genes selected from the group of genes listed in Table 4.
- the at least two lung disease-associated genes include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5.
- the at least two lung disease-associated genes, e.g., of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the at least two lung disease-associated genes of step a are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the at least two lung disease-associated genes include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the at least two lung disease-associated genes of step a are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- Table A provides examples of Gene ID numbers for genes listed herein in the Tables, including Tables 7 and 8, as described in OMIM® - Online Mendelian Inheritance in Man (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD) and in the National Center for Biotechnology Information gene database (NCBI, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA), each incorporated by reference herein in its entirety.
- Table A. Selected Genes: Example Gene ID Numbers (columns: Entrez Gene ID, Predictor, OMIM No.)
- the solid tumor is a lung tumor, and the one or more clinical characteristics include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include size of the tumor (e.g., lung nodule).
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the patient include age of the patient.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor
- the one or more clinical characteristics include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the at least two lung disease-associated genes of step a comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the at least two lung disease-associated genes of step a consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor
- the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- PBMCs peripheral blood mononuclear cells
- CSF cerebrospinal fluid
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof.
- the biological sample is a stool sample, or any derivative thereof.
- the biological sample is a CSF sample, or any derivative thereof.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 90 %, about 85 % to about
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 %
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 %, about 85 % to about 92
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
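The specificity, positive predictive value, and negative predictive value recited above can all be derived from a confusion matrix. The following is a minimal sketch, assuming binary labels (1 = malignant, 0 = benign); the function name and the ten example labels are hypothetical and are not taken from the disclosure.

```python
def classification_metrics(y_true, y_pred):
    """Return sensitivity, specificity, PPV and NPV for binary labels.

    Assumed label convention: 1 = malignant, 0 = benign.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant called malignant
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign called benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign called malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant called benign

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical true labels and model calls for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this made-up example, three of the four positive calls are correct, so the positive predictive value is 0.75, and five of the six negative calls are correct, so the negative predictive value is 5/6.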
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- the machine learning model, e.g., of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8 to about 1.
- the machine learning model e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.99, about 0.85 to
- the machine learning model can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
- the machine learning model e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995.
- the machine learning model e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
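One way to compute the area under the ROC curve is the pairwise-ranking form: the fraction of (malignant, benign) sample pairs in which the malignant sample receives the higher score, with ties counted as one half. The sketch below assumes that form; the function name, labels, and scores are hypothetical illustrations, not values from the disclosure.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Assumed label convention: 1 = malignant, 0 = benign.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one malignant and one benign sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # malignant sample ranked above benign sample
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for three malignant and three benign samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine malignant/benign pairs are ordered correctly, so the AUC is 8/9.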
- the inference from the machine learning model can include a confidence value between 0 and 1, such as, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the solid tumor is malignant. Higher confidence values may be correlated with a higher likelihood that the solid tumor is malignant.
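A confidence value between 0 and 1 can be turned into a benign or malignant call by comparing it with a cutoff. The sketch below is illustrative only: the function name and the default cutoff of 0.5 are assumptions, and the disclosure does not specify a particular threshold.

```python
def call_from_confidence(confidence, threshold=0.5):
    """Map a malignancy confidence in [0, 1] to a class call.

    The threshold of 0.5 is a hypothetical default, not a value from the disclosure.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    return "malignant" if confidence >= threshold else "benign"


high_call = call_from_confidence(0.8)
low_call = call_from_confidence(0.2)
```

Raising the cutoff trades sensitivity for specificity, so in practice it would be chosen against the performance targets listed above.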
- a malignant tumor may be characterized by having the ability to metastasize or grow invasively, which may be in contrast to a benign tumor.
- the patient has a cancer. In some embodiments, the patient does not have cancer. In some embodiments, the patient is suspected of having a cancer. In some embodiments, the patient is at an elevated risk of having a cancer. In some embodiments, the patient is asymptomatic for a cancer. Cancer can be lung cancer, pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. In some embodiments, the patient has pancreatic cancer. In some embodiments, the patient does not have pancreatic cancer. In some embodiments, the patient is suspected of having pancreatic cancer.
- the patient is at an elevated risk of having a pancreatic cancer. In some embodiments, the patient is asymptomatic for pancreatic cancer. In some embodiments, the patient has ovarian cancer. In some embodiments, the patient does not have ovarian cancer. In some embodiments, the patient is suspected of having ovarian cancer. In some embodiments, the patient is at an elevated risk of having ovarian cancer. In some embodiments, the patient is asymptomatic for ovarian cancer. In some embodiments, the patient has kidney cancer. In some embodiments, the patient does not have kidney cancer. In some embodiments, the patient is suspected of having kidney cancer. In some embodiments, the patient is at an elevated risk of having a kidney cancer.
- the patient is asymptomatic for kidney cancer. In some embodiments, the patient has brain cancer. In some embodiments, the patient does not have brain cancer. In some embodiments, the patient is suspected of having brain cancer. In some embodiments, the patient is at an elevated risk of having brain cancer. In some embodiments, the patient is asymptomatic for brain cancer. In some embodiments, the patient has a lung cancer. In some embodiments, the patient does not have lung cancer. In some embodiments, the patient is suspected of having a lung cancer. In some embodiments, the patient is at an elevated risk of having a lung cancer. In some embodiments, the patient is asymptomatic for a lung cancer.
- the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, a biopsy is performed. In some embodiments, a biopsy is not performed. The decision to perform a biopsy may be made by one of skill in the art, based on knowledge and experience, in view of the classification of the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
- the method further contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In some embodiments, the method contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient.
- the treatment can include one or more treatments of cancer.
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the treatment can be a treatment for the lung, pancreatic, ovarian, kidney, or brain cancer, respectively.
- the method includes administering a treatment to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor or the benign tumor.
- the method comprises administering a treatment of pancreatic cancer to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor.
- the method includes administering a treatment to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of ovarian cancer to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of kidney cancer to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor.
- the method includes administering a treatment to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of brain cancer to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of lung cancer to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor.
- the treatment can include one or more treatments of cancer.
- the machine learning model of step b can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant.
- the machine-learning model e.g. of step b, can generate the inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor, by comparing the data set to a reference data set.
- the machine-learning model can be trained using the reference data set according to the methods described herein.
- the reference data set can contain gene expression measurements, in a plurality of reference biological samples from a plurality of reference subjects having a solid tumor, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, data regarding whether the solid tumors of the reference subjects are benign or malignant, and optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- a first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
- the reference data set contains a plurality of individual reference data sets.
- a respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of a reference biological sample from a reference subject having a reference solid tumor of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject.
- the plurality of individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different individual reference data sets are obtained from different reference subjects.
- each of the individual reference data sets contains i) gene expression measurements of a reference biological sample from one reference subject of the at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the lung nodule of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects.
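The structure of the reference data set described above (one individual reference data set per reference subject, each holding expression measurements for at least two genes, a benign/malignant label, and optional clinical characteristics) can be sketched as plain records. All names below, including the gene identifiers `GENE_A` and `GENE_B` and the field names, are hypothetical placeholders and the values are made up.

```python
def make_individual_reference(subject_id, gene_expression, is_malignant, clinical=None):
    """Build one individual reference data set as a plain dict."""
    if len(gene_expression) < 2:
        raise ValueError("expected expression measurements for at least two genes")
    return {
        "subject_id": subject_id,
        "gene_expression": dict(gene_expression),  # gene name -> measurement
        "is_malignant": bool(is_malignant),        # benign/malignant label
        "clinical": dict(clinical or {}),          # optional clinical characteristics
    }


def build_reference_data_set(individuals):
    """Collect individual reference data sets, one per distinct reference subject."""
    seen = set()
    for item in individuals:
        if item["subject_id"] in seen:
            raise ValueError("each individual reference data set must come from a different subject")
        seen.add(item["subject_id"])
    return list(individuals)


# Hypothetical reference subjects with made-up measurements.
reference = build_reference_data_set([
    make_individual_reference("subject-1", {"GENE_A": 5.2, "GENE_B": 1.1}, True, {"age": 67}),
    make_individual_reference("subject-2", {"GENE_A": 2.0, "GENE_B": 3.4}, False),
])
n_malignant = sum(1 for item in reference if item["is_malignant"])
```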
- oversampling or undersampling correction can be made during training of the machine learning model.
- where a reference data set includes a greater number of samples identified as benign and relatively fewer samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples.
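The oversampling correction described above can be sketched as random oversampling with replacement: samples from the smaller class are redrawn until both classes have the same count. This is one common approach and is an assumption here, since the disclosure does not name a specific oversampling technique; the function name, seed, and example values are hypothetical.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra samples with replacement from the existing group.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalanced reference data: five benign samples, one malignant.
samples = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
labels = ["benign"] * 5 + ["malignant"]
balanced_samples, balanced_labels = oversample_minority(samples, labels)
```

Oversampling should be applied only to the training portion of the reference data, so that duplicated samples do not leak into any held-out evaluation set.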
- the reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the genes of the data set and genes of the reference data set can at least partially overlap.
- clinical characteristics of the data set and clinical characteristics of the reference data set can at least partially overlap.
- the gene set capable of classifying a solid tumor, e.g., a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant can be obtained or determined according to the methods (e.g., the method of steps a’, b’, c’ and/or d’) described herein.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the solid tumor is a lung tumor, the at least two genes of the reference data set comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set comprise the size of the nodule, the age of the patient, and the presence of the nodule in the lung upper lobe.
- the at least two genes of the reference data set consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the age of the patient.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule, the age of the patient, the presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include the size of the nodule, the age of the patient, the presence of the nodule in the lung upper lobe, or any combination thereof.
- the genes of the data set and genes of the reference data set can at least partially overlap and/or the optional clinical characteristics of the data set and optional clinical characteristics of the reference data set can at least partially overlap.
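Because the genes and clinical characteristics of the patient data set only need to partially overlap with those of the reference data set, a practical step is to restrict the patient data to the shared features, in the reference order, before comparison. The sketch below assumes that approach; the function name, gene identifiers, and the `nodule_size_mm` field are hypothetical.

```python
def align_to_reference(patient_values, reference_features):
    """Keep only features present in both data sets, in reference order.

    Returns the shared feature names and the matching patient values.
    """
    shared = [name for name in reference_features if name in patient_values]
    vector = [patient_values[name] for name in shared]
    return shared, vector


# Hypothetical feature names and made-up patient measurements.
reference_features = ["GENE_A", "GENE_B", "GENE_C", "nodule_size_mm"]
patient_values = {"GENE_B": 1.4, "GENE_C": 0.8, "GENE_D": 2.2, "nodule_size_mm": 12.0}
shared, vector = align_to_reference(patient_values, reference_features)
```

In this example `GENE_A` is missing from the patient data and `GENE_D` is missing from the reference, so only three features are used.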
- the reference biological sample can be blood sample, isolated peripheral blood mononuclear cells (PBMCs), solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the reference biological sample is a blood sample or any derivative thereof.
- the reference biological sample is PBMCs or any derivative thereof.
- the reference biological sample is a lung biopsy sample, or any derivative thereof.
- the reference biological sample is a nasal fluid sample, or any derivative thereof.
- the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is CSF sample, or any derivative thereof.
- the reference subjects can be human.
- Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
- the trained machine learning model e.g. of step b, is a supervised machine learning algorithm or an unsupervised machine learning algorithm.
- the trained machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof.
- the trained machine learning model is trained using LOG.
- the trained machine learning model is trained using Ridge regression. In some embodiments, the trained machine learning model is trained using Lasso regression. In some embodiments, the trained machine learning model is trained using GLM. In some embodiments, the trained machine learning model is trained using kNN. In some embodiments, the trained machine learning model is trained using SVM. In some embodiments, the trained machine learning model is trained using GBM. In some embodiments, the trained machine learning model is trained using RF. In some embodiments, the trained machine learning model is trained using NB. In some embodiments, the trained machine learning model is trained using the EN regression. In some embodiments, the trained machine learning model is trained using neural network. In some embodiments, the trained machine learning model is trained using deep learning algorithm.
- the trained machine learning model is trained using LDA. In some embodiments, the trained machine learning model is trained using DTREE. In some embodiments, the trained machine learning model is trained using ADB.
- the method comprises determining a likelihood of the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In some embodiments, the likelihood is about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 100%.
- the likelihood is at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
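Of the listed algorithms, logistic regression (LOG) directly yields a likelihood between 0 and 1 for the malignant class. The following is a minimal sketch of training such a model with batch gradient descent on a labeled reference data set and scoring a new sample. The learning rate, epoch count, function names, and the two-gene expression values are all hypothetical; this is not the training procedure of the disclosure.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def malignancy_likelihood(w, b, x):
    """Likelihood (0 to 1) that a sample with features x is malignant."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up expression values for two hypothetical genes; 1 = malignant, 0 = benign.
X = [[2.0, 0.5], [1.8, 0.7], [2.2, 0.4], [0.3, 1.9], [0.5, 2.1], [0.4, 1.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X, y)
likelihood = malignancy_likelihood(w, b, [2.1, 0.6])
```

The returned likelihood can be reported as a percentage, and the same interface would apply if a different listed algorithm that outputs class probabilities were substituted.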
- the method further comprises monitoring the solid tumor of the patient, wherein the monitoring comprises assessing the solid tumor of the patient at a plurality of time points.
- a difference in the assessment of the solid tumor of the patient among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the solid tumor of the patient, (ii) a prognosis of the solid tumor of the patient, and (iii) an efficacy or non-efficacy of a course of treatment for treating the solid tumor of the patient.
- the plurality of time points comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 different time points.
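The monitoring described above can be sketched as a comparison of the classification likelihood across time points. This is a minimal illustration only: the `Assessment` record, the use of days as the time unit, and the example values are assumptions made for the sketch and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """One assessment of the patient's solid tumor (hypothetical record)."""
    day: int                      # days since the first assessment (assumed unit)
    malignant_likelihood: float   # model output in [0, 1]


def likelihood_changes(assessments):
    """Return (day, change) pairs between consecutive time points."""
    ordered = sorted(assessments, key=lambda a: a.day)
    return [
        (later.day, later.malignant_likelihood - earlier.malignant_likelihood)
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Made-up values for three time points.
history = [Assessment(0, 0.62), Assessment(90, 0.55), Assessment(180, 0.71)]
changes = likelihood_changes(history)
```

How a given change maps to a diagnosis, prognosis, or treatment efficacy is a clinical question that this sketch does not address.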
- the present disclosure provides a method for determining a gene set capable of classifying a solid tumor as benign or malignant.
- Gene expression measurements of one or more genes of the gene set of a biological sample (e.g. blood) from a patient can be used to classify a solid tumor of the patient as benign or malignant without performing a biopsy of the solid tumor.
- a biopsy of the solid tumor can be performed to confirm and/or follow-up the classification results obtained by using the gene expression measurement data.
- a biopsy of the solid tumor is not performed.
- the method can include any one of, any combination of, or all of steps a’, b’, c’ and d’.
- a reference data set can be obtained and/or provided.
- the reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- the reference data set can contain a plurality of individual reference data sets.
- a respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference tumor is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the reference subject.
- the plurality of individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different individual reference data sets can be obtained from different reference subjects.
- each of the individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects.
- a first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
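The structure of the reference data set described above can be illustrated with a small data model. This is a sketch under stated assumptions: the class name `IndividualReferenceDataSet`, its field names, the gene symbols, and the values are all hypothetical placeholders rather than anything specified in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IndividualReferenceDataSet:
    """Data for one reference subject (hypothetical layout)."""
    subject_id: str
    gene_expression: Dict[str, float]   # i) gene -> expression measurement
    is_malignant: bool                  # ii) label of the reference solid tumor
    clinical: Dict[str, float] = field(default_factory=dict)  # iii) optional


def label_counts(reference_data_set):
    """Count benign and malignant reference subjects."""
    malignant = sum(1 for r in reference_data_set if r.is_malignant)
    return {"benign": len(reference_data_set) - malignant, "malignant": malignant}


# Placeholder gene names and made-up measurements.
reference_data_set = [
    IndividualReferenceDataSet("S1", {"GENE_A": 5.1, "GENE_B": 0.3}, False),
    IndividualReferenceDataSet("S2", {"GENE_A": 7.8, "GENE_B": 1.9}, True, {"age": 64}),
    IndividualReferenceDataSet("S3", {"GENE_A": 4.9, "GENE_B": 0.4}, False),
]
counts = label_counts(reference_data_set)
```

Keeping one record per reference subject mirrors the statement that different individual reference data sets are obtained from different reference subjects.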
- a machine learning model can be trained using the reference data set to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics.
- the machine learning model can be trained using a training data set containing a first portion of the reference data set, and a validation data set containing a second portion of the reference data set.
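One way to form the training and validation portions is a split that keeps the benign-to-malignant ratio roughly the same in both. The sketch below assumes stratification, a 25 % validation fraction, and a fixed random seed; none of these choices is stated in the disclosure, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Return (train_idx, val_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # At least one validation sample per class (assumption of this sketch).
        n_val = max(1, round(len(idx) * validation_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)


# 0 = benign, 1 = malignant; made-up labels for twelve reference subjects.
labels = [0] * 8 + [1] * 4
train_idx, val_idx = stratified_split(labels)
```

The returned index lists are disjoint, so no reference subject appears in both portions.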
- oversampling or undersampling correction can be made during training of the machine learning model.
- for example, where a data set includes a greater number of samples identified as benign and relatively fewer samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples.
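The oversampling correction can be sketched as random duplication of minority-class samples until the classes are balanced. Random duplication with replacement is an assumption here, since the disclosure does not specify which oversampling technique is used, and `oversample_minority` is a hypothetical helper.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Made-up one-gene expression vectors: four benign, two malignant.
samples = [[1.0], [1.2], [0.9], [1.1], [3.0], [3.2]]
labels = ["benign"] * 4 + ["malignant"] * 2
balanced_samples, balanced_labels = oversample_minority(samples, labels)
```

In practice such a correction is usually applied to the training portion only, so that duplicated samples do not also appear in the validation data.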
- feature importance values of the plurality of genes can be determined.
- the gene set can be selected.
- the gene set can be selected as predictors that are used to train the machine learning model.
- the gene set may be selected based at least in part on feature importance values.
- the feature importance values of the genes of the gene set are within the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, feature importance values of the plurality of genes.
- the feature importance of the genes of the gene set can have an accuracy greater than 30 %, 35 %, 40 %, 45 %, 50 %, 55 %, 60 %, 65 %, 70 %, 75 %, 80 %, or 90 %. In some embodiments, the feature importance of the genes of the gene set can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or 90.
- the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the machine learning model include the genes of the gene set.
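Selecting a gene set from feature importance values can be sketched as a ranking followed by a top-N cut, an importance threshold, or both. The function `select_gene_set`, the gene names, and the importance scores below are hypothetical, and the 0 to 100 importance scale is an assumption.

```python
def select_gene_set(importances, top_n=None, min_importance=None):
    """Rank genes by importance, then apply an optional threshold and top-N cut."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    if min_importance is not None:
        # Keep only genes whose importance exceeds the threshold.
        ranked = [(gene, value) for gene, value in ranked if value > min_importance]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [gene for gene, _ in ranked]


# Made-up importance values on an assumed 0-100 scale.
importances = {"GENE_A": 92.0, "GENE_B": 41.5, "GENE_C": 77.3, "GENE_D": 12.0}
top_two = select_gene_set(importances, top_n=2)
above_40 = select_gene_set(importances, min_importance=40)
```

The two criteria can be combined, for example keeping at most 50 genes whose importance exceeds 30.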
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 9. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor
- the one or more clinical characteristics of the reference data set of step a’ include 1, 2, 3, 4, 5, 6, 7, or 8, clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the reference data set of step a’, include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor
- the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to ovarian cancer.
- the solid tumor is an ovarian tumor
- the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to brain cancer.
- the solid tumor is a brain tumor, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor
- the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to brain cancer.
- genes having collinear expression with correlation coefficients e.g. in non-limiting aspects > 0.7 to > 0.9
- Collinear gene expression can be measured by any suitable technique, e.g. Pearson correlation coefficient. While determining feature importance values for the plurality of genes has been described, this is merely a non-limiting illustrative example of a technique that can be practiced.
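Measuring collinear gene expression with the Pearson correlation coefficient can be sketched as a pairwise comparison of expression profiles across samples. The sketch only flags gene pairs above a threshold. The use of the absolute value of the coefficient, the 0.9 threshold (one end of the > 0.7 to > 0.9 range mentioned above), and the helper names are assumptions, and what is then done with flagged genes is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def collinear_pairs(expression, threshold=0.9):
    """Return (gene1, gene2, r) for gene pairs with |r| above the threshold."""
    genes = sorted(expression)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if abs(r) > threshold:  # absolute value is an assumption
                pairs.append((g1, g2, r))
    return pairs


# Made-up expression values for three placeholder genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = collinear_pairs(expression)
```

With these values only the GENE_A and GENE_B pair is flagged; GENE_C correlates weakly with both.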
- one or more feature selection techniques are used to determine the gene set that can classify a solid tumor as benign or malignant.
- Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof.
- feature importance values need not be calculated for each of the genes of the plurality of genes.
- the reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof. The reference subjects can be human.
- the machine learning model can be trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm.
- the machine learning model, e.g. of step b’, is trained using linear regression, logistic regression, Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof.
- the machine learning model is trained using logistic regression.
- the machine learning model is trained using Ridge regression. In some embodiments, the machine learning model is trained using Lasso regression. In some embodiments, the machine learning model is trained using GLM. In some embodiments, the machine learning model is trained using kNN. In some embodiments, the machine learning model is trained using SVM. In some embodiments, the machine learning model is trained using GBM. In some embodiments, the machine learning model is trained using RF. In some embodiments, the machine learning model is trained using NB. In some embodiments, the machine learning model is trained using the EN regression. In some embodiments, the machine learning model is trained using neural network. In some embodiments, the machine learning model is trained using deep learning algorithm. In some embodiments, the machine learning model is trained using LDA.
- the machine learning model is trained using DTREE. In some embodiments, the machine learning model is trained using ADB. [0126]
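As one illustration of the listed algorithms, logistic regression can be trained by gradient descent on gene expression features. This is a minimal sketch and not the disclosed implementation: the learning rate, epoch count, optional L2 (ridge-style) penalty `l2`, the function names, and the two-gene toy data are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, l2=0.0, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The L2 penalty shrinks the weights but not the bias.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def malignant_likelihood(w, b, x):
    """Model-estimated likelihood that a sample is malignant."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Made-up expression of two placeholder genes; 0 = benign, 1 = malignant.
X = [[0.2, 1.1], [0.4, 0.9], [0.3, 1.3], [1.8, 0.2], [2.1, 0.4], [1.6, 0.1]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y, l2=0.01)
p_benign_like = malignant_likelihood(w, b, [0.3, 1.0])
p_malignant_like = malignant_likelihood(w, b, [2.0, 0.3])
```

The output of `malignant_likelihood` plays the role of the likelihood of the classification; a cutoff such as 0.5 turns it into a benign or malignant call.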
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 90 %, about 85 % to about
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 90 %, about 85 % to about
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 90 %, about 85 % to
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 90 %, about 85 % to
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
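The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value referred to above follow from the four cells of a confusion matrix. The sketch below treats malignant as the positive class, which is an assumption, and uses made-up labels; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred, positive="malignant"):
    """Compute standard binary classification metrics from labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


# Made-up labels: four malignant and six benign reference tumors.
y_true = ["malignant"] * 4 + ["benign"] * 6
y_pred = ["malignant", "malignant", "malignant", "benign",
          "benign", "benign", "benign", "benign", "benign", "malignant"]
metrics = classification_metrics(y_true, y_pred)
```

Here three of four malignant tumors and five of six benign tumors are classified correctly, giving an accuracy of 80 % and a sensitivity of 75 %.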
- [0131]
- the present disclosure provides a method for developing a trained machine learning model capable of classifying a solid tumor of a patient, as benign or malignant. The method can include any one of, any combination of, or all of steps a”, b”, c”, d” and e”.
- Step a” can include obtaining and/or providing a first reference data set.
- the first reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- the first reference data set can contain a plurality of first individual reference data sets.
- a respective first individual reference data set of the plurality of first individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of the one or more clinical characteristics of the reference subject.
- the plurality of first individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different first individual reference data sets are obtained from different reference subjects.
- each of the first individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the reference solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different first individual reference data sets are obtained from different reference subjects.
- a first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
- a first machine learning model can be trained using the first reference data set to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics.
- the first machine learning model can be trained to infer whether the solid tumor is benign or malignant, based at least in part on the measurement data of the plurality of genes, and optionally the clinical characteristics data of the one or more clinical characteristics.
- the first machine learning model can be trained using a training data set containing a first portion of the first reference data set, and a validation data set containing a second portion of the first reference data set.
- In step c’, feature importance values of one or more predictors of the first machine learning model can be determined.
- In step d’, A predictors of the first machine learning model can be selected based at least in part on the feature importance values, where A can be an integer from 3 to 2000, such as 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600,
- the A predictors can have the top A feature importance values. For example, in a non-limiting aspect, A can be 10, and the 10 predictors having the 10 highest feature importance values can be selected.
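A minimal sketch of selecting the A predictors with the highest feature importance values is given below. The helper `select_top_predictors`, the predictor names, and the importance values are hypothetical; in practice the values would come from the first machine learning model.

```python
def select_top_predictors(importances, a):
    """Return the names of the `a` predictors with the highest importance."""
    if not 1 <= a <= len(importances):
        raise ValueError("a must be between 1 and the number of predictors")
    # Sort (name, importance) pairs from most to least important.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:a]]

# Hypothetical importance values for genes and clinical characteristics.
importances = {
    "GENE_A": 0.31,
    "GENE_B": 0.05,
    "nodule_size": 0.22,
    "GENE_C": 0.18,
    "age": 0.09,
    "GENE_D": 0.15,
}
top3 = select_top_predictors(importances, 3)
```

Here `top3` contains `GENE_A`, `nodule_size`, and `GENE_C`, showing that the selected predictors can mix genes and clinical characteristics.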
- the feature importance of the A predictors can have an accuracy greater than 30 %, 35 %, 40 %, 45 %, 50 %, 55 %, 60 %, 65 %, 70 %, 75 %, 80 %, or 90 %.
- the feature importance of the A predictors can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or 90.
- the A predictors form top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model.
- the A predictors can include one or more genes and, optionally, one or more clinical characteristics. While determining feature importance values for the one or more predictors has been described, this is merely a non-limiting illustrative example of a technique that can be practiced.
- one or more feature selection techniques are used to determine the A predictors.
- Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof. In some of these cases, in step c”, feature importance values need not be calculated for each of the predictors of the first machine learning model.
- Step e can include training a second machine learning model based at least in part on a second reference data set to obtain the trained machine learning model.
- the second reference data set can contain i) measurement data of the A predictors of the reference subjects, and ii) data regarding whether the solid tumors of the reference subjects are benign or malignant.
- the second reference data set can contain a plurality of second individual reference data sets.
- a respective second individual reference data set of the plurality of second individual reference data sets can include i) measurement data of the A predictors of a reference subject, and ii) data regarding whether the solid tumor of the reference subject is benign or malignant.
- the plurality of second individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different second individual reference data sets are obtained from different reference subjects.
- each of the second individual reference data sets contains i) measurement data of the A predictors of one reference subject, and ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, wherein different second individual reference data sets are obtained from different reference subjects.
- Measurement data of the A predictors can include gene expression measurements in the reference sample of the one or more gene features of the A predictors, and/or optionally clinical characteristics data of one or more clinical characteristic features of the A predictors.
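One way to picture the second reference data set is as the first reference data set restricted to the A selected predictors, with the benign/malignant label kept for each reference subject. The record layout (`subject`, `measurements`, `label`) and all values below are assumptions for illustration only.

```python
def build_second_reference_set(first_reference_set, selected_predictors):
    """Keep only the selected predictors and the label for each subject."""
    second = []
    for record in first_reference_set:
        measurements = {
            name: record["measurements"][name] for name in selected_predictors
        }
        second.append({
            "subject": record["subject"],
            "measurements": measurements,
            "label": record["label"],
        })
    return second

# Hypothetical individual reference data sets, one per reference subject.
first_reference_set = [
    {"subject": "S1",
     "measurements": {"GENE_A": 5.2, "GENE_B": 1.1, "age": 64},
     "label": "malignant"},
    {"subject": "S2",
     "measurements": {"GENE_A": 2.0, "GENE_B": 0.9, "age": 51},
     "label": "benign"},
]
second_reference_set = build_second_reference_set(
    first_reference_set, ["GENE_A", "age"]
)
```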
- the trained machine learning model can infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors.
- the one or more gene features of the A predictors can form the gene set capable of classifying a solid tumor as benign or malignant. In certain embodiments, oversampling or undersampling correction can be made during training of the first and/or second machine learning model.
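The text does not specify how the oversampling correction is carried out. As one simple possibility, random oversampling duplicates records of the smaller class until both classes are equally represented. The function `oversample_minority` and the record layout are hypothetical, and such a correction would normally be applied to the training portion only.

```python
import random
from collections import Counter

def oversample_minority(records, seed=0):
    """Randomly duplicate records of smaller classes until all classes
    have as many records as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for record in records:
        by_label.setdefault(record["label"], []).append(record)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra records with replacement from the same class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Hypothetical imbalanced training records: 6 benign, 2 malignant.
records = (
    [{"subject": f"B{i}", "label": "benign"} for i in range(6)]
    + [{"subject": f"M{i}", "label": "malignant"} for i in range(2)]
)
balanced = oversample_minority(records)
counts = Counter(record["label"] for record in balanced)
```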
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor.
- the reference solid tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in Table 9.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from the group of genes listed
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression with correlation coefficients (e.g.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, as predictors.
- the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8, as predictors.
- the solid tumor is a lung tumor, and the A predictors include i) at least 2 genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof, as predictors.
- the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33 or 34 predictors selected from the group listed in Table 7.
- the solid tumor is a lung tumor, and the A predictors comprise the 34 predictors listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors consist of the 34 predictors listed in Table 7.
- the reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof.
- the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof.
- [0134] The trained machine learning model, e.g., obtained in step e, can infer whether a solid tumor is benign or malignant with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 % to about 100 %.
- the trained machine learning model, e.g., obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 92 %, about 80 % to about 94 %, about 85 % to about 95
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the trained machine learning model, e.g., obtained in step e, can infer whether a solid tumor is benign or malignant with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model, e.g., obtained in step e, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 % to about 100 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 %,
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the trained machine learning model, e.g., obtained in step e, can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model, e.g., obtained in step e, can infer whether a solid tumor is benign or malignant with a specificity of about 80 % to about 100 %.
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about
- the trained machine learning model e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
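For reference, accuracy, sensitivity, and specificity can be computed from true and inferred labels as sketched below. The sketch assumes that "malignant" is treated as the positive class, which the text does not state explicitly; the function name and the example labels are hypothetical.

```python
def classification_metrics(true_labels, predicted_labels, positive="malignant"):
    """Compute accuracy, sensitivity, and specificity for a binary classifier.

    Sensitivity is the fraction of positive (malignant) cases inferred as
    positive; specificity is the fraction of negative (benign) cases
    inferred as negative. A metric is None when its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive and predicted == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }

# Hypothetical validation labels: 4 malignant and 6 benign tumors.
truth = ["malignant"] * 4 + ["benign"] * 6
inferred = ["malignant"] * 3 + ["benign"] * 6 + ["malignant"]
metrics = classification_metrics(truth, inferred)
```

In this hypothetical example, 3 of 4 malignant tumors and 5 of 6 benign tumors are inferred correctly, giving an accuracy of 0.8, a sensitivity of 0.75, and a specificity of about 0.83.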
- the trained machine learning model e.g.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a’ are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step a”’ are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a’” include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a’”, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, the at least two lung disease-associated genes of step a’”, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a’” comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is lung cancer, the at least two lung disease-associated genes of step a’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”” consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is pancreatic cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the cancer is pancreatic cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is pancreatic cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is ovarian cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the cancer is ovarian cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is ovarian cancer, and the data set of step a”’ contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is kidney cancer, and the data set of step a”’ contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the cancer is kidney cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is kidney cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is brain cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the cancer is brain cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the cancer is brain cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g.
- the inference from the machine learning model can include a confidence value between 0 and 1, such as, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the tumor is malignant, where higher confidence values may be correlated with a higher likelihood that the tumor is malignant.
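The text does not say how the confidence value is produced. As a sketch under that caveat, many binary classifiers map an unbounded model score to a value between 0 and 1 with the logistic function and then compare it to a cutoff. The function names and the 0.5 cutoff below are assumptions for illustration.

```python
import math

def malignancy_confidence(score):
    """Map an unbounded model score to a confidence value in [0, 1].

    Uses a numerically stable form of the logistic function so that very
    large positive or negative scores do not overflow.
    """
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def classify(confidence, threshold=0.5):
    """Label a tumor as malignant when the confidence reaches the threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "malignant" if confidence >= threshold else "benign"

# Hypothetical scores from a trained model.
high = malignancy_confidence(2.0)    # higher confidence of malignancy
low = malignancy_confidence(-2.0)    # lower confidence of malignancy
```

Higher scores yield confidence values closer to 1, consistent with higher confidence values being correlated with a higher likelihood of malignancy.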
- a malignant tumor may be characterized by the ability to metastasize or grow invasively, which may be in contrast to benign nodules.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof.
- the biological sample is a stool sample, or any derivative thereof.
- the biological sample is a CSF sample, or any derivative thereof.
- the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
- the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor.
- the decision to perform a biopsy may depend on the confidence value of the inference.
- biopsy of the solid tumor of the patient is not performed.
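The dependence of the biopsy decision on the confidence value could be expressed as a simple rule, sketched below. The function name and the 0.7 cutoff are purely hypothetical; the text does not give a cutoff, and any real cutoff would be set clinically.

```python
def recommend_biopsy(malignancy_confidence, biopsy_threshold=0.7):
    """Return True when the confidence that the tumor is malignant is at
    or above the (hypothetical) cutoff for recommending a biopsy."""
    if not 0.0 <= malignancy_confidence <= 1.0:
        raise ValueError("malignancy_confidence must be between 0 and 1")
    return malignancy_confidence >= biopsy_threshold

# Hypothetical confidence values for three patients.
decisions = [recommend_biopsy(c) for c in (0.92, 0.55, 0.10)]
```

With these values, a biopsy is suggested only for the first patient; for the other two, the biopsy is not performed on the basis of this rule alone.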
- the machine-learning model, e.g., of step b”’, can generate the inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate the patient has cancer, and the patient having a benign solid tumor may indicate the patient does not have cancer.
- the machine-learning model of step b”’ can be trained according to a method described herein, e.g., according to the methods of training the machine-learning model of step b.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 80 % to about 90 %, about 85
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of about 80 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 1.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
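The performance measures recited above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) can be illustrated with a minimal sketch. The function names `binary_metrics` and `roc_auc` and the label encoding (1 for cancer or malignant, 0 for no cancer or benign) are assumptions made for illustration only; the source does not specify how these values are computed.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics. Assumed encoding: 1 = cancer/malignant, 0 = no cancer/benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC formulation is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large cohorts.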
- the treatment is configured to treat a cancer of the patient.
- Step x can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor.
- Step y can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor.
- Step z can include performing a biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor or the benign solid tumor.
- Step z can include performing a biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor. The decision to perform the biopsy may depend on a confidence value of the inference. In certain embodiments, a biopsy of the lung nodule of the patient is not performed.
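One way the biopsy decision of step z might combine the classification with a confidence value is sketched below. The `Inference` type, the `recommend_biopsy` function, and the 0.8 default threshold are all hypothetical; the source states only that the decision may depend on a confidence value and does not give a threshold or rule.

```python
from dataclasses import dataclass

VALID_LABELS = {"malignant", "benign"}


@dataclass(frozen=True)
class Inference:
    """Hypothetical container for the model output received in step y."""
    label: str         # "malignant" or "benign"
    confidence: float  # assumed to lie in [0, 1]


def recommend_biopsy(inference: Inference, min_confidence: float = 0.8) -> bool:
    """Recommend biopsy only for a malignant call at or above the (assumed) threshold."""
    if inference.label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {inference.label!r}")
    if not 0.0 <= inference.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return inference.label == "malignant" and inference.confidence >= min_confidence
```

For example, `recommend_biopsy(Inference("malignant", 0.93))` evaluates to `True`, while a benign call or a malignant call below the threshold evaluates to `False`. Any real threshold would need to be set clinically.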
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
- the solid tumor can be a lung tumor, pancreatic tumor, kidney tumor, ovarian tumor, or a brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the solid tumor is a lung tumor, and the data set of step w contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range therebetween, genes selected from the group of genes listed in Table 1.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range therebetween, genes selected from the group of genes listed in Table 2.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range therebetween, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range therebetween, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range therebetween, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the data set of step w include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics of the patient selected from the group listed in Table 6.
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the data set of step w include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor, and the data set of step w contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the data set of step w contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
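The data set of step w for the lung embodiment (31 genes plus three clinical characteristics) could be assembled into a fixed-order feature vector along the following lines. This is a sketch under stated assumptions: the gene symbols are the 31 enumerated in the text above, which are taken here to correspond to Table 7; the clinical keys `nodule_size_mm`, `age_years`, and `upper_lobe`, their units, and the function `build_feature_vector` are hypothetical names not found in the source.

```python
from typing import List, Mapping

# The 31 gene symbols enumerated in the text (assumed to match Table 7).
TABLE7_GENES = (
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C", "MGST2",
    "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2", "C6orf120", "TMEM8A",
    "ASAP1-IT2", "C15orf54", "CD101", "FNBP1", "TECR", "PROK2", "SLC35B3", "TDRD9",
    "CLHC1", "LPL", "IFITM3", "OGFOD3", "EIF2B3", "TMEM65", "MKRN3",
)

# Hypothetical keys for nodule size, patient age, and upper-lobe location (0 or 1).
CLINICAL_FEATURES = ("nodule_size_mm", "age_years", "upper_lobe")


def build_feature_vector(
    expression: Mapping[str, float], clinical: Mapping[str, float]
) -> List[float]:
    """Concatenate gene expression values and clinical values in a fixed order."""
    missing = [g for g in TABLE7_GENES if g not in expression]
    missing += [c for c in CLINICAL_FEATURES if c not in clinical]
    if missing:
        raise KeyError(f"missing features: {missing}")
    genes = [float(expression[g]) for g in TABLE7_GENES]
    clin = [float(clinical[c]) for c in CLINICAL_FEATURES]
    return genes + clin
```

Keeping the feature order fixed matters because a trained model generally expects inputs in the same order used during training.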
- the solid tumor is a pancreatic tumor, and the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor, and the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor, and the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, and the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor, and the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor, and the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor, and the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof.
- the biological sample is a stool sample, or any derivative thereof.
- the biological sample is a CSF sample, or any derivative thereof.
- the machine learning model of step x can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant.
- the machine-learning model, e.g. of step x, can be trained according to the methods described herein, e.g. as described for the machine learning model of step b.
- Certain aspects are directed to a method for determining cancer in a patient. The method can include any one of, any combination of, or all of steps w’, x’, y’ and z’.
- Step w’ can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor of the patient as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- the gene expression measurements can be obtained by assaying the biological sample.
- Step x’ can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer.
- Step y’ can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer.
- Step z’ can include electronically outputting a report indicating that the patient has, or does not have, cancer.
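Steps w’ through z’ can be read as a small pipeline: obtain the data set, pass it to a trained model, receive the inference, and output a report electronically. The sketch below illustrates that flow only. The `toy_model` is a stand-in with arbitrary weights and no biological meaning, and the function names, the 0.5 threshold, and the JSON report format are all assumptions; the source does not specify any of them.

```python
import json
import math
from typing import Callable, Mapping

# A model is assumed to map a data set to a probability of cancer in [0, 1].
Model = Callable[[Mapping[str, float]], float]


def toy_model(dataset: Mapping[str, float]) -> float:
    """Stand-in for a trained model: logistic of one shifted expression value."""
    z = dataset.get("BCAT1", 0.0) - 5.0  # arbitrary illustrative weighting
    return 1.0 / (1.0 + math.exp(-z))


def run_inference_and_report(
    dataset: Mapping[str, float], model: Model, threshold: float = 0.5
) -> str:
    """Steps x', y', z': provide input, receive the inference, emit a report."""
    probability = model(dataset)  # steps x' and y'
    if not 0.0 <= probability <= 1.0:
        raise ValueError("model output must be a probability")
    report = {
        "inference": "cancer" if probability >= threshold else "no cancer",
        "probability": round(probability, 3),
    }
    return json.dumps(report, sort_keys=True)  # step z': electronic report
```

A call such as `run_inference_and_report({"BCAT1": 7.0}, toy_model)` returns a JSON string with the inference and the rounded probability; a real deployment would substitute the trained model and a clinically validated threshold.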
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor, respectively.
- the cancer is lung cancer, and the solid tumor is a lung tumor.
- the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor.
- the cancer is ovarian cancer, and the solid tumor is an ovarian tumor.
- the cancer is kidney cancer, and the solid tumor is a kidney tumor.
- the cancer is brain cancer, and the solid tumor is a brain tumor.
- the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’ are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’ are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’ include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group listed in Table 6.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’ include the size of the nodule.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’ include the age of the patient.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’ include the presence of the nodule in the lung upper lobe.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’ include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, and the data set of step w’ contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the data set of step w’ contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’ comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’ comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
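As an illustration of how a data set combining gene expression measurements with the three named clinical characteristics might be laid out as a model input, the following is a minimal sketch and not the disclosed implementation. The three gene symbols are taken from the 31-gene list recited above; the field names (`nodule_size_mm`, `age_years`, `upper_lobe`), the units, the function name, and all numeric values are hypothetical.

```python
from typing import Dict, List

# Three symbols from the 31-gene list recited in the text (a subset, for brevity).
GENES: List[str] = ["BCAT1", "CRCP", "COA4"]
# Hypothetical field names for the three clinical characteristics named in the text.
CLINICAL: List[str] = ["nodule_size_mm", "age_years", "upper_lobe"]


def build_feature_vector(expression: Dict[str, float],
                         clinical: Dict[str, float]) -> List[float]:
    """Concatenate expression and clinical values in a fixed, reproducible order."""
    missing = [g for g in GENES if g not in expression]
    missing += [c for c in CLINICAL if c not in clinical]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    return ([float(expression[g]) for g in GENES]
            + [float(clinical[c]) for c in CLINICAL])


# Illustrative values only; upper_lobe is encoded as 1 (present) or 0 (absent).
vector = build_feature_vector(
    {"BCAT1": 5.2, "CRCP": 3.1, "COA4": 7.8},
    {"nodule_size_mm": 14.0, "age_years": 63, "upper_lobe": 1},
)
```

Fixing the feature order up front keeps the training and inference inputs aligned, which matters when the gene panel and clinical fields come from different sources.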
- the cancer is pancreatic cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the cancer is pancreatic cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is pancreatic cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is ovarian cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the cancer is ovarian cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is ovarian cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is kidney cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the cancer is kidney cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is kidney cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is brain cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the cancer is brain cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the cancer is brain cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
- the method can determine whether the patient has or does not have cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with an accuracy of about 80 % to about 100 %.
- the method can determine whether the patient has or does not have cancer with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 92 %, about 80 % to about 94 %, about 85 %
- the method can determine whether the patient has or does not have cancer with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can determine whether the patient has or does not have cancer with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a sensitivity of about 80 % to about 100 %.
- the method can determine whether the patient has or does not have cancer with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 92 %, about 80 % to about 94 %, about 85
- the method can determine whether the patient has or does not have cancer with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can determine whether the patient has or does not have cancer with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a specificity of about 80 % to about 100 %.
- the method can determine whether the patient has or does not have cancer with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 92 %, about 80 % to about 94 %, about 85
- the method can determine whether the patient has or does not have cancer with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can determine whether the patient has or does not have cancer with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a positive predictive value of about 80 % to about 100 %.
- the method can determine whether the patient has or does not have cancer with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100
- the method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 % to about 100 %.
- the method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 100 %, about 90 % to about 92 %, about 80 % to about 94 %, about
- the method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the method can determine whether the patient has or does not have cancer with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
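The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited above all follow from the four cells of a confusion matrix. The sketch below uses the standard definitions, treating "malignant" as the positive class; the function name and the counts are hypothetical and are not results from the disclosure.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: malignant called malignant, fn: malignant called benign,
    tn: benign called benign,       fp: benign called malignant.
    Assumes every denominator is non-zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Hypothetical counts for 100 patients.
metrics = diagnostic_metrics(tp=45, fp=5, tn=40, fn=10)
```

Note that PPV and NPV depend on the prevalence of malignancy in the evaluated cohort, whereas sensitivity and specificity do not.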
- the machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- the machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 1.
- the machine learning model, e.g. of step x’ can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about 0.85 to about 0.99, about 0.85
- the machine learning model e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
- the machine learning model e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995.
- the machine learning model e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
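For reference, the area under the ROC curve equals the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name, labels, and scores are hypothetical, and a library routine would normally be used for large data sets.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison (label 1 = malignant, label 0 = benign)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # malignant case ranked above benign case
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one malignant case (0.4) is ranked below one benign case (0.5).
auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.2])
```

In this toy example 8 of the 9 malignant/benign pairs are ordered correctly, so the AUC is 8/9.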
- the inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, or 1, or any value or range there between, that the patient has cancer. Higher confidence values may be correlated with a higher likelihood that the patient has cancer.
- the machine-learning model, e.g. of step x’, can generate an inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate that the patient has cancer, and the patient having a benign solid tumor may indicate that the patient does not have cancer.
- the machine-learning model can be trained according to a method described herein, e.g. according to the methods of training of the machine learning model of step b.
- the present disclosure provides a computer system for assessing a solid tumor of a subject, containing: a database or other suitable data storage system that is configured to store a dataset containing i) gene expression measurements of a biological sample obtained or derived from the subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (ii) electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- Computer-implemented methods as described herein may be executed on computer systems such as those described above.
- a computer system may comprise one or more processors and one or more memory units that collectively store computer-readable executable instructions that, as a result of execution, cause the one or more processors to collectively perform the programmed steps described above.
- a computer system as described herein may comprise an assay device communicatively coupled to a personal computer.
- the data set can be a data set (e.g. of step a) described herein.
- the biological sample can be a biological sample described herein.
- the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.
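The two programmed steps of the computer system (analyze the data set, then electronically output a report) can be sketched as follows. This is an illustrative stand-in only: the logistic scoring function, the weights, the bias, the 0.5 decision threshold, the report fields, and the subject identifier are all hypothetical and are not the trained model or report format of the disclosure.

```python
import json
import math
from typing import Dict

# Hypothetical coefficients; a real model would be trained on labeled samples.
WEIGHTS: Dict[str, float] = {"BCAT1": 0.8, "CRCP": -0.5}
BIAS = -1.0
THRESHOLD = 0.5  # assumed cut-off on the confidence value


def confidence(expression: Dict[str, float]) -> float:
    """Map expression values to a confidence between 0 and 1 (logistic function)."""
    z = BIAS + sum(w * expression[g] for g, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def make_report(subject_id: str, expression: Dict[str, float]) -> str:
    """Classify the tumor and serialize the result as a JSON report."""
    score = confidence(expression)
    label = "malignant" if score >= THRESHOLD else "benign"
    return json.dumps({
        "subject_id": subject_id,
        "classification": label,
        "confidence": round(score, 3),
    })


# Illustrative input values only.
report = make_report("S-001", {"BCAT1": 3.0, "CRCP": 1.0})
```

A graphical user interface, where present, would render the same report content rather than recompute the classification.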
- the present disclosure provides one or more non-transitory computer readable media collectively containing machine-executable code that, upon execution by one or more computer processors, causes the one or more computer processors to perform a method for assessing a solid tumor of a subject, the method containing: (a) obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from a subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; (b) analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (c) electronically outputting a report indicative of the classification of the solid tumor of the subject as the
- the data set can be a data set (e.g. of step a) described herein.
- the biological sample can be a biological sample described herein.
- the disclosure includes the use of any inventive method, system, or other composition described herein, including a gene set determined using the inventive methods, for diagnosing a cancer, or for determining and/or administering a treatment of a patient or subject having a cancer.
- Aspect 1 is directed to a method for assessing a lung nodule of a subject, comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; (b) analyzing the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (c) electronically outputting a report indicative of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
- Aspect 2 is directed to the method of aspect 1, wherein the plurality of disease-associated genomic loci comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295 genes selected from the group listed in Table 4.
- Aspect 3 is directed to the method of aspect 1 or 2, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 4 is directed to the method of any one of aspects 1 to 3, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 5 is directed to the method of any one of aspects 1 to 4, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 6 is directed to the method of any one of aspects 1 to 5, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 7 is directed to the method of any one of aspects 1 to 6, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 8 is directed to the method of any one of aspects 1 to 7, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- Aspect 9 is directed to the method of any one of aspects 1 to 8, wherein the subject has a lung cancer.
- Aspect 10 is directed to the method of any one of aspects 1 to 8, wherein the subject is suspected of having a lung cancer.
- Aspect 11 is directed to the method of any one of aspects 1 to 8, wherein the subject is at elevated risk of having a lung cancer.
- Aspect 12 is directed to the method of any one of aspects 1 to 8, wherein the subject is asymptomatic for a lung cancer.
- Aspect 36 is directed to a computer system for assessing a lung nodule of a subject, comprising: a database that is configured to store a dataset comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample obtained or derived from the subject to produce gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (ii) electronically output a report indicative of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
- Aspect 37 is directed to the computer system of aspect 36, further comprising an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.
- Aspect 38 is directed to one or more non-transitory computer readable media collectively comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing a lung nodule of a subject, the method comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; (b) analyzing the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (c) electronically
- Aspect 39 is directed to a method for assessing a lung nodule of a patient, the method comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung nodule;
- In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4.
- In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 39 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 39 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- Aspect 40 is directed to the method of aspect 39, wherein the at least two lung disease-associated genes are selected from the group of genes listed in Table 7.
- Aspect 41 is directed to the method of aspects 39 or 40, wherein the one or more clinical characteristics comprises size of the nodule, age of the patient, and presence of the nodule in the lung upper lobe.
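By way of non-limiting illustration, the data set of aspects 39 to 41 can be pictured as a fixed-order feature vector: gene expression measurements first, followed by the optional clinical characteristics. The sketch below is an assumption made for explanation only and is not taken from the disclosure; the function name `build_feature_vector`, the placeholder gene identifiers (`GENE_A`, etc.), and the clinical field names are hypothetical and do not correspond to entries in Tables 1 to 8 or Table 6.

```python
from typing import Dict, List, Optional


def build_feature_vector(
    expression: Dict[str, float],
    gene_panel: List[str],
    clinical: Optional[Dict[str, float]] = None,
    clinical_fields: Optional[List[str]] = None,
) -> List[float]:
    """Order expression values by the gene panel, then append optional clinical values."""
    # The aspects recite "at least two" lung disease-associated genes.
    if len(gene_panel) < 2:
        raise ValueError("the panel must contain at least two genes")
    missing = [gene for gene in gene_panel if gene not in expression]
    if missing:
        raise KeyError(f"missing expression measurements: {missing}")
    features = [float(expression[gene]) for gene in gene_panel]
    # Clinical characteristics are optional, so they are appended only when supplied.
    if clinical is not None and clinical_fields:
        features.extend(float(clinical[name]) for name in clinical_fields)
    return features


# Hypothetical placeholders, not genes or field names from the Tables.
panel = ["GENE_A", "GENE_B", "GENE_C"]
fields = ["nodule_size_mm", "age_years", "upper_lobe"]
measurements = {"GENE_A": 5.2, "GENE_B": 7.9, "GENE_C": 3.1}

vector = build_feature_vector(
    measurements,
    panel,
    clinical={"nodule_size_mm": 12.0, "age_years": 64, "upper_lobe": 1},
    clinical_fields=fields,
)
genes_only = build_feature_vector(measurements, panel)
```

Fixing the column order in one place keeps the training data and the patient data aligned, which matters for any of the model types a downstream classifier might use.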
- Aspect 42 is directed to the method of any one of aspects 39 to 41, wherein the machine-learning model is developed using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
- Aspect 43 is directed to the method of any one of aspects 39 to 42, wherein the patient has lung cancer.
- Aspect 44 is directed to the method of any one of aspects 39 to 42, wherein the patient does not have lung cancer.
- Aspect 45 is directed to the method of any one of aspects 39 to 42, wherein the patient is at an elevated risk of having lung cancer.
- Aspect 46 is directed to the method of any one of aspects 39 to 43 and 45, wherein the patient is asymptomatic for lung cancer.
- Aspect 47 is directed to the method of any one of aspects 39 to 43, 45 and 46, further comprising administering a treatment based on the patient’s nodule being classified as a malignant nodule.
- Aspect 48 is directed to the method of aspect 47, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- Aspect 49 is directed to the method of any one of aspects 39 to 48, wherein the inference includes a confidence value between 0 and 1 that the lung nodule is malignant.
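As one possible illustration of an inference that includes a confidence value between 0 and 1, a logistic model maps a weighted sum of the input features through a sigmoid. The sketch below assumes a logistic regression, which is only one of the model types listed in aspect 42; the weights, bias, and the 0.5 decision threshold are hypothetical values chosen for the example and are not disclosed parameters.

```python
import math
from typing import Sequence


def malignancy_confidence(
    features: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Return a value in [0, 1] interpreted as confidence that the nodule is malignant."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    score = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


def classify(confidence: float, threshold: float = 0.5) -> str:
    """Turn the confidence value into a malignant/benign call (threshold is an assumption)."""
    return "malignant" if confidence >= threshold else "benign"


# Hypothetical weights and inputs for illustration only.
confidence = malignancy_confidence([1.0, 2.0], [0.8, -0.3], bias=0.1)
label = classify(confidence)
```

Keeping the confidence value separate from the thresholded label lets the threshold be tuned later, for example to trade sensitivity against specificity.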
- Aspect 51 is directed to the method of any one of aspects 39 to 50, wherein the at least two lung disease-associated genes comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- Aspect 52 is directed to the method of any one of aspects 39 to 51, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 53 is directed to the method of any one of aspects 39 to 52, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 54 is directed to the method of any one of aspects 39 to 53, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 55 is directed to the method of any one of aspects 39 to 54, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 56 is directed to the method of any one of aspects 39 to 55, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
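For reference, the accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited in aspects 52 to 56 are all ratios of confusion-matrix counts. The sketch below uses the standard definitions with malignant treated as the positive class (an assumption about labeling); the function names and the example labels are hypothetical and are not data from the disclosure.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count outcomes with 1 = malignant (positive) and 0 = benign (negative)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def _ratio(numerator: int, denominator: int) -> float:
    # An undefined ratio (empty denominator) is reported as NaN rather than raising.
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    c = confusion_counts(y_true, y_pred)
    total = c["tp"] + c["fp"] + c["tn"] + c["fn"]
    return {
        "accuracy": _ratio(c["tp"] + c["tn"], total),
        "sensitivity": _ratio(c["tp"], c["tp"] + c["fn"]),  # true positive rate
        "specificity": _ratio(c["tn"], c["tn"] + c["fp"]),  # true negative rate
        "ppv": _ratio(c["tp"], c["tp"] + c["fp"]),          # positive predictive value
        "npv": _ratio(c["tn"], c["tn"] + c["fn"]),          # negative predictive value
    }


# Made-up labels for illustration: 3 TP, 1 FN, 3 TN, 1 FP.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Note that PPV and NPV depend on the prevalence of malignant nodules in the evaluated cohort, whereas sensitivity and specificity do not.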
- Aspect 57 is directed to the method of any one of aspects 39 to 56, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
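The AUC of a ROC curve recited in aspect 57 equals the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` and the example scores are hypothetical, and the quadratic loop is chosen for clarity rather than efficiency.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (1 = malignant, 0 = benign)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one malignant and one benign case")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores for illustration: 3 of the 4 malignant/benign pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An AUC of 0.50 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the recited thresholds start at about 0.50.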
- Aspect 58 is directed to a system for assessing a lung nodule of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample of a patient of a plurality of lung disease-associated genes, selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; receive, as an output of the machine-learning model, the inference indicating whether
- In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4.
- In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 58 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 58 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- Aspect 59 is directed to a non-transitory computer-readable medium storing executable instructions for assessing a lung nodule of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample of a patient of a plurality of lung disease-associated genes, selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; receive, as an output of the machine-learning model, the inference indicating whether the
- In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4.
- In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 59 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 59 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- Aspect 60 is directed to a method for determining a gene set capable of classifying a lung nodule as benign or malignant without performing a biopsy, the method comprising: obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a lung nodule, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on
- In some embodiments, the respective individual reference data set of Aspect 60 comprises i) gene expression measurements of the plurality of genes of the reference biological sample, and ii) data regarding whether the lung nodule of the reference subject is benign or malignant; and the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes.
- In some embodiments, the respective individual reference data set of Aspect 60 comprises i) gene expression measurements of the plurality of genes of the reference biological sample, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject; and the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes and the one or more clinical characteristics.
- Aspect 61 is directed to the method of aspect 60, wherein the plurality of genes comprises at least 2 genes selected from the group of genes listed in Table 9.
- Aspect 62 is directed to a method for developing a trained machine learning model capable of inferring whether a lung nodule of a patient is benign or malignant, the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a lung nodule, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics listed in Table 6 of the reference subject, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a lung nodule is benign or malignant, based
- In some embodiments, the respective first individual reference data set of Aspect 62 comprises i) gene expression measurements of the plurality of genes of the reference biological sample, and ii) data regarding whether the lung nodule of the reference subject is benign or malignant; and the first machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes.
- In some embodiments, the respective first individual reference data set of Aspect 62 comprises i) gene expression measurements of the plurality of genes of the reference biological sample, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject; and the first machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes and the one or more clinical characteristics.
- Aspect 63 is directed to the method of aspect 62, wherein the plurality of genes comprises at least 2 genes selected from the group of genes listed in Table 9.
- Aspect 64 is directed to the method of any one of aspects 62 to 63, wherein the A predictors have top 5 to 200 feature importance values.
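One way to read aspect 64 is that the A predictors are the features ranked highest by a feature importance measure, with A between 5 and 200. That reading is an assumption, since the text defining the A predictors is not reproduced here. Under that assumption, the selection step can be sketched as below; the function name `top_predictors`, the feature names, and the importance values are hypothetical.

```python
from typing import Dict, List


def top_predictors(importances: Dict[str, float], a: int) -> List[str]:
    """Return the names of the `a` features with the highest importance values."""
    # The 5-to-200 bound mirrors the range recited in the aspect (interpretation assumed).
    if not 5 <= a <= 200:
        raise ValueError("a must be between 5 and 200")
    # Sort by descending importance; break ties by name so the result is deterministic.
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:a]]


# Hypothetical importance values, e.g. as might come from a first trained model.
example_importances = {
    "GENE_A": 0.30,
    "GENE_B": 0.05,
    "GENE_C": 0.20,
    "age_years": 0.15,
    "GENE_D": 0.10,
    "GENE_E": 0.12,
    "nodule_size_mm": 0.08,
}
selected = top_predictors(example_importances, a=5)
```

The selected names could then define the reduced input used to train a second model, in the spirit of the first and second machine learning models of aspect 71.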
- Aspect 65 is directed to the method of any one of aspects 62 to 64, wherein the trained machine learning model has an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 66 is directed to the method of any one of aspects 62 to 65, wherein the trained machine learning model has a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 67 is directed to the method of any one of aspects 62 to 66, wherein the trained machine learning model has a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 68 is directed to the method of any one of aspects 62 to 67, wherein the trained machine learning model has a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 69 is directed to the method of any one of aspects 62 to 68, wherein the trained machine learning model has a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 70 is directed to the method of any one of aspects 62 to 69, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- Aspect 71 is directed to the method of any one of aspects 62 to 70, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
- Aspect 72 is directed to a method for assessing a lung nodule of a patient, the method comprising: (a) obtaining a data set comprising measurement data of the patient of one or more of the A predictors of any one of aspects 62 to 64; (b) providing the data set as an input to a trained machine-learning model trained according to the methods of any one of aspects 62 to 71 to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung nodule; and (d) electronically outputting a report classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
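Steps (a) to (d) of aspect 72 can be sketched end to end as follows. This is an illustrative assumption rather than a disclosed implementation: the function `assess_nodule`, the JSON report layout, the 0.5 threshold, and the stand-in `toy_model` (which simply averages its inputs) are all hypothetical, and any trained model returning a value in [0, 1] could be substituted.

```python
import json
from typing import Callable, Sequence


def assess_nodule(
    patient_id: str,
    features: Sequence[float],
    model: Callable[[Sequence[float]], float],
    threshold: float = 0.5,
) -> str:
    """Run steps (b)-(d): feed the data set to the model, read the inference, emit a report."""
    confidence = model(features)  # steps (b) and (c): provide input, receive inference
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("model must return a confidence value between 0 and 1")
    classification = (
        "malignant lung nodule" if confidence >= threshold else "benign lung nodule"
    )
    # Step (d): an electronic report, serialized here as JSON (format is an assumption).
    return json.dumps(
        {
            "patient_id": patient_id,
            "classification": classification,
            "confidence_malignant": confidence,
        },
        sort_keys=True,
    )


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model: the clipped mean of the inputs (not a real classifier)."""
    mean = sum(features) / len(features)
    return min(1.0, max(0.0, mean))


# Step (a): a hypothetical, already-assembled data set for one patient.
report = assess_nodule("patient-001", [0.9, 0.7, 0.8], toy_model)
benign_report = assess_nodule("patient-002", [0.1, 0.2, 0.3], toy_model)
```

Passing the model in as a callable keeps the reporting logic independent of which of the listed model types was used for training.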
- Aspect 73 is directed to the method of aspect 72, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof.
- Aspect 74 is directed to the method of any one of aspects 72 to 73, wherein the patient has lung cancer.
- Aspect 75 is directed to the method of any one of aspects 72 to 73, wherein the patient does not have lung cancer.
- Aspect 76 is directed to the method of any one of aspects 72 to 73, wherein the patient is at elevated risk of having lung cancer.
- Aspect 77 is directed to the method of any one of aspects 72 to 74 and 76, wherein the patient is asymptomatic for lung cancer.
- Aspect 78 is directed to the method of any one of aspects 72 to 74, 76 and 77, further comprising administering a treatment based on the patient’s lung nodule being classified as a malignant nodule.
- Aspect 79 is directed to the method of aspect 78, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
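Steps (a) to (d) of Aspect 72 can be sketched as a small pipeline. The following is a minimal illustration only, not the implementation described in this disclosure: the predictor names (`GENE_A`, `GENE_B`, `nodule_size_mm`), the 0.5 decision threshold, and the `toy_model` stand-in for a trained model are all hypothetical assumptions introduced for the example.

```python
from typing import Callable, Dict


def assess_nodule(
    data_set: Dict[str, float],
    model: Callable[[Dict[str, float]], float],
    threshold: float = 0.5,  # assumed cut-off, not specified in the source
) -> Dict[str, object]:
    """Run steps (b) to (d): feed the data set to a model and build a report."""
    # (b) and (c): provide the data set as input and receive the inference.
    score = model(data_set)
    # (d): turn the inference into a report entry.
    label = "malignant" if score >= threshold else "benign"
    return {"classification": label, "score": score}


def toy_model(data_set: Dict[str, float]) -> float:
    """Hypothetical placeholder for a trained model; returns a score in [0, 1]."""
    return 0.9 if data_set.get("nodule_size_mm", 0.0) > 20.0 else 0.1


# (a): a made-up data set of predictor measurements for one patient.
data_set = {"GENE_A": 2.3, "GENE_B": 0.7, "nodule_size_mm": 25.0}
report = assess_nodule(data_set, toy_model)
print(report)
```

Passing the model in as a callable keeps the reporting step independent of which of the listed algorithms was used for training.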
- Aspect 80 is directed to a method for treating lung cancer in a patient having a lung nodule, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung no
- In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4.
- In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 80 comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- In some embodiments, the data set of Aspect 80 comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
- Aspect 81 is directed to the method of aspect 80, wherein the at least two lung disease-associated genes are selected from the group of genes listed in Table 7.
- Aspect 82 is directed to the method of aspect 80 or 81, wherein the one or more clinical characteristics comprise size of the nodule, age of the patient, and presence of the nodule in the lung upper lobe.
- Aspect 83 is directed to the method of any one of aspects 80 to 82, wherein the machine-learning model is developed using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
- Aspect 84 is directed to the method of any one of aspects 80 to 83, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- Aspect 85 is directed to the method of any one of aspects 80 to 84, wherein the inference includes a confidence value between 0 and 1 that the lung nodule is malignant.
- Aspect 86 is directed to the method of any one of aspects 80 to 85, wherein the at least two lung disease-associated genes comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295 genes selected from the group of genes listed in Table 4.
- Aspect 87 is directed to the method of any one of aspects 80 to 86, wherein the at least two lung disease-associated genes comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- Aspect 88 is directed to the method of any one of aspects 80 to 87, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 89 is directed to the method of any one of aspects 80 to 88, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 90 is directed to the method of any one of aspects 80 to 89, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 91 is directed to the method of any one of aspects 80 to 90, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 92 is directed to the method of any one of aspects 80 to 91, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 93 is directed to the method of any one of aspects 80 to 92, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
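The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited in Aspects 88 to 92 all derive from the same four confusion-matrix counts. The sketch below shows the standard definitions, assuming labels are encoded as 1 for malignant and 0 for benign; the encoding, the function name `binary_metrics`, and the example labels are assumptions for illustration and are not data from this disclosure.

```python
import math
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute classification metrics with 1 = malignant and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant called malignant
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign called benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign called malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant called benign
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


# Made-up labels for ten nodules (four malignant, six benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Unlike sensitivity and specificity, the predictive values depend on the proportion of malignant cases in the evaluated cohort, so they should be read together with that proportion.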
- Aspect 94 is directed to a method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the method comprising: (a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics;
- In some embodiments, the respective individual reference data set of Aspect 94 comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant and iii) clinical characteristics data of one or more clinical characteristics of the reference subject, and the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and clinical characteristics data of the one or more clinical characteristics.
- In some embodiments, the respective individual reference data set of Aspect 94 comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes.
- Aspect 95 is directed to the method of aspect 94, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor or brain tumor.
- Aspect 96 is directed to the method of aspect 94, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- Aspect 97 is directed to the method of aspect 94, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- Aspect 98 is directed to the method of aspect 94, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- Aspect 99 is directed to the method of aspect 94, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
- Aspect 100 is directed to the method of aspect 94 or 96, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Aspect 101 is directed to the method of aspect 94 or 97, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Aspect 102 is directed to the method of aspect 94 or 98, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- Aspect 103 is directed to the method of aspect 94 or 99, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- Aspect 104 is directed to a method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and
- In some embodiments, the respective first individual reference data set of aspect 104 comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant and iii) clinical characteristics data of one or more clinical characteristics of the reference subject, and the first machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and clinical characteristics data of the one or more clinical characteristics.
- In some embodiments, the respective individual reference data set of aspect 104 comprises i) gene expression measurements of a plurality of genes of the reference biological sample from the reference subject having the reference solid tumor, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes.
- Aspect 105 is directed to the method of aspect 104, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor or brain tumor.
- Aspect 106 is directed to the method of aspect 104, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- Aspect 107 is directed to the method of aspect 104, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- Aspect 108 is directed to the method of aspect 104, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- Aspect 109 is directed to the method of aspect 104, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
- Aspect 110 is directed to the method of aspect 104 or 106, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Aspect 111 is directed to the method of aspect 104 or 107, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Aspect 112 is directed to the method of aspect 104 or 108, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- Aspect 113 is directed to the method of aspect 104 or 109, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- Aspect 114 is directed to the method of any one of aspects 104 to 113, wherein the A predictors have top 5 to 200 feature importance values.
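Selecting the A predictors of Aspect 114 by feature importance amounts to ranking the candidate predictors and keeping the highest-scoring ones. The sketch below assumes that a fitted model has already produced one importance value per predictor; the predictor names and importance values are invented for illustration, and A = 5 is chosen only because it is the lower end of the recited 5 to 200 range.

```python
from typing import Dict, List


def top_predictors(importances: Dict[str, float], a: int) -> List[str]:
    """Return the names of the `a` predictors with the largest importance values."""
    if a < 1:
        raise ValueError("a must be a positive integer")
    # Sort by importance, largest first; Python's sort is stable, so ties
    # keep their original insertion order.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:a]]


# Hypothetical importance values, e.g. as reported by a tree-based model.
importances = {
    "GENE_A": 0.30,
    "GENE_B": 0.05,
    "GENE_C": 0.20,
    "age": 0.15,
    "GENE_D": 0.10,
    "GENE_E": 0.12,
    "GENE_F": 0.08,
}
selected = top_predictors(importances, a=5)
print(selected)
```

The selected names can then be used to subset the reference data before any further model is trained on the reduced predictor set.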
- Aspect 115 is directed to the method of any one of aspects 104 to 114, wherein the trained machine learning model has an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 116 is directed to the method of any one of aspects 104 to 115, wherein the trained machine learning model has a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 117 is directed to the method of any one of aspects 104 to 116, wherein the trained machine learning model has a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 118 is directed to the method of any one of aspects 104 to 117, wherein the trained machine learning model has a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 119 is directed to the method of any one of aspects 104 to 118, wherein the trained machine learning model has a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 120 is directed to the method of any one of aspects 104 to 117, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
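The ROC AUC recited in Aspect 120 can be computed without drawing the curve, because it equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, with ties counted as one half. The sketch below uses that pairwise formulation; the label encoding (1 for malignant, 0 for benign) and the example scores are assumptions for illustration only.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up model scores for three malignant and three benign cases.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

An AUC of 0.50 corresponds to a model that ranks cases no better than chance, which is why the recited thresholds start there.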
- Aspect 121 is directed to the method of any one of aspects 104 to 120, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
- Aspect 122 is directed to a method for assessing a solid tumor of a patient, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of aspect 94, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor.
- In some embodiments, the data set of Aspect 122 comprises gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of Aspect 94.
- In some embodiments, the data set of Aspect 122 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of Aspect 94, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
- Aspect 123 is directed to the method of aspect 122, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- Aspect 124 is directed to the method of aspect 122, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 96, or 100, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant.
- Aspect 125 is directed to the method of aspect 122, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 97, or 101, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant.
- Aspect 126 is directed to the method of aspect 122, wherein the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 98, or 102, wherein the gene set is capable of classifying the brain tumor as benign or malignant.
- Aspect 127 is directed to the method of aspect 122, wherein the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 99, or 103, wherein the gene set is capable of classifying the kidney tumor as benign or malignant.
- Aspect 128 is directed to the method of aspect 122 or 124, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Aspect 129 is directed to the method of aspect 122 or 125, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Aspect 130 is directed to the method of aspect 122 or 126, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- Aspect 131 is directed to the method of aspect 122 or 127, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- Aspect 132 is directed to the method of any one of aspects 122 to 131, wherein the machine-learning model is trained according to the method of any one of aspects 104 to 121.
- Aspect 133 is directed to the method of any one of aspects 122 to 132, wherein the patient has cancer.
- Aspect 134 is directed to the method of any one of aspects 122 to 132, wherein the patient does not have cancer.
- Aspect 135 is directed to the method of any one of aspects 122 to 132, wherein the patient is at an elevated risk of having cancer.
- Aspect 136 is directed to the method of any one of aspects 122 to 133 and 135, wherein the patient is asymptomatic for cancer.
- Aspect 137 is directed to the method of any one of aspects 133 to 136, wherein the cancer is pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
- Aspect 138 is directed to the method of any one of aspects 122 to 137, further comprising administering a treatment based on the patient’s solid tumor being classified as malignant.
- Aspect 139 is directed to the method of aspect 138, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- Aspect 140 is directed to the method of any one of aspects 122 to 139, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant.
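One way an inference can carry a confidence value between 0 and 1, as in Aspect 140, is to pass a weighted sum of the predictors through the logistic function. Logistic regression is among the algorithms listed in this disclosure, but the sketch below is only an illustration under stated assumptions: the weights, bias, and predictor names are hypothetical and do not come from any trained model described here.

```python
import math
from typing import Dict


def confidence_malignant(
    features: Dict[str, float], weights: Dict[str, float], bias: float
) -> float:
    """Map a linear score to a value strictly between 0 and 1 (logistic function)."""
    z = bias + sum(weights[name] * features[name] for name in weights)
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical coefficients and measurements for two predictors.
weights = {"GENE_A": 1.2, "GENE_B": -0.8}
features = {"GENE_A": 2.0, "GENE_B": 1.0}
confidence = confidence_malignant(features, weights, bias=-0.5)
print(round(confidence, 4))
```

A value near 1 indicates higher model confidence that the tumor is malignant; the cut-off used to convert the value into a benign or malignant call is a separate design choice.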
- Aspect 141 is directed to the method of any one of aspects 122 to 140, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 143 is directed to the method of any one of aspects 122 to 142, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 144 is directed to the method of any one of aspects 122 to 143, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 145 is directed to the method of any one of aspects 122 to 144, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Aspect 146 is directed to the method of any one of aspects 122 to 145, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- ROC receiver operating characteristic
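The performance measures named in the preceding aspects (accuracy, specificity, positive predictive value, negative predictive value, and ROC AUC, together with sensitivity) all follow from comparing predictions with known labels. The sketch below is a minimal illustration, assuming labels encoded as 1 for malignant and 0 for benign; the example labels and scores are made up, and the 0.5 cutoff is an assumption rather than a value from the disclosure.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, PPV and NPV (1 = malignant, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant called malignant
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign called benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign called malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant called benign

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a malignant sample outscores a benign one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one malignant and one benign sample")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and model confidences, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise formulation of AUC is quadratic in the number of samples, which is acceptable for a small illustration but would normally be replaced by a rank-based computation on larger cohorts.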
- Aspect 147 is directed to a method for treating cancer in a patient having a solid tumor, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on the patient’s tumor being classified as a malignant tumor.
- the data set of aspect 147 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
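The data set described for aspect 147 combines gene expression measurements of at least 2 genes from a gene set with optional clinical characteristics. The sketch below shows one way such an input could be assembled before it is passed to a model. It is an illustration under stated assumptions: the function name, gene names, and clinical field names are hypothetical, and a real model would additionally require a fixed feature layout.

```python
from typing import Dict, List, Optional, Sequence


def build_data_set(expression: Dict[str, float],
                   gene_set: Sequence[str],
                   clinical: Optional[Dict[str, float]] = None,
                   clinical_fields: Sequence[str] = ()) -> List[float]:
    """Assemble model input from gene expression plus optional clinical data."""
    # Keep only genes that belong to the gene set and were actually measured.
    measured = [gene for gene in gene_set if gene in expression]
    if len(measured) < 2:
        raise ValueError("need measurements for at least 2 genes of the gene set")
    features = [float(expression[gene]) for gene in measured]
    # Clinical characteristics are optional; append them only when provided.
    if clinical is not None:
        features += [float(clinical[field]) for field in clinical_fields]
    return features


# Hypothetical gene set and patient values, for illustration only.
gene_set = ["GENE_A", "GENE_B", "GENE_C"]
expression = {"GENE_A": 1.0, "GENE_B": 2.0}

expression_only = build_data_set(expression, gene_set)
with_clinical = build_data_set(expression, gene_set,
                               clinical={"age": 61}, clinical_fields=["age"])
print(expression_only, with_clinical)
```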
- Aspect 148 is directed to the method of aspect 147, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
- Aspect 149 is directed to a system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the dataset as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign
- PBMCs peripheral blood mononuclear cells
- the data set of aspect 149 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103. In certain embodiments, the data set of aspect 149 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
- Aspect 150 is directed to a non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate
- the data set of aspect 150 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103. In certain embodiments, the data set of aspect 150 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
- Numbered embodiments
- 1. A method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy comprising: a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics; c) determining feature importance values of the
- the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- 5. The method of embodiment 1, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- the method of embodiment 1 or 4, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- 9. The method of embodiment 1 or 5, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- a method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is
- the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- the trained machine learning model has a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model has a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model has a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- AUC Area-Under-Curve
- a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression
- SVM support vector machine
- GBM gradient boosted machine
- kNN k nearest neighbors
- GLM generalized linear model
- NB naïve Bayes
- a method for assessing a solid tumor of a patient comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of embodiment 1, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor.
- the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 3, or 7, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant.
- the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 4, or 8, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant.
- 35. The method of embodiment 29 or 31, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- The method of any one of embodiments 29 to 48, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- 50. The method of any one of embodiments 29 to 49, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- 51. The method of any one of embodiments 29 to 50, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- 52. The method of any one of embodiments 29 to 51, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- a method for treating cancer in a patient having a solid tumor comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on the patient’s tumor being classified as a malignant tumor.
- a system for assessing a solid tumor of a patient comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the dataset as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the mal
- a non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
- a method for obtaining a gene set capable of classifying whether a patient has cancer comprising: (a) providing a dataset as an input to a machine learning classifier, said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of the gene modules form the features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer.
- the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurements from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules.
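The module-building steps above (select the M most variably expressed genes, then cluster them) can be sketched as follows. This is a minimal illustration under stated assumptions: the clustering method is not fixed in this passage, so the sketch swaps in a simple correlation-threshold grouping; the gene names, expression values, and the `min_corr` cutoff are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance
from typing import Dict, List, Sequence


def top_variable_genes(expression: Dict[str, Sequence[float]], m: int) -> List[str]:
    """Return the M genes with the highest variance across reference samples."""
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(expression, key=lambda g: (-pvariance(expression[g]), g))
    return ranked[:m]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_into_modules(expression: Dict[str, Sequence[float]],
                         genes: Sequence[str],
                         min_corr: float = 0.8) -> List[List[str]]:
    """Group genes whose profiles are linked by correlation >= min_corr."""
    modules: List[List[str]] = []
    unassigned = list(genes)
    while unassigned:
        seed = unassigned.pop(0)
        module, frontier = [seed], [seed]
        while frontier:  # grow the module through correlated neighbours
            current = frontier.pop()
            linked = [g for g in unassigned
                      if pearson(expression[current], expression[g]) >= min_corr]
            for g in linked:
                unassigned.remove(g)
                module.append(g)
                frontier.append(g)
        modules.append(sorted(module))
    return sorted(modules)


# Hypothetical expression profiles (one value per reference sample).
expression = {
    "G1": [1, 2, 3, 4],
    "G2": [2, 4, 6, 8],
    "G3": [4, 3, 2, 1],
    "G4": [5, 5, 5, 5.1],  # nearly constant, so it is dropped when M = 4
    "G5": [8, 6, 4, 2],
}
selected = top_variable_genes(expression, m=4)
modules = cluster_into_modules(expression, selected)
print(selected, modules)
```

Dedicated co-expression tools would normally replace the threshold grouping used here; the sketch only shows how variance filtering and clustering fit together.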
- the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157,
- the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- The method of any one of embodiments 67 to 70, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- 72. The method of any one of embodiments 67 to 71, wherein the method classifies whether the patient has cancer with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- The method of any one of embodiments 67 to 72, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- The method of any one of embodiments 67 to 73, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set.
- the method of embodiment 75 further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference.
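The two steps above (receive the inference, then electronically output a report) can be sketched as a small serialization routine. This is an illustration only: the report fields, the `sample_id` identifier, and the 0.5 decision threshold are assumptions and are not specified in the disclosure.

```python
import json


def build_report(sample_id: str, confidence_cancer: float, threshold: float = 0.5) -> str:
    """Turn a model inference into a JSON report classifying the patient sample."""
    if not 0.0 <= confidence_cancer <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    report = {
        "sample_id": sample_id,  # hypothetical identifier
        # Classify as cancer when the confidence reaches the assumed threshold.
        "classification": "cancer" if confidence_cancer >= threshold else "no cancer",
        "confidence_cancer": round(confidence_cancer, 3),
        "threshold": threshold,
    }
    return json.dumps(report, indent=2)


report_text = build_report("S-001", 0.91)
print(report_text)
```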
- the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer.
- the cancer is a blood cancer.
- the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- 86. A method for classifying whether a patient has cancer comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) of any one of embodiments 58-66 as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine-learning model, the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient.
- the patient dataset comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106,
- the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- GSVA gene set variation analysis
- GSEA gene set enrichment analysis
- MEGENA multiscale embedded gene co-expression network analysis
- WGCNA weighted gene co-expression network analysis
- differential expression analysis, Z-score, log2 expression analysis, or any combination thereof.
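Two of the transformations named above, log2 expression and Z-score, can be illustrated directly. The sketch below is a minimal example under stated assumptions: the pseudocount of 1 and the use of the population standard deviation are illustrative choices, and the counts are made up.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def log2_expression(counts: Sequence[float], pseudocount: float = 1.0) -> List[float]:
    """Log2-transform expression counts; the pseudocount avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant profile carries no relative signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up counts for one gene across four samples.
counts = [0, 3, 15, 63]
log_values = log2_expression(counts)
standardized = z_scores(log_values)
print(log_values, standardized)
```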
- any one of embodiments 86 to 90 wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- LOG logistic regression
- RF Random Forest
- LDA linear discriminant analysis
- DTREE decision tree learning
- ADB adaptive boosting
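As a concrete instance of one algorithm from the list above, the sketch below implements a minimal k nearest neighbors (kNN) classifier. It is an illustration only: the two-gene feature vectors, the labels, and the choice of k = 3 are hypothetical, and nothing here is presented as the model used in the disclosure.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def knn_classify(train: Sequence[Tuple[Sequence[float], str]],
                 query: Sequence[float],
                 k: int = 3) -> str:
    """Label a sample by majority vote among its k nearest reference samples."""
    # Sort reference samples by Euclidean distance to the query sample.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote over the k closest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference samples: (expression of two genes, known label).
reference = [
    ([1.0, 1.0], "no cancer"),
    ([1.2, 0.8], "no cancer"),
    ([0.9, 1.1], "no cancer"),
    ([3.0, 3.2], "cancer"),
    ([3.1, 2.9], "cancer"),
    ([2.8, 3.0], "cancer"),
]
first = knn_classify(reference, [2.9, 3.1])
second = knn_classify(reference, [1.0, 0.9])
print(first, second)
```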
- The method of any one of embodiments 86 to 91, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- The method of any one of embodiments 86 to 92, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- The method of any one of embodiments 86 to 93, wherein the method classifies whether the patient has cancer with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- 95. The method of any one of embodiments 86 to 94, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- The method of any one of embodiments 86 to 95, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the cancer is a solid cancer.
- the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sar
- the method of any one of embodiments 86 to 98, wherein the cancer is a blood cancer.
- the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- the method of embodiment 103, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
- the method of embodiment 80 or 103, wherein the treatment comprises ABVD, AC, ATO, ATRA, Abemaciclib (Verzenios), Abiraterone (Zytiga), Abraxane, Abstral, Acalabrutinib, Actimorph, Actinomycin D, Actiq, Adriamycin, Afatinib (Giotrif), Afinitor, Aldara, Aldesleukin (IL-2, Proleukin or interleukin 2), Alectinib, Alectinib (Alecensa), Alemtuzumab (Campath, MabCampath), Alkeran, Amsacrine (Amsidine, m-AMSA), Amsidine, Anastrozole (Arimidex), Apalutamide, Ara C, Arimidex, Aromasin, Arsenic trioxide (Trisenox,
- Biopsy of the solid tumor can be relatively difficult to perform.
- Non-limiting examples of solid tumors for which biopsy is relatively difficult to perform include tumors for which performing a biopsy and/or collecting samples for biopsy requires invasive and/or painful surgery.
- the methods and systems of the invention can be used to analyze solid tumors for which obtaining a biopsy is surgically difficult, clinically invasive, dangerous for the patient, or a combination thereof.
- the methods and systems of the invention can be used as described to classify a solid tumor as malignant or benign, with a high accuracy, sensitivity, specificity, positive predictive value, negative predictive value, or a combination thereof, without the need for obtaining a biopsy.
- a solid tumor appropriate for analysis using the methods and systems of the present invention can be identified by one of skill in the art as desired.
- the solid tumor is a sarcoma, carcinoma, or lymphoma.
- the solid tumor is a lung, pancreatic, ovarian, kidney or brain tumor.
- the machine learning (ML) methods of the current disclosure can classify the tumor.
- the biological sample can be a blood sample.
- the methods can have relatively high accuracy, specificity, selectivity, positive predictive value, and/or negative predictive value.
- Attorney Docket No.225234-718601/PCT predictive power e.g. accuracy, specificity, selectivity, positive predictive value, and/or negative predictive value
- a treatment of cancer can be administered based on the results from machine learning classification.
- One of the potential benefits of certain embodiments of the current disclosure is that a biopsy can be avoided in cases where the ML classification model outputs a high confidence that a solid tumor is benign or malignant.
- the benefit here is that in conventional techniques, a biopsy is always performed as it is the only way to determine whether the solid tumor is benign or malignant.
- a biopsy procedure carries inherent risks, and the risks of a biopsy may outweigh the benefits for some patients but not others, based on their individual circumstances.
- the ML model can be used to better inform the clinician of whether the benefits of getting the biopsy outweigh the risks of a biopsy procedure (e.g., a situation where a biopsy can be avoided, can include where a patient is (1) at heightened risk of complications of a biopsy due to some other health-related condition or the location of the tumor and (2) the blood sample indicates that the solid tumor has high likelihood of being benign or malignant).
- the ability to avoid an unnecessary biopsy can also be considered a technical advantage and/or practical benefit.
- the blood sample can be a whole blood sample, blood cells, serum, plasma, or any combination thereof.
- Tables 1, 2, 3, 4, 5, and 9 list lung disease-associated genes.
- Table 7 lists 31 lung disease-associated genes and 3 clinical characteristics.
- Table 8 lists 21 lung disease-associated genes and 1 clinical characteristic.
- Table 6 lists 8 clinical characteristics.
- the present disclosure provides a method for assessing a solid tumor of a patient.
- the method can include any one of, any combination of, or all of steps a, b, c, and d.
- Step a can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- Step b can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor.
- Step c can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor.
- Step d can include electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
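Steps a through d can be sketched in code as follows. This is a minimal illustration only, not the disclosed model: `PatientDataSet`, `ToyLogisticModel`, and `assess_solid_tumor` are hypothetical names, the logistic form is an assumed stand-in for any trained classifier, and the weights are invented for the example. BCAT1 and CRCP are used only because they appear among the genes listed for Table 7.

```python
import json
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PatientDataSet:
    """Step a: expression measurements plus optional clinical characteristics."""
    gene_expression: Dict[str, float]
    clinical: Dict[str, float] = field(default_factory=dict)


class ToyLogisticModel:
    """Hypothetical stand-in for a trained model; the weights are invented."""

    def __init__(self, weights: Dict[str, float], bias: float = 0.0):
        self.weights = weights
        self.bias = bias

    def infer(self, features: Dict[str, float]) -> float:
        # Logistic score in [0, 1]; higher means more likely malignant.
        z = self.bias + sum(w * features.get(name, 0.0)
                            for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))


def assess_solid_tumor(data: PatientDataSet, model: ToyLogisticModel,
                       threshold: float = 0.5) -> str:
    # Step b: provide the data set as an input to the model.
    features = {**data.gene_expression, **data.clinical}
    # Step c: receive the inference as the model output.
    confidence = model.infer(features)
    # Step d: electronically output a report (here, a JSON string).
    label = "malignant" if confidence >= threshold else "benign"
    return json.dumps({"classification": label,
                       "confidence_malignant": round(confidence, 3)})


# Example with made-up expression values and weights.
model = ToyLogisticModel({"BCAT1": 1.0, "CRCP": -0.5})
report = assess_solid_tumor(PatientDataSet({"BCAT1": 2.0, "CRCP": 1.0}), model)
```

The 0.5 threshold is an assumption made for the sketch; the disclosure does not fix a cut-off.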
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, and the like.
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g.
- the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 4.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 1.
- the at least two lung disease-associated genes are selected from the group of genes listed in Table 2. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 3. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 5. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 7. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 8. In some embodiments, the at least two lung disease-associated genes, e.g. of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the at least two lung disease-associated genes, e.g. of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the at least two lung disease-associated genes, e.g. of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the at least two lung disease-associated genes, e.g. of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the at least two lung disease-associated genes include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5.
- the at least two lung disease-associated genes, e.g. of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the at least two lung disease-associated genes of step a are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the at least two lung disease-associated genes include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the at least two lung disease-associated genes of step a are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- Table A provides examples of Gene ID numbers for genes listed herein in the Tables, including Tables 7 and 8, as described in OMIM® - Online Mendelian Inheritance in Man (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD) and in the National Center for Biotechnology Information gene database (NCBI, U.S. National Library of Medicine 8600 Rockville Pike, Bethesda MD, 20894 USA), each incorporated by reference herein in their entirety.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include the size of the tumor (e.g. lung nodule).
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the patient include the age of the patient.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, and the one or more clinical characteristics include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the at least two lung disease-associated genes of step a comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the at least two lung disease-associated genes of step a consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof.
- the biological sample is a stool sample, or any derivative thereof.
- the biological sample is a CSF sample, or any derivative thereof.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
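The performance measures named above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) have standard definitions, which the following sketch computes from labels and model outputs. It assumes the convention 1 = malignant and 0 = benign; the function names are illustrative and the example labels are made up, not results from the disclosure.

```python
from typing import Dict, Sequence


def _ratio(num: int, den: int) -> float:
    # Undefined ratios (empty denominator) are reported as NaN.
    return num / den if den else float("nan")


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix metrics, with 1 = malignant and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a malignant case outscores a benign one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    # Ties between a malignant and a benign score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1],
                                 [1, 1, 0, 0, 0, 1, 0, 1])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would scale better on large cohorts.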
- the inference from the machine learning model can include a confidence value between 0 and 1, such as, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the solid tumor is malignant. Higher confidence values may be correlated with a higher likelihood that the solid tumor is malignant.
- a malignant tumor may be characterized by having the ability to metastasize or grow invasively, which may be in contrast to a benign tumor.
- the patient has a cancer. In some embodiments, the patient does not have cancer. In some embodiments, the patient is suspected of having a cancer. In some embodiments, the patient is at an elevated risk of having a cancer.
- the patient is asymptomatic for a cancer.
- Cancer can be lung cancer, pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
- the patient has pancreatic cancer.
- the patient does not have pancreatic cancer.
- the patient is suspected of having pancreatic cancer.
- the patient is at an elevated risk of having a pancreatic cancer.
- the patient is asymptomatic for pancreatic cancer.
- the patient has ovarian cancer.
- the patient does not have ovarian cancer.
- the patient is suspected of having ovarian cancer.
- the patient is at an elevated risk of having ovarian cancer.
- the patient is asymptomatic for ovarian cancer. In some embodiments, the patient has kidney cancer. In some embodiments, the patient does not have kidney cancer. In some embodiments, the patient is suspected of having kidney cancer. In some embodiments, the patient is at an elevated risk of having a kidney cancer. In some embodiments, the patient is asymptomatic for kidney cancer. In some embodiments, the patient has brain cancer. In some embodiments, the patient does not have brain cancer. In some embodiments, the patient is suspected of having brain cancer. In some embodiments, the patient is at an elevated risk of having brain cancer. In some embodiments, the patient is asymptomatic for brain cancer. In some embodiments, the patient has a lung cancer. In some embodiments, the patient does not have lung cancer.
- the patient is suspected of having a lung cancer. In some embodiments, the patient is at an elevated risk of having a lung cancer. In some embodiments, the patient is asymptomatic for a lung cancer. [0357] In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, a biopsy is performed. In some embodiments, a biopsy is not performed.
- the decision to perform a biopsy may be made by one of skill in the art, based on knowledge and experience, in view of the classification of the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
- the decision to perform a biopsy may depend in part on the confidence value of the inference.
- biopsy of the solid tumor of the patient is not performed.
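How a confidence value between 0 and 1 might feed into the biopsy decision can be sketched as below. This is an illustration under stated assumptions, not a clinical rule: the cut-offs `low` and `high` and the function name are hypothetical, the disclosure does not specify thresholds, and the decision itself remains with the clinician.

```python
def biopsy_decision_support(confidence_malignant: float,
                            elevated_biopsy_risk: bool,
                            low: float = 0.1,
                            high: float = 0.9) -> str:
    """Summarize a model confidence for clinician review.

    `low` and `high` are hypothetical cut-offs chosen for illustration.
    """
    if not 0.0 <= confidence_malignant <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("cut-offs must satisfy 0 <= low < high <= 1")

    if confidence_malignant >= high:
        call = "high-confidence malignant"
    elif confidence_malignant <= low:
        call = "high-confidence benign"
    else:
        # Mid-range confidence gives no basis for skipping a biopsy.
        return "indeterminate: model output does not support avoiding biopsy"

    if elevated_biopsy_risk:
        # Both conditions described in the text hold: elevated procedural
        # risk and a high-confidence call from the blood-based model.
        return call + ": biopsy may be avoidable, pending clinician review"
    return call + ": clinician weighs biopsy benefit against procedural risk"
```

Returning a text summary rather than a yes/no flag keeps the output advisory, consistent with the statement that the decision is made by one of skill in the art.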
- the method further contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
- the method contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor.
- the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient.
- the treatment can include one or more treatments of cancer.
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the treatment can be a treatment for the lung, pancreatic, ovarian, kidney, or brain cancer, respectively.
- the method includes administering a treatment to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor or the benign tumor.
- the method comprises administering a treatment of pancreatic cancer to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of ovarian cancer to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor or the benign tumor.
- the method comprises administering a treatment of kidney cancer to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of brain cancer to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor or the benign tumor.
- the method comprises administering a treatment of lung cancer to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor.
- the treatment can include one or more treatments of cancer.
- the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- the machine learning model of step b can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant.
- the machine-learning model, e.g. of step b, can generate the inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor, by comparing the data set to a reference data set.
- the machine-learning model can be trained using the reference data set according to the methods described herein.
- the reference data set can contain gene expression measurements of a plurality of reference biological samples from a plurality of reference subjects having a solid tumor, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, data regarding whether the solid tumors of the reference subjects are benign or malignant, and optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- a first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
- the reference data set contains a plurality of individual reference data sets.
- a respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of a reference biological sample from a reference subject having a reference solid tumor of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject.
- the plurality of individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different individual reference data sets are obtained from different reference subjects.
- each of the individual reference data sets contains i) gene expression measurements of a reference biological sample from one reference subject of the at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the lung nodule of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, and, wherein different individual reference data sets are obtained from different reference subjects.
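The structure of the reference data set described above can be sketched as a small data model. The class and function names are hypothetical, and the checks simply encode two statements from the text: different individual reference data sets come from different reference subjects, and the reference subjects include both benign and malignant cases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class IndividualReferenceDataSet:
    subject_id: str                    # one reference subject per data set
    gene_expression: Dict[str, float]  # i) expression measurements
    is_malignant: bool                 # ii) benign/malignant status
    clinical: Dict[str, float] = field(default_factory=dict)  # iii) optional


def build_training_matrix(
    reference: Sequence[IndividualReferenceDataSet],
    genes: Sequence[str],
    clinical_names: Sequence[str] = (),
) -> Tuple[List[List[float]], List[int]]:
    """Return (features, labels) with label 1 = malignant, 0 = benign."""
    ids = [r.subject_id for r in reference]
    if len(set(ids)) != len(ids):
        raise ValueError("each individual reference data set must come "
                         "from a different reference subject")
    labels = [int(r.is_malignant) for r in reference]
    if len(set(labels)) < 2:
        raise ValueError("reference subjects must include both benign "
                         "and malignant solid tumors")
    # Column order: the selected genes, then any clinical characteristics.
    features = [[r.gene_expression[g] for g in genes] +
                [r.clinical[c] for c in clinical_names]
                for r in reference]
    return features, labels


# Made-up values for two reference subjects.
reference = [
    IndividualReferenceDataSet("S1", {"BCAT1": 5.1, "CRCP": 2.0}, True,
                               {"age": 67.0}),
    IndividualReferenceDataSet("S2", {"BCAT1": 3.2, "CRCP": 2.4}, False,
                               {"age": 54.0}),
]
X, y = build_training_matrix(reference, ["BCAT1", "CRCP"], ["age"])
```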
- oversampling or undersampling correction can be made during training of the machine learning model.
- where a reference data set includes a greater number of samples identified as benign and a smaller number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples.
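The oversampling correction described above can be sketched as follows. This is a minimal illustration using random duplication of minority-class samples; the function name `oversample_minority`, the toy expression values, and the choice of random duplication (rather than any other oversampling technique) are assumptions and are not taken from the disclosure.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: each sample is a list of gene expression values.
samples = [[0.1, 1.2], [0.3, 0.9], [0.2, 1.1], [0.4, 1.0], [2.1, 0.2]]
labels = ["benign", "benign", "benign", "benign", "malignant"]

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```

In practice, a correction of this kind is usually applied only to the portion of the data used for training, so that duplicated samples do not also appear in the data used for validation.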
- the reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a pancreatic tumor
- the method can classify a pancreatic tumor as malignant or benign
- the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor
- the method can classify a pancreatic tumor as malignant or benign
- the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor
- the method can classify a pancreatic tumor as malignant or benign
- the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristic data of one or more characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor
- the method can classify an ovarian tumor as malignant or benign
- the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor
- the method can classify an ovarian tumor as malignant or benign
- the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor
- the method can classify an ovarian tumor as malignant or benign
- the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor
- the method can classify a kidney tumor as malignant or benign
- the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor
- the method can classify a kidney tumor as malignant or benign
- the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor
- the method can classify a kidney tumor as malignant or benign
- the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristic data of one or more characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor
- the method can classify a brain tumor as malignant or benign
- the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor
- the method can classify a brain tumor as malignant or benign
- the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor
- the method can classify a brain tumor as malignant or benign
- the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the genes of the data set and genes of the reference data set can at least partially overlap.
- clinical characteristics of the data set and clinical characteristics of the reference data set can at least partially overlap.
- the gene set capable of classifying a solid tumor (e.g., a pancreatic, ovarian, kidney, or brain tumor) as benign or malignant can be obtained or determined according to the methods (e.g., the method of steps a’, b’, c’ and/or d’) described herein.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the solid tumor is a lung tumor
- the at least two genes of the reference data set comprise the 31 genes listed in Table 7
- the one or more clinical characteristics of the reference data set comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the at least two genes of the reference data set consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the one or more clinical characteristics of the reference data set include the size of the nodule.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the one or more clinical characteristics of the reference data set include the age of the patient.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the one or more clinical characteristics of the reference data set include the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the one or more clinical characteristics of the reference data set include the size of the nodule, the age of the patient, the presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the method can classify a lung tumor as malignant or benign
- the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the genes of the data set and genes of the reference data set can at least partially overlap and/or the optional clinical characteristics of the data set and optional clinical characteristics of the reference data set can at least partially overlap.
- the reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the reference biological sample is a blood sample or any derivative thereof.
- the reference biological sample is PBMCs or any derivative thereof.
- the reference biological sample is a lung biopsy sample, or any derivative thereof.
- the reference biological sample is a nasal fluid sample, or any derivative thereof.
- the reference biological sample is a saliva sample, or any derivative thereof.
- the reference biological sample is a urine sample, or any derivative thereof.
- the reference biological sample is a stool sample, or any derivative thereof.
- the reference biological sample is a CSF sample, or any derivative thereof.
- the reference subjects can be human.
- Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
- the trained machine learning model, e.g., of step b, is trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm.
- the trained machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof.
- the trained machine learning model is trained using LOG.
- the trained machine learning model is trained using Ridge regression.
- the trained machine learning model is trained using Lasso regression.
- the trained machine learning model is trained using GLM.
- the trained machine learning model is trained using kNN.
- the trained machine learning model is trained using SVM.
- the trained machine learning model is trained using GBM.
- the trained machine learning model is trained using RF.
- the trained machine learning model is trained using NB.
- the trained machine learning model is trained using EN regression.
- the trained machine learning model is trained using a neural network.
- the trained machine learning model is trained using a deep learning algorithm.
- the trained machine learning model is trained using LDA.
- the trained machine learning model is trained using DTREE.
- the trained machine learning model is trained using ADB.
- the method comprises determining a likelihood of the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
- the likelihood is about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 100%.
- the likelihood is at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
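One way a likelihood of this kind could be produced is from a logistic regression model, which is among the algorithms listed above. The sketch below is illustrative only: the weights, intercept, feature values, and 0.5 decision threshold are hypothetical and are not taken from the disclosure.

```python
import math


def malignancy_likelihood(features, weights, intercept):
    """Return the logistic-model probability that a tumor is malignant."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def classify(likelihood, threshold=0.5):
    """Map a likelihood to a class label using an assumed decision threshold."""
    return "malignant" if likelihood >= threshold else "benign"


# Hypothetical model parameters and patient measurements (two expression features).
weights = [0.8, -0.5]
intercept = -0.2
patient_features = [2.0, 1.0]

likelihood = malignancy_likelihood(patient_features, weights, intercept)
label = classify(likelihood)
print(f"{label} with likelihood {likelihood:.2%}")
```

Other listed algorithms, such as Random Forest or gradient boosted machines, can also output class probabilities, although those probabilities are computed differently.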
- the method further comprises monitoring the solid tumor of the patient, wherein the monitoring comprises assessing the solid tumor of the patient at a plurality of time points.
- a difference in the assessment of the solid tumor of the patient among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the solid tumor of the patient, (ii) a prognosis of the solid tumor of the patient, and (iii) an efficacy or non-efficacy of a course of treatment for treating the solid tumor of the patient.
- the plurality of time points comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 different time points.
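As a rough illustration of the monitoring described above, the change in an assessment between consecutive time points can be computed as below. The representation of an assessment as a (day, likelihood of malignancy) pair and the function name `likelihood_changes` are assumptions made for this sketch; the disclosure does not specify how a difference between assessments is quantified.

```python
def likelihood_changes(assessments):
    """Return (time_point, change) pairs between consecutive assessments.

    `assessments` is a list of (time_point, likelihood) tuples; it is sorted
    by time point before differences are taken.
    """
    ordered = sorted(assessments)
    return [
        (later[0], later[1] - earlier[1])
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical assessments: (day of sampling, likelihood of malignancy).
assessments = [(180, 0.31), (0, 0.62), (90, 0.55)]

changes = likelihood_changes(assessments)
for day, delta in changes:
    print(f"day {day}: change in likelihood {delta:+.2f}")
```

How a given change relates to diagnosis, prognosis, or treatment efficacy is a clinical question that this sketch does not attempt to answer.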
- the present disclosure provides a method for determining a gene set capable of classifying a solid tumor as benign or malignant.
- Gene expression measurements of one or more genes of the gene set of a biological sample (e.g., blood) from a patient can be used to classify a solid tumor of the patient as benign or malignant without performing a biopsy of the solid tumor.
- a biopsy of the solid tumor can be performed to confirm and/or follow up on the classification results obtained by using the gene expression measurement data.
- a biopsy of the solid tumor is not performed.
- the method can include any one of, any combination of, or all of steps a’, b’, c’ and d’.
- a reference data set can be obtained and/or provided.
- the reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- the reference data set can contain a plurality of individual reference data sets.
- a respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference tumor is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the reference subject.
- the plurality of individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different individual reference data sets can be obtained from different reference subjects.
- each of the individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects.
- a first portion of the plurality of reference subjects can have benign solid tumor, and a second portion of the plurality of reference subjects can have malignant solid tumor.
- a machine learning model can be trained using the reference data set to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics.
- the machine learning model can be trained using a training data set containing a first portion of the reference data set, and a validation data set containing a second portion of the reference data set.
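The division of a reference data set into a training portion and a validation portion can be sketched as follows. The stratified approach (keeping benign and malignant samples represented in both portions), the 25% validation fraction, and the function name `stratified_split` are assumptions for illustration; the disclosure does not prescribe a particular splitting scheme.

```python
import random


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Split sample indices into training and validation sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # Reserve at least one sample of each class for validation.
        n_val = max(1, round(len(shuffled) * validation_fraction))
        val_idx.extend(shuffled[:n_val])
        train_idx.extend(shuffled[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical reference labels: 8 benign and 4 malignant reference subjects.
labels = ["benign"] * 8 + ["malignant"] * 4

train_idx, val_idx = stratified_split(labels)
print("training indices:", train_idx)
print("validation indices:", val_idx)
```

Because each reference subject contributes one individual reference data set, splitting by sample index here also keeps each subject in exactly one portion.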
- oversampling or undersampling correction can be made during training of the machine learning model.
- where a data set includes a greater number of samples identified as benign and a smaller number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples.
- feature importance values of the plurality of genes can be determined.
- the gene set can be selected.
- the gene set can be selected as predictors that are used to train the machine learning model.
- the gene set may be selected based at least in part on feature importance values.
- the feature importance values of the genes of the gene set are within top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, feature importance values of the plurality of genes.
- the feature importance of the genes of the gene set can have an accuracy greater than 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, or 90%. In some embodiments, the feature importance of the genes of the gene set can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or 90.
- the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the machine learning model include the genes of the gene set.
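Selecting a gene set from feature importance values, as described above, can be sketched as a rank-and-truncate step. The importance measure used here (absolute difference between class means, scaled by the overall standard deviation) is a simple stand-in chosen for illustration; the disclosure does not state that importance is computed this way, and the gene names and expression values are hypothetical.

```python
from statistics import mean, pstdev


def importance_scores(expression, labels, gene_names):
    """Score each gene by how strongly its mean expression differs between classes."""
    scores = {}
    for j, gene in enumerate(gene_names):
        benign = [row[j] for row, y in zip(expression, labels) if y == "benign"]
        malignant = [row[j] for row, y in zip(expression, labels) if y == "malignant"]
        # Fall back to 1.0 when a gene has no variation, to avoid dividing by zero.
        spread = pstdev(benign + malignant) or 1.0
        scores[gene] = abs(mean(malignant) - mean(benign)) / spread
    return scores


def select_top_genes(scores, k):
    """Return the k genes with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical reference data: rows are reference subjects, columns are genes.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
expression = [
    [1.0, 5.0, 2.0],
    [1.2, 5.2, 2.4],
    [3.0, 5.1, 2.6],
    [3.1, 5.1, 3.0],
]
labels = ["benign", "benign", "malignant", "malignant"]

scores = importance_scores(expression, labels, gene_names)
top_genes = select_top_genes(scores, k=2)
print(top_genes)
```

With a trained model, the scores would more typically come from the model itself (for example, tree-based importances or coefficient magnitudes), and the same ranking step would then be applied to them.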
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor
- the one or more clinical characteristics of the reference data set of step a’ include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the plurality of genes of the reference data set of step a’ include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the reference data set of step a’ include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor, the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to ovarian cancer.
- the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor, the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to kidney cancer.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to brain cancer.
- the solid tumor is a brain tumor, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor, the plurality of genes of the reference data set of step a’ contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the reference data set of step a’ are selected from a group of clinical characteristics related to brain cancer.
- Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof.
- feature importance values need not be calculated for each of the genes of the plurality of genes.
- the reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the reference biological sample is a blood sample or any derivative thereof.
- the reference biological sample is PBMCs or any derivative thereof.
- the reference biological sample is a lung biopsy sample, or any derivative thereof.
- the reference biological sample is a nasal fluid sample, or any derivative thereof.
- the reference biological sample is a saliva sample, or any derivative thereof.
- the reference biological sample is a urine sample, or any derivative thereof.
- the reference biological sample is a stool sample, or any derivative thereof.
- the reference biological sample is a CSF sample, or any derivative thereof.
- the reference subjects can be human.
- the machine learning model, e.g., of step b’, can be trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
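The five performance measures named above (accuracy, sensitivity, specificity, positive predictive value, and negative predictive value) follow from the counts of a two-class confusion matrix. The sketch below uses the standard definitions with "malignant" treated as the positive class; the example labels are hypothetical.

```python
def classification_metrics(true_labels, predicted_labels, positive="malignant"):
    """Compute standard binary classification metrics from paired labels."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical validation results for eight reference subjects.
true_labels = ["malignant"] * 3 + ["benign"] * 5
predicted_labels = ["malignant", "malignant", "benign",
                    "benign", "benign", "benign", "benign", "malignant"]

metrics = classification_metrics(true_labels, predicted_labels)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Unlike sensitivity and specificity, the predictive values depend on the proportion of malignant cases in the evaluated data, so values measured on a rebalanced data set may not carry over to a population with a different prevalence.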
- the present disclosure provides a method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant.
- the method can include any one of, any combination of, or all of steps a”, b”, c”, d” and e”.
- Step a” can include obtaining and/or providing a first reference data set.
- the first reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects.
- the first reference data set can contain a plurality of first individual reference data sets.
- a respective first individual reference data set of the plurality of first individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of the one or more clinical characteristics of the reference subject.
- the plurality of first individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different first individual reference data sets are obtained from different reference subjects.
- each of the first individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the reference solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different first individual reference data sets are obtained from different reference subjects.
- a first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
- a first machine learning model can be trained using the first reference data set to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics.
- the first machine learning model can be trained to infer whether the solid tumor is benign or malignant, based at least in part on the measurement data of the plurality of genes, and optionally the clinical characteristics data of the one or more clinical characteristics.
- the first machine learning model can be trained using a training data set containing a first portion of the first reference data set, and a validation data set containing a second portion of the first reference data set.
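The partition of the first reference data set into a training portion and a validation portion can be sketched as below. Stratifying by class, the 25% validation fraction, and the fixed seed are assumptions made for this illustration; the disclosure does not specify how the portions are chosen.

```python
import random


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Split sample indices into training and validation sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one sample of every class in the validation set.
        n_val = max(1, round(len(indices) * validation_fraction))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical cohort: 8 benign and 4 malignant reference subjects.
labels = ["benign"] * 8 + ["malignant"] * 4
train_idx, val_idx = stratified_split(labels)
```

Splitting by reference subject, as here, keeps every subject in exactly one of the two portions.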
- in step c”, feature importance values of one or more predictors of the first machine learning model can be determined.
- in step d”, A predictors of the first machine learning model can be selected based at least in part on the feature importance values, where A can be an integer from 3 to 2000, such as 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600,
- the A predictors can have the top A feature importance values; for example, in a non-limiting aspect, A can be 10, and the 10 predictors having the 10 highest feature importance values can be selected.
- the feature importance of the A predictors can have an accuracy greater than 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, or 90%.
- the feature importance of the A predictors can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or 90.
- the A predictors form top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model.
- the A predictors can include one or more genes, and/or optionally one or more clinical characteristics. While determining feature importance values for the one or more predictors has been described, this is merely a non-limiting illustrative example of a technique that can be practiced. In various embodiments, one or more feature selection techniques are used to determine the A predictors.
- Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof.
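Selecting the A predictors with the highest feature importance values can be sketched as follows. The importance values are invented for illustration, the predictor names are used only as example keys, and the alphabetical tie-break is an assumption rather than something the disclosure specifies.

```python
def select_top_predictors(importances, a):
    """Return the names of the `a` predictors with the highest importance."""
    # Sort by descending importance; break ties alphabetically for determinism.
    ranked = sorted(importances.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:a]]


# Hypothetical importance values for a mix of gene and clinical predictors.
importances = {
    "BCAT1": 0.31,
    "CD177": 0.12,
    "nodule_size": 0.27,
    "THBS1": 0.05,
    "age": 0.18,
    "QPCT": 0.07,
}
top3 = select_top_predictors(importances, 3)
```

Here `top3` is `["BCAT1", "nodule_size", "age"]`, so the selected predictors mix genes and clinical characteristics.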
- Step e” can include training a second machine learning model based at least in part on a second reference data set to obtain the trained machine learning model.
- the second reference data set can contain i) measurement data of the A predictors of the reference subjects, and ii) data regarding whether the solid tumors of the reference subjects are benign or malignant.
- the second reference data set can contain a plurality of second individual reference data sets.
- a respective second individual reference data set of the plurality of second individual reference data sets can include i) measurement data of the A predictors of a reference subject, and ii) data regarding whether the solid tumor of the reference subject is benign or malignant.
- the plurality of second individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different second individual reference data sets are obtained from different reference subjects. In some embodiments, each of the second individual reference data sets contains i) measurement data of the A predictors of one reference subject, and ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, wherein different second individual reference data sets are obtained from different reference subjects.
- Measurement data of the A predictors can include gene expression measurements in the reference sample of the one or more gene features of the A predictors, and/or optionally clinical characteristics data of the one or more clinical characteristic features of the A predictors.
- the trained machine learning model can infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors.
- the one or more gene features of the A predictors can form the gene set capable of classifying a solid tumor as benign or malignant.
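The two-stage workflow described above (score all predictors on the first reference data set, keep the top A, then train a second model on the A predictors only) can be sketched in a few lines. This is an illustration under stated assumptions: the importance score (absolute gap between class means) and the nearest-centroid classifier are simple stand-ins chosen for this sketch, and the feature names G1 to G3 and all values are hypothetical.

```python
from statistics import mean


def class_means(rows, labels, features):
    """Mean of each feature within each class."""
    means = {}
    for label in set(labels):
        group = [row for row, lab in zip(rows, labels) if lab == label]
        means[label] = {f: mean(row[f] for row in group) for f in features}
    return means


def importance_scores(rows, labels, features):
    """Stand-in importance: absolute gap between the two class means."""
    m = class_means(rows, labels, features)
    return {f: abs(m["malignant"][f] - m["benign"][f]) for f in features}


def train_centroid_model(rows, labels, features):
    """Stand-in second model: one centroid per class over the kept features."""
    return {"features": list(features),
            "centroids": class_means(rows, labels, features)}


def predict(model, row):
    def squared_distance(label):
        centroid = model["centroids"][label]
        return sum((row[f] - centroid[f]) ** 2 for f in model["features"])
    return min(sorted(model["centroids"]), key=squared_distance)


# First reference data set: hypothetical expression values per subject.
rows = [
    {"G1": 5.0, "G2": 1.0, "G3": 2.0},
    {"G1": 5.2, "G2": 1.1, "G3": 2.4},
    {"G1": 1.0, "G2": 1.2, "G3": 2.1},
    {"G1": 0.8, "G2": 0.9, "G3": 2.5},
]
labels = ["malignant", "malignant", "benign", "benign"]
features = ["G1", "G2", "G3"]

# Score predictors, keep the top A (A = 1 here), then retrain on those only.
scores = importance_scores(rows, labels, features)
selected = sorted(features, key=lambda f: -scores[f])[:1]
final_model = train_centroid_model(rows, labels, selected)
call = predict(final_model, {"G1": 4.7, "G2": 1.0, "G3": 2.2})
```

In this toy data only G1 separates the classes, so it is the single selected predictor, and the new sample is called malignant.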
- oversampling or undersampling correction can be made during training of the first and/or second machine learning model.
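One simple form of oversampling correction is random oversampling of the minority class until the classes are balanced. The sketch below assumes that approach; the function name, seed, and data are hypothetical, and the disclosure does not state which correction is used.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes balance."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement to fill the gap to the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical imbalanced training set: 4 benign, 1 malignant.
samples = [[1.0], [1.1], [0.9], [1.2], [5.0]]
labels = ["benign"] * 4 + ["malignant"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
```

Resampling is usually applied to the training portion only, so that the validation data keep their original class proportions.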
- the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor.
- the reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in Table 9.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer.
- the solid tumor is a kidney tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression with correlation coefficients (e.g.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5, as predictors.
- the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, as predictors.
- the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8, as predictors.
- the solid tumor is a lung tumor, and the A predictors include i) at least 2 genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof, as predictors.
- the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33 or 34 predictors selected from the group listed in Table 7.
- the solid tumor is a lung tumor, and the A predictors comprise the 34 predictors listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors consist of the 34 predictors listed in Table 7.
- the reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof.
- the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the trained machine learning model can infer whether a solid tumor is benign or malignant with a receiver operating characteristic (ROC) curve having an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
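The ROC AUC can be computed directly from model scores as the probability that a randomly chosen malignant sample scores higher than a randomly chosen benign one, with ties counted as one half. The sketch below uses that pairwise definition; the function name and the scores are hypothetical.

```python
def roc_auc(labels, scores, positive="malignant"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for lab, s in zip(labels, scores) if lab == positive]
    neg = [s for lab, s in zip(labels, scores) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # malignant sample ranked above benign sample
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical model scores (higher means more likely malignant).
labels = ["malignant", "malignant", "benign", "benign"]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

In this example three of the four malignant/benign pairs are ordered correctly, giving an AUC of 0.75. The pairwise loop is quadratic in sample count, which is acceptable for a sketch but not for large cohorts.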
- Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
- the trained machine learning model is a supervised machine learning algorithm or an unsupervised machine learning algorithm.
- the first and/or second machine-learning model is independently trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof.
- the first and/or second machine-learning model is independently trained using LOG.
- the first and/or second machine-learning model is independently trained using Ridge regression.
- the first and/or second machine-learning model is independently trained using Lasso regression. In some embodiments, the first and/or second machine-learning model is independently trained using GLM. In some embodiments, the first and/or second machine-learning model is independently trained using kNN. In some embodiments, the first and/or second machine-learning model is independently trained using SVM. In some embodiments, the first and/or second machine-learning model is independently trained using GBM. In some embodiments, the first and/or second machine-learning model is independently trained using RF. In some embodiments, the first and/or second machine-learning model is independently trained using NB. In some embodiments, the first and/or second machine-learning model is independently trained using EN regression.
- the first and/or second machine-learning model is independently trained using a neural network. In some embodiments, the first and/or second machine-learning model is independently trained using a deep learning algorithm. In some embodiments, the first and/or second machine-learning model is independently trained using LDA. In some embodiments, the first and/or second machine-learning model is independently trained using DTREE. In some embodiments, the first and/or second machine-learning model is independently trained using ADB.
- [0377] In an aspect, the present disclosure provides a method for treating cancer in a patient. In some embodiments, the patient has a solid tumor.
- the method can include any one of, any combination of, or all of steps a”’, b”’, c”’ and d”’.
- Step a”’ can include obtaining a data set containing i) gene expression measurements of a biological sample from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- the gene expression measurements can be obtained by assaying the biological sample.
- Step b”’ can include providing the data set as input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer.
- the inference can indicate whether the data set is indicative of the solid tumor of the patient being malignant or benign.
- Step c”’ can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer.
- the inference received as an output can indicate whether the solid tumor of the patient is malignant or benign.
- Step d”’ can include administering a treatment based on the determination that the patient has cancer.
- the treatment is administered based on the patient’s solid tumor being classified as malignant.
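The data flow of the computational steps (assemble the patient data set, provide it to a trained model, and receive the inference) can be sketched as below. Everything in this block is hypothetical: the logistic `toy_model`, its weights and bias, the 0.5 decision threshold, and the two-predictor input are invented for illustration and do not represent the trained model of the disclosure. The sketch only returns a classification; any treatment decision remains a clinical step outside the code.

```python
import math


def toy_model(inputs):
    """Hypothetical logistic model; weights and bias are invented."""
    weights = [0.8, 0.05]
    bias = -5.0
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


def infer_tumor_status(data_set, model, predictors, threshold=0.5):
    """Feed the patient's measurements to the model and return its inference."""
    missing = [name for name in predictors if name not in data_set]
    if missing:
        raise ValueError(f"missing predictors: {missing}")
    # Order the inputs exactly as the model expects them.
    probability = model([data_set[name] for name in predictors])
    label = "malignant" if probability >= threshold else "benign"
    return {"probability_malignant": probability, "classification": label}


# Hypothetical patient data set: one gene expression value and one
# clinical characteristic.
patient = {"BCAT1": 4.0, "age": 67}
result = infer_tumor_status(patient, toy_model, ["BCAT1", "age"])
```

Checking for missing predictors before calling the model avoids silently scoring an incomplete data set.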
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor, respectively.
- the cancer is lung cancer, and the solid tumor is a lung tumor.
- the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor.
- the cancer is ovarian cancer, and the solid tumor is an ovarian tumor.
- the cancer is kidney cancer, and the solid tumor is a kidney tumor.
- the cancer is brain cancer, and the solid tumor is a brain tumor.
- the cancer is lung cancer, and the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the optional one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the data set of step a”’ contains i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step a”’ are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the cancer is lung cancer
- the one or more clinical characteristics of the data set of step a’ include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a’”, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer
- the data set of step a”’ contains i) gene expression Attorney Docket No.225234-718601/PCT measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, the at least two lung disease-associated genes of step a’”, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a’” comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is lung cancer, the at least two lung disease-associated genes of step a”’ consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”’ consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
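As an illustration of the data set recited in the preceding embodiments, the following Python sketch assembles the 31 gene expression measurements and the three clinical characteristics into one fixed-order feature vector. The gene symbols are those recited above. The function name `build_feature_vector`, the argument names, the millimetre unit for nodule size, and the 0/1 encoding of the upper-lobe flag are assumptions made for this example and are not specified in the source.

```python
from typing import List, Mapping

# The 31 gene symbols recited in the embodiments above.
GENES_31 = [
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C",
    "MGST2", "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2",
    "C6orf120", "TMEM8A", "ASAP1-IT2", "C15orf54", "CD101", "FNBP1",
    "TECR", "PROK2", "SLC35B3", "TDRD9", "CLHC1", "LPL", "IFITM3",
    "OGFOD3", "EIF2B3", "TMEM65", "MKRN3",
]


def build_feature_vector(
    expression: Mapping[str, float],
    nodule_size_mm: float,   # assumed unit, not stated in the source
    age_years: float,
    in_upper_lobe: bool,
) -> List[float]:
    """Return gene measurements in GENES_31 order, then the clinical values."""
    missing = [gene for gene in GENES_31 if gene not in expression]
    if missing:
        raise KeyError(f"missing gene expression measurements: {missing}")
    features = [float(expression[gene]) for gene in GENES_31]
    # Hypothetical encoding: upper-lobe presence as 1.0, absence as 0.0.
    features += [float(nodule_size_mm), float(age_years),
                 1.0 if in_upper_lobe else 0.0]
    return features
```

Keeping the order fixed means the same vector layout can be used both when training a model and when querying it.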
- the cancer is pancreatic cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the cancer is pancreatic cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is pancreatic cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is ovarian cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the cancer is ovarian cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is ovarian cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is kidney cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the cancer is kidney cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is kidney cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is brain cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the cancer is brain cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the cancer is brain cancer, and the data set of step a”’ contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor, e.g. a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein.
- the inference from the machine learning model can include a confidence value between 0 and 1, such as, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the tumor is malignant, where higher confidence values may be correlated with a higher likelihood that the tumor is malignant.
- a malignant tumor may be characterized by the ability to metastasize or grow invasively, which may be in contrast to benign nodules.
- the biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof.
- the biological sample is a stool sample, or any derivative thereof.
- the biological sample is a CSF sample, or any derivative thereof.
- the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. The decision to perform the biopsy may depend on the confidence value of the inference. In certain embodiments, biopsy of the solid tumor of the patient is not performed.
- the machine-learning model, e.g. of step b”’, can generate the inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate the patient has cancer, and the patient having a benign solid tumor may indicate the patient does not have cancer.
- the machine-learning model of step b”’ can be trained according to a method described herein, e.g. according to the methods of training the machine-learning model of step b.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 80 % to about
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 80 % to about
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8 to about 1. In some embodiments, the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.99
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995.
- the machine learning model of step b”’ can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
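The performance measures recited above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) can all be computed from known labels and the model's confidence values. The following Python sketch shows one conventional way to do so, using only the standard library. The function names, the 0.5 decision threshold, and the encoding of malignant as 1 and benign as 0 are assumptions for illustration, not details taken from the source.

```python
from typing import Dict, Sequence


def summary_metrics(y_true: Sequence[int], y_score: Sequence[float],
                    threshold: float = 0.5) -> Dict[str, float]:
    """Threshold the scores, then derive the five confusion-matrix metrics."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0  # 1 = malignant (assumed)
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (malignant, benign) pairs ranked correctly.

    Ties between a malignant and a benign score count as half a pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute an AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike the other five measures, the AUC does not depend on the chosen threshold, which is why it is computed directly from the ranking of the scores.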
- the treatment is configured to treat a cancer of the patient.
- the treatment is configured to reduce a severity of a cancer of the patient.
- the treatment is configured to reduce a risk of having a cancer of the patient.
- the treatment can include one or more treatments of cancer.
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer.
- the cancer is lung cancer.
- the cancer is pancreatic cancer.
- the cancer is ovarian cancer.
- the cancer is kidney cancer.
- the cancer is brain cancer.
- the data set is indicative of the patient having lung cancer, and step d”’ can include administering to the patient a treatment for lung cancer.
- the data set is indicative of the patient having pancreatic cancer, and step d”’ can include administering to the patient a treatment for pancreatic cancer.
- the data set is indicative of the patient having ovarian cancer, and step d”’ can include administering to the patient a treatment for ovarian cancer.
- the data set is indicative of the patient having kidney cancer, and step d”’ can include administering to the patient a treatment for kidney cancer.
- the data set is indicative of the patient having brain cancer, and step d”’ can include administering to the patient a treatment for brain cancer.
- the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- the method can include any one of, any combination of, or all of steps w, x, y and z.
- Step w can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- the data set can be obtained from assaying the biological sample.
- Step x can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor.
- Step y can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor.
- Step z can include performing a biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor or the benign solid tumor.
- step z can include performing a biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor. The decision to perform the biopsy may depend on the confidence value of the inference.
- biopsy of the lung nodule of the patient is not performed.
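Steps w, x, y and z described above can be sketched as a short Python pipeline. This is a minimal illustration under stated assumptions: the model is represented by any callable that returns a confidence between 0 and 1 that the tumor is malignant, and the names `Inference`, `step_w`, `steps_x_y`, `step_z`, the 0.5 classification cutoff, and the `min_confidence` parameter are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass
class Inference:
    label: str         # "malignant" or "benign"
    confidence: float  # value between 0 and 1 that the tumor is malignant


def step_w(gene_expression: Mapping[str, float],
           clinical: Optional[Mapping[str, float]] = None) -> Dict[str, float]:
    """Step w: build the data set from gene measurements and optional clinical data."""
    if len(gene_expression) < 2:
        raise ValueError("at least two gene expression measurements are required")
    data_set = {name: float(value) for name, value in gene_expression.items()}
    if clinical:
        data_set.update({name: float(value) for name, value in clinical.items()})
    return data_set


def steps_x_y(data_set: Mapping[str, float],
              model: Callable[[Mapping[str, float]], float],
              cutoff: float = 0.5) -> Inference:
    """Steps x and y: pass the data set to the model and receive its inference."""
    confidence = float(model(data_set))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("model confidence must lie between 0 and 1")
    label = "malignant" if confidence >= cutoff else "benign"
    return Inference(label, confidence)


def step_z(inference: Inference, min_confidence: float = 0.5) -> bool:
    """Step z: recommend a biopsy only for a sufficiently confident malignant call."""
    return inference.label == "malignant" and inference.confidence >= min_confidence
```

Here step z returns a recommendation rather than performing anything, which reflects the statement that the decision to biopsy may depend on the confidence value and that in some embodiments no biopsy is performed.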
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
- the solid tumor can be a lung tumor, pancreatic tumor, kidney tumor, ovarian tumor, or a brain tumor.
- the solid tumor is a lung tumor.
- the solid tumor is a pancreatic tumor.
- the solid tumor is an ovarian tumor.
- the solid tumor is a kidney tumor.
- the solid tumor is a brain tumor.
- the solid tumor is a lung tumor, and the data set of step w contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the solid tumor is a lung tumor, and the one or more clinical characteristics of the data set of step w include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group listed in Table 6.
- the solid tumor is a lung tumor, and one or more clinical characteristics of the data set of step w, includes size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor, and the data set of step w contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
- the solid tumor is a lung tumor
- the data set of step w contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the solid tumor is a pancreatic tumor
- the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the solid tumor is a pancreatic tumor
- the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is a pancreatic tumor
- the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the solid tumor is an ovarian tumor
- the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the solid tumor is an ovarian tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is an ovarian tumor
- the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the solid tumor is a kidney tumor
- the data set of step w contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the solid tumor is a kidney tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a kidney tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the solid tumor is a brain tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the solid tumor is a brain tumor
- the data set of step w contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the solid tumor is a brain tumor
- the data set of step w contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g.
- the biological sample can be blood sample, isolated peripheral blood mononuclear cells (PBMCs), solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof.
- the biological sample is a saliva sample, or any derivative thereof.
- the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is CSF sample, or any derivative thereof.
- the machine learning model of step x can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant.
- the machine-learning model, e.g. of step x can be trained according to the methods described herein, e.g. as of the machine learning model of step b.
- Certain aspects are directed to a method for determining cancer in a patient. The method can include any one of, any combination of, or all of steps w’, x’, y’ and z’.
- Step w’ can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor of the patient as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient.
- the gene expression measurements can be obtained by assaying the biological sample.
- Step x’ can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer.
- Step y’ can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer.
- Step z’ can include electronically outputting a report indicating that the patient has, or does not have, cancer.
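The flow of steps w’, x’, y’ and z’ can be illustrated with a minimal Python sketch. It is an illustration only, not the disclosed implementation: the function names, the `Inference` type, the feature name `nodule_size_mm`, and the weights inside `toy_model` are all hypothetical, and `toy_model` merely stands in for a trained machine-learning model.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Inference:
    has_cancer: bool
    confidence: float  # value between 0 and 1


def build_data_set(expression: Dict[str, float],
                   clinical: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Step w': assemble gene expression measurements and optional clinical data."""
    if len(expression) < 2:
        raise ValueError("at least two gene expression measurements are required")
    data_set = dict(expression)
    if clinical:
        data_set.update(clinical)
    return data_set


def toy_model(data_set: Dict[str, float]) -> Inference:
    """Hypothetical stand-in for a trained model; the weights are made up."""
    weights = {"BCAT1": 0.8, "TDRD9": -0.5, "nodule_size_mm": 0.05}
    score = sum(w * data_set.get(name, 0.0) for name, w in weights.items())
    probability = 1.0 / (1.0 + math.exp(-score))
    return Inference(has_cancer=probability >= 0.5, confidence=probability)


def run_method(expression: Dict[str, float],
               clinical: Optional[Dict[str, float]],
               model: Callable[[Dict[str, float]], Inference]) -> str:
    data_set = build_data_set(expression, clinical)  # step w'
    inference = model(data_set)                      # steps x' and y'
    status = "has cancer" if inference.has_cancer else "does not have cancer"
    # Step z': the report is returned as text here; a real system might render or store it.
    return f"Report: patient {status} (model confidence {inference.confidence:.2f})"


report = run_method({"BCAT1": 2.0, "TDRD9": 1.0}, {"nodule_size_mm": 10.0}, toy_model)
print(report)
```

Passing the model as a callable keeps the sketch independent of any particular algorithm, so any trained classifier exposing the same interface could be substituted.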
- the gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
- the cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor, respectively.
- the cancer is lung cancer, and the solid tumor is a lung tumor.
- the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor.
- the cancer is ovarian cancer, and the solid tumor is an ovarian tumor.
- the cancer is kidney cancer, and the solid tumor is a kidney tumor.
- the cancer is brain cancer, and the solid tumor is a brain tumor.
- the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5.
- the cancer is lung cancer
- the at least two lung disease-associated genes of the data set of step w’ include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
- the cancer is lung cancer
- the at least two lung disease-associated genes of step w’ are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3.
- the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’ are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes, 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group listed in Table 6.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes size of the nodule.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes age of the patient.
- the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes presence of the nodule in the lung upper lobe.
- the cancer is lung cancer
- the one or more clinical characteristics of the dataset of step w’ include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer
- the data set of step w’ contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
- the cancer is lung cancer
- the data set of step w’ contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
- the cancer is pancreatic cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant.
- the cancer is pancreatic cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is pancreatic cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
- the cancer is ovarian cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant.
- the cancer is ovarian cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is ovarian cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
- the cancer is kidney cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant.
- the cancer is kidney cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is kidney cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
- the cancer is brain cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant.
- the cancer is brain cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the cancer is brain cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
- the gene set capable of classifying a solid tumor e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein.
- the biological sample can be blood sample, isolated peripheral blood mononuclear cells (PBMCs), solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof.
- the biological sample is a blood sample or any derivative thereof.
- the biological sample is PBMCs or any derivative thereof.
- the biological sample is a lung biopsy sample, or any derivative thereof.
- the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is CSF sample, or any derivative thereof.
- the method can determine whether the patient has or does not have cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
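The accuracy, sensitivity, specificity, positive predictive value and negative predictive value referred to above follow the standard confusion-matrix definitions. The sketch below shows one way they could be computed from known outcomes and model calls; the function name and the label coding (1 for cancer, 0 for no cancer) are assumptions made for illustration.

```python
from typing import Dict, Sequence


def diagnostic_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute standard diagnostic metrics; labels are 1 (cancer) or 0 (no cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


metrics = diagnostic_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(metrics)
```

Note that positive and negative predictive values depend on the prevalence of cancer in the evaluated cohort, whereas sensitivity and specificity do not.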
- the machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
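For context, the area under a ROC curve equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The following sketch computes the AUC directly from that pairwise definition; the function name and the example scores are illustrative assumptions, and the quadratic loop is chosen for readability rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute an AUC")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

An AUC of 0.50 corresponds to chance-level ranking, and 1.0 to perfect separation of the two classes.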
- the inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the patient has cancer. Higher confidence values may be correlated with a higher likelihood that the patient has cancer.
- the machine-learning model, e.g. of step x’, can generate an inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate that the patient has cancer, and the patient having a benign solid tumor may indicate that the patient does not have lung cancer.
- the machine-learning model can be trained according to a method described herein, e.g. according to the methods of training of the machine learning model of step b.
- the present disclosure provides a computer system for assessing a solid tumor of a subject, containing: a database or other suitable data storage system that is configured to store a dataset containing i) gene expression measurements of a biological sample obtained or derived from the subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (ii) electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- Computer-implemented methods as described herein may be executed on computer systems such as those described above.
- a computer system may comprise one or more processors and one or more memory units that collectively store computer-readable executable instructions that, as a result of execution, cause the one or more processors to collectively perform the programmed steps described above.
- a computer system as described herein may comprise an assay device communicatively coupled to a personal computer.
- the data set can be a data set (e.g. of step a) described herein.
- the biological sample can be a biological sample described herein.
- the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.
- the present disclosure provides one or more non-transitory computer readable media collectively containing machine-executable code that, upon execution by one or more computer processors, causes the one or more computer processors to perform a method for assessing a solid tumor of a subject, the method containing: (a) obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from a subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; (b) analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (c) electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- the data set can be a data set (e.g. of step a) described herein.
- the biological sample can be a biological sample described herein.
- FIG.10 illustrates an overview of an example method 1000 for assessing a solid tumor of a subject.
- the method 1000 may comprise assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject, as in operation 1002.
- the method 1000 may comprise analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, as in operation 1004.
- the method 1000 may comprise electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor, as in operation 1006.
- the data set can be a data set (e.g. of step b) described herein.
- Methods of the present disclosure may comprise applying a trained machine learning algorithm to gene expression data (e.g., acquired by RNA-Seq, Ampli-seq, or the like) and optionally clinical characteristics data of a subject, to assess a solid tumor of the subject.
- the trained machine learning algorithm may comprise a machine learning based classifier, configured to process the gene expression data and optionally clinical characteristics data to assess the solid tumor (e.g., determine whether a solid tumor is malignant or benign).
- the machine learning classifier may be trained using clinical datasets, e.g. reference data sets from one or more cohorts of subjects, e.g., using gene expression data and optionally clinical health data, e.g. clinical characteristics data of the subjects as inputs and known clinical health outcomes (e.g., a solid tumor that is malignant or benign) of the subjects as outputs to the machine learning classifier.
- Examples of machine learning algorithms may include a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB) or any combination thereof, or another supervised learning algorithm or unsupervised learning algorithm for classification and regression.
- the machine learning classifier may be trained using one or more reference datasets corresponding to subject data (e.g., gene expression data and optionally clinical health data).
- Reference datasets used for training machine learning classifiers may be generated from, for example, one or more cohorts of patients having common clinical characteristics (features) and clinical outcomes (labels).
- Reference datasets may comprise a set of features and labels corresponding to the features.
- Features may correspond to algorithm inputs comprising subject data (e.g., gene expression data and optionally clinical health data, e.g. clinical characteristics data).
- Features may comprise clinical characteristics such as, for example, certain ranges, categories, or levels of gene expression data and optionally clinical health data.
- Features may comprise subject information such as patient age, patient medical history, other medical conditions, current or past medications, size of the nodule, presence of the nodule in the lung upper lobe and/or time since the last observation.
- a set of features collected from a given patient at a given time point may collectively serve as a signature, which may be indicative of clinical health outcomes (e.g., a solid tumor that is malignant or benign) of the subject at the given time point.
- ranges of subject data may be expressed as a plurality of disjoint continuous ranges of continuous measurement values
- categories of subject data e.g., gene expression data and/or clinical health data
- Clinical characteristics may also include clinical labels indicating the subject’s health history, such as a diagnosis of a disease or disorder, a previous administering of a clinical treatment (e.g., a drug, a surgical treatment, chemotherapy, radiotherapy, immunotherapy, etc.), behavioral factors, or other health status (e.g., hypertension or high blood pressure, hyperglycemia or high blood glucose, hypercholesterolemia or high blood cholesterol, history of allergic reaction or other adverse reaction, etc.).
- Clinical characteristics data for the clinical characteristic AGE can be the age of the patient.
- Clinical characteristics data for the clinical characteristic SEX can be the sex of the patient.
- Clinical characteristics data for the clinical characteristic presence of the nodule in the lung upper lobe (NCNUPYN) can be yes or no.
- Clinical characteristics data for the clinical characteristic smoking status (MHTBSTAT) can be past or current.
- Clinical characteristics data for the clinical characteristic chronic obstructive pulmonary disease (MHCPDYN) can be yes or no.
- Clinical characteristics data for the clinical characteristic lung nodule spiculated (NCNMYN) can be yes or no.
- Clinical characteristics data for the clinical characteristic emphysema (MHEMPYN) can be yes or no.
- Labels may comprise clinical outcomes such as, for example, a solid tumor that is malignant or benign.
- the machine learning classifier algorithm may process the input features to generate output values comprising one or more classifications, one or more predictions, or a combination thereof.
- classifications or predictions may include a binary classification of a solid tumor, a classification between a group of categorical labels (e.g., ‘malignant solid tumor’ and ‘benign solid tumor’), a likelihood (e.g., relative likelihood or probability) of having a malignant solid tumor or benign solid tumor, and a confidence interval for any numeric predictions.
- Various machine learning techniques may be cascaded such that the output of a machine learning technique may also be used as input features to subsequent layers or subsections of the machine learning classifier.
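One way to picture such cascading is a two-stage arrangement in which the output of a first model becomes an input feature of a second model. The sketch below assumes two logistic stages with hypothetical weights; the stage structure, names, and parameters are illustrative assumptions, not the disclosed classifier.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_stage(weights, bias, features):
    """Apply one logistic stage to a feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def cascaded_predict(expression, clinical, stage1, stage2):
    """Feed the stage-1 output, together with clinical features, into stage 2.

    stage1 and stage2 are dicts with hypothetical "weights" and "bias" entries.
    """
    stage1_score = logistic_stage(stage1["weights"], stage1["bias"], expression)
    stage2_input = [stage1_score] + list(clinical)
    return logistic_stage(stage2["weights"], stage2["bias"], stage2_input)
```

Here the second stage has one more weight than there are clinical features, because the first-stage score is prepended to its input.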
- datasets may be sufficiently large to generate statistically significant classifications or predictions. In some cases, datasets are annotated or labeled.
- Datasets may be split into subsets (e.g., discrete or overlapping), such as a training dataset, a development dataset, and a test dataset. For example, a dataset may be split into a training dataset comprising 80% of the dataset, a development dataset comprising 10% of the dataset, and a test dataset comprising 10% of the dataset.
- the training dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset.
- the development dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset.
- the test dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset.
- Training sets (e.g., training datasets) may be selected by random sampling of a set of data corresponding to one or more patient cohorts to ensure independence of sampling.
- training sets may be selected by proportionate sampling of a set of data corresponding to one or more patient cohorts to ensure independence of sampling.
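The 80%/10%/10% split with random sampling described above can be sketched as follows. This is a minimal illustration assuming the records fit in memory and that simple random (not proportionate) sampling is wanted; the function name and the fixed seed are assumptions made for reproducibility of the example.

```python
import random


def split_dataset(records, train_frac=0.8, dev_frac=0.1, seed=0):
    """Randomly split records into disjoint training, development, and test subsets."""
    if not (0 < train_frac < 1 and 0 <= dev_frac < 1 and train_frac + dev_frac < 1):
        raise ValueError("fractions must leave a non-empty share for the test subset")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Proportionate sampling could be approximated by applying the same function separately to the malignant and benign records and concatenating the results.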
- Reference datasets may be split into subsets (e.g., discrete or overlapping), such as a training dataset, and a validation dataset.
- a reference dataset may be split into a training dataset containing 80% of the dataset, and a validation dataset containing 20% of the dataset.
- the training dataset may contain 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%, or any values or ranges therebetween, of the reference dataset.
- the validation dataset may contain 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95%, or any values or ranges therebetween, of the reference dataset. Cross-validation with 2, 2.5, 5, or 10 folds, or any values or ranges therebetween, can be used.
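For the cross-validation mentioned above, a minimal sketch of generating training and validation indices for an integer number of folds is shown below. It covers only integer fold counts and contiguous folds; the function name is an assumption, and in practice the samples would typically be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size
```

Each sample appears in exactly one validation fold, so every sample is used for validation once and for training k - 1 times.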
- different performance metrics may be generated. For example, an area under the receiver-operating curve (AUROC) may be used to determine the diagnostic capability of the machine learning classifier.
- the machine learning classifier may use classification thresholds which are adjustable, such that specificity and sensitivity are tunable, and the receiver-operating curve (ROC) can be used to identify the different operating points corresponding to different values of specificity and sensitivity.
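The relationship between adjustable thresholds, operating points, and the AUROC can be sketched in a few lines. This illustration assumes labels coded as 1 for malignant and 0 for benign, scores where higher means more likely malignant, and at least one sample of each class; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        # Sensitivity is tp / positives; specificity is 1 - fp / negatives.
        points.append((fp / negatives, tp / positives))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Each point corresponds to one candidate threshold, so choosing an operating point amounts to picking the threshold whose sensitivity and specificity best fit the clinical use.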
- a “false negative” may refer to an outcome in which a solid tumor of a subject is incorrectly classified as a benign solid tumor.
- a “true negative” may refer to an outcome in which a solid tumor of a subject is correctly classified as a benign solid tumor.
- the machine learning classifier may be trained until certain predetermined conditions for accuracy or performance are satisfied, such as having minimum desired values corresponding to diagnostic accuracy measures.
- the diagnostic accuracy measure may correspond to prediction of a likelihood of a solid tumor being malignant or benign.
- diagnostic accuracy measures may include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, area under the precision-recall curve (AUPRC), and area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) corresponding to the diagnostic accuracy of determining whether a solid tumor is malignant or benign.
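Several of these measures follow directly from the counts of true and false positives and negatives. The sketch below assumes malignant is the positive class (coded 1) and benign the negative class (coded 0), consistent with the definitions of false negative and true negative above; the function name is an assumption.

```python
def diagnostic_metrics(labels, predictions):
    """Compute sensitivity, specificity, PPV, NPV, and accuracy from binary calls."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of malignant cases in the evaluated cohort, so they are not directly comparable across cohorts with different prevalence.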
- such a predetermined condition may be that the sensitivity of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a predetermined condition may be that the specificity of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a predetermined condition may be that the positive predictive value (PPV) of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a predetermined condition may be that the negative predictive value (NPV) of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a predetermined condition may be that the area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) of determining whether a solid tumor is malignant or benign comprises a value of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
- such a predetermined condition may be that the area under the precision-recall curve (AUPRC) of determining whether a solid tumor is malignant or benign comprises a value of at least about 0.10, at least about 0.15, at least about 0.20, at least about 0.25, at least about 0.30, at least about 0.35, at least about 0.40, at least about 0.45, at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
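Checking whether such predetermined conditions are satisfied can be sketched as a comparison of measured values against minimum targets. The metric names and target values below are hypothetical examples, and the functions are an illustrative assumption rather than the disclosed training procedure.

```python
def unmet_targets(metrics, targets):
    """Return the names of metrics that are missing, undefined, or below target."""
    return [name for name, minimum in targets.items()
            if not metrics.get(name, float("-inf")) >= minimum]


def training_may_stop(metrics, targets):
    """True when every predetermined minimum has been reached."""
    return not unmet_targets(metrics, targets)


if __name__ == "__main__":
    measured = {"sensitivity": 0.92, "specificity": 0.78, "auroc": 0.88}
    minimums = {"sensitivity": 0.90, "specificity": 0.80, "auroc": 0.85}
    print(unmet_targets(measured, minimums))
```

Writing the comparison as `not value >= minimum` means a NaN metric (for example, an undefined PPV) is treated as unmet rather than silently passing.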
- the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a positive predictive value (PPV) of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with an area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
- the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with an area under the precision-recall curve (AUPRC) of at least about 0.10, at least about 0.15, at least about 0.20, at least about 0.25, at least about 0.30, at least about 0.35, at least about 0.40, at least about 0.45, at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
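One common estimate of the AUPRC is average precision, which averages the precision observed at the rank of each malignant sample. The sketch below uses that estimator as a stand-in; it assumes labels coded 1 for malignant and 0 for benign, at least one malignant sample, and no tied scores. The disclosure does not specify which AUPRC estimator is used.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / total_positives
```

A classifier that assigns scores at random has an expected average precision close to the fraction of malignant samples, which is why AUPRC targets can sensibly start well below 0.50 when malignant cases are rare.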
- FIG.11 shows a computer system 1101 that is programmed or otherwise configured to implement methods provided herein.
- the computer system 1101 can regulate various aspects of the present disclosure, such as, for example, assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- the computer system 1101 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 1101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1105, which can be a single core or multi core processor, or a plurality of processors for parallel processing.
- the computer system 1101 also includes memory or memory location 1110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1115 (e.g., hard disk), communication interface 1120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1125, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 1110, storage unit 1115, interface 1120 and peripheral devices 1125 are in communication with the CPU 1105 through a communication bus (solid lines), such as a motherboard.
- the storage unit 1115 can be a data storage unit (or data repository) for storing data.
- the computer system 1101 can be operatively coupled to a computer network (“network”) 1130 with the aid of the communication interface 1120.
- the network 1130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 1130 in some cases is a telecommunication and/or data network.
- the network 1130 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- one or more computer servers may enable cloud computing over the network 1130 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud.
- the network 1130 in some cases with the aid of the computer system 1101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1101 to behave as a client or a server.
- the CPU 1105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 1110.
- the instructions can be directed to the CPU 1105, which can subsequently program or otherwise configure the CPU 1105 to implement methods of the present disclosure. Examples of operations performed by the CPU 1105 can include fetch, decode, execute, and writeback.
- the CPU 1105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1101 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
- the storage unit 1115 can store files, such as drivers, libraries and saved programs.
- the storage unit 1115 can store user data, e.g., user preferences and user programs.
- the computer system 1101 in some cases can include one or more additional data storage units that are external to the computer system 1101, such as located on a remote server that is in communication with the computer system 1101 through an intranet or the Internet.
- the computer system 1101 can communicate with one or more remote computer systems through the network 1130.
- the computer system 1101 can communicate with a remote computer system of a user.
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 1101 via the network 1130.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1101, such as, for example, on the memory 1110 or electronic storage unit 1115.
- the machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1105. In some cases, the code can be retrieved from the storage unit 1115 and stored on the memory 1110 for ready access by the processor 1105. In some situations, the electronic storage unit 1115 can be precluded, and machine-executable instructions are stored on memory 1110.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 1101 can include or be in communication with an electronic display 1135 that comprises a user interface (UI) 1140.
- user interfaces include, without limitation, a graphical user interface (GUI) and web-based user interface.
- the computer system can include a graphical user interface (GUI) configured to display, for example, subject data, identification of a solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and/or predictions or assessments generated from subject data.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1105.
- the algorithm can, for example, assay a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor.
- a method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy comprising: a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or
- Embodiment 2. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
- Embodiment 3. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- Embodiment 4. The method of embodiment 1, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- Embodiment 5. The method of embodiment 1, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- Embodiment 6 The method of embodiment 1, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
- Embodiment 7. The method of embodiment 1 or 3, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Embodiment 8. The method of embodiment 1 or 4, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Embodiment 10 The method of embodiment 1 or 6, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- a method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes,
- Embodiment 12 The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney or brain tumor.
- Embodiment 13 The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
- Embodiment 14 The method of embodiment 11, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
- Embodiment 15 The method of embodiment 11, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
- Embodiment 16 The method of embodiment 11, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
- Embodiment 17 The method of embodiment 11 or 13, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Embodiment 18 The method of embodiment 11 or 14, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Embodiment 19. The method of embodiment 11 or 15, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- Embodiment 20 The method of embodiment 11 or 16, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- Embodiment 21 The method of any one of embodiments 11 to 20, wherein the A predictors have top 5 to 200 feature importance values.
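Selecting predictors by feature importance can be sketched as ranking the genes by an importance value and keeping the top entries. The gene names and importance values below are hypothetical placeholders, and how the importances are obtained (for example, from a fitted tree ensemble) is outside this sketch.

```python
def top_predictors(importances, n):
    """Return the n feature names with the highest importance values."""
    if n < 1:
        raise ValueError("n must be at least 1")
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


if __name__ == "__main__":
    # Hypothetical importance values for illustration only.
    example = {"GENE_A": 0.31, "GENE_B": 0.05, "GENE_C": 0.22, "GENE_D": 0.12}
    print(top_predictors(example, 2))
```

In a setting like the one described, n would fall somewhere in the 5 to 200 range, and a second model could then be trained on only the selected predictors.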
- Embodiment 25 The method of any one of embodiments 11 to 23, wherein the trained machine learning model has a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 27 The method of any one of embodiments 11 to 25, wherein the trained machine learning model has a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- the first machine learning model and the second machine learning model are independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
- a method for assessing a solid tumor of a patient comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of embodiment 1, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor.
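Steps a) through d) above can be pictured as a small pipeline: obtain the expression data, pass it to a trained model, receive the inference, and emit a report. The sketch below assumes a hypothetical logistic model with made-up gene names and weights, a decision threshold of 0.5, and a JSON report; none of these details are taken from the disclosure.

```python
import json
import math


def infer_malignancy(model, expression):
    """Return a confidence value between 0 and 1 that the solid tumor is malignant."""
    z = model["bias"] + sum(weight * expression[gene]
                            for gene, weight in model["weights"].items())
    return 1.0 / (1.0 + math.exp(-z))


def build_report(patient_id, confidence, threshold=0.5):
    """Serialize the classification as a JSON report string."""
    label = "malignant solid tumor" if confidence >= threshold else "benign solid tumor"
    return json.dumps({
        "patient_id": patient_id,
        "classification": label,
        "confidence_malignant": round(confidence, 4),
    })


if __name__ == "__main__":
    # Hypothetical trained parameters and measurements, for illustration only.
    model = {"bias": 0.0, "weights": {"GENE_A": 1.0, "GENE_B": -1.0}}
    sample = {"GENE_A": 2.0, "GENE_B": 0.0}
    print(build_report("patient-001", infer_malignancy(model, sample)))
```

The threshold is kept as a parameter because, as discussed for the ROC curve, moving it trades sensitivity against specificity.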
- Embodiment 30 The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney or brain tumor.
- Embodiment 31 The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 3, or 7, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant.
- Embodiment 32 The method of embodiment 29, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of embodiments 1, 4, or 8, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant.
- Embodiment 34 The method of embodiment 29, wherein the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set of embodiments 1, 6, or 10, wherein the gene set is capable of classifying the kidney tumor as benign or malignant.
- Embodiment 35 The method of embodiment 29 or 31, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
- Embodiment 36 The method of embodiment 29 or 32, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
- Embodiment 37 The method of embodiment 29 or 33, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
- Embodiment 38 The method of embodiment 29 or 34, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
- Embodiment 39 Embodiment 39.
- Embodiment 40 The method of any one of embodiments 29 to 39, wherein the patient has cancer.
- Embodiment 41 The method of any one of embodiments 29 to 39, wherein the patient does not have cancer.
- Embodiment 42 The method of any one of embodiments 29 to 39, wherein the patient is at an elevated risk of having cancer.
- Embodiment 43 The method of any one of embodiments 29 to 39, and 42, wherein the patient is asymptomatic for cancer.
- Embodiment 44 Embodiment 44.
- Embodiment 45 The method of any one of embodiments 29 to 44, further comprising administering a treatment based on the patient’s solid tumor being classified as malignant.
- Embodiment 46 The method of embodiment 45, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
- Embodiment 47 The method of any one of embodiments 29 to 46, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant.
- Embodiment 48.
- Embodiment 49 comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- any one of embodiments 29 to 48 comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- any one of embodiments 29 to 49 comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- any one of embodiments 29 to 50 comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- any one of embodiments 29 to 51 comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
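The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited above can all be derived from one confusion matrix. The following is a minimal sketch, assuming "malignant" is treated as the positive class; the function name and the toy labels are illustrative and do not come from the source.

```python
import math


def classification_metrics(y_true, y_pred, positive="malignant"):
    """Return accuracy, sensitivity, specificity, PPV and NPV for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # malignant correctly called malignant
            else:
                fp += 1  # benign called malignant
        else:
            if truth == positive:
                fn += 1  # malignant called benign
            else:
                tn += 1  # benign correctly called benign

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy labels for illustration only (not data from the source).
y_true = ["malignant"] * 4 + ["benign"] * 4
y_pred = ["malignant", "malignant", "malignant", "benign",
          "benign", "benign", "benign", "malignant"]
metrics = classification_metrics(y_true, y_pred)
```

With these toy labels there are 3 true positives, 1 false negative, 3 true negatives, and 1 false positive, so every metric evaluates to 0.75.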
- ROC receiver operating characteristic
- AUC Area-Under-Curve
- a method for treating cancer in a patient having a solid tumor comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on the patient’s tumor being classified as a malignant tumor.
- Embodiment 55 The method of embodiment 54, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
- Embodiment 56 A system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the dataset as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model
- Embodiment 57 A non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a
- Embodiment 58 A method for obtaining a gene set capable of classifying whether a patient has cancer, the method comprising: (a) providing a dataset as an input to a machine learning classifier, said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of the gene modules form the features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer.
- Embodiment 59 The method of embodiment 58, wherein the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm.
- Embodiment 60 The method of embodiment 58 or 59, wherein the feature selection comprises starting from a featureless model, and sequentially adding the next best feature using leave-one-group-in importance (LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved.
- Embodiment 61 The method of embodiment 60, wherein the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001.
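The sequential procedure of embodiments 60 and 61 can be sketched as a greedy forward search over gene modules. This is only an illustration under stated assumptions: `evaluate_mmce` is a hypothetical callable that returns the mean misclassification error of a model trained on the given modules, `baseline_mmce` stands in for the featureless model, and the module names and gains in the toy evaluator are invented for the example.

```python
def sequential_group_selection(groups, evaluate_mmce, baseline_mmce, threshold=0.001):
    """Greedy forward selection of feature groups (for example, gene modules).

    Starts from the featureless model and keeps adding the group that gives
    the lowest MMCE, stopping once the improvement no longer exceeds
    `threshold`.
    """
    selected = []
    best_mmce = baseline_mmce
    remaining = list(groups)
    while remaining:
        # Score each candidate group when added to the groups selected so far.
        scores = {g: evaluate_mmce(selected + [g]) for g in remaining}
        candidate = min(scores, key=scores.get)
        if best_mmce - scores[candidate] <= threshold:
            break  # no improvement over the threshold: stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_mmce = scores[candidate]
    return selected, best_mmce


# Toy evaluator (hypothetical numbers): each module lowers the error by a fixed gain.
toy_gain = {"module_a": 0.2, "module_b": 0.1, "module_c": 0.0}


def toy_evaluate_mmce(modules):
    return 0.5 - sum(toy_gain[m] for m in modules)


chosen, final_mmce = sequential_group_selection(
    ["module_a", "module_b", "module_c"], toy_evaluate_mmce, baseline_mmce=0.5
)
```

In the toy run, `module_a` and `module_b` are added in that order and `module_c` is rejected because it brings no improvement. In practice the evaluator would wrap cross-validated training of the chosen learner.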
- Embodiment 62 The method of any one of embodiments 58 to 61, wherein the dataset is a batch corrected dataset.
- Embodiment 63 The method of any one of embodiments 58 to 62, wherein the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurement from the plurality of reference samples, of genes of an initial gene-set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules.
- Embodiment 64 The method of embodiment 63, wherein the M genes are clustered based on protein-protein interactions of the proteins encoded by the M genes.
- Embodiment 65 Embodiment 65.
- Embodiment 66 The method of any one of embodiments 63 to 65, wherein M is 500 to 10000.
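The gene-selection step of embodiment 63 can be sketched as ranking genes by their expression variance across the reference samples and keeping the top M. This sketch assumes population variance as the variability measure, which the source does not specify; the gene identifiers and values are placeholders, and the subsequent clustering into modules is not shown.

```python
from statistics import pvariance


def top_variable_genes(expression, m):
    """Return the m genes with the highest variance across samples.

    `expression` maps a gene identifier to its expression values,
    one value per reference sample.
    """
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:m]


# Placeholder expression values for three hypothetical genes.
toy_expression = {
    "GENE_1": [1.0, 1.0, 1.0],   # constant across samples
    "GENE_2": [0.0, 5.0, 10.0],  # highly variable
    "GENE_3": [2.0, 3.0, 4.0],   # mildly variable
}
selected_genes = top_variable_genes(toy_expression, m=2)
```

For the value range in embodiment 66, `m` would be set between 500 and 10000.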
- Embodiment 67 The method of any one of embodiments 58 to 66, further comprising analyzing a patient data set comprising or derived from gene expression measurement of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurement is obtained from a biological sample obtained or derived from the patient.
- Embodiment 68 Embodiment 68.
- the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157,
- Embodiment 69 The method of embodiment 67, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- Embodiment 70 The method of any one of embodiments 67 to 69, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 71 Embodiment 71.
- Embodiment 73 The method of any one of embodiments 67 to 71, wherein the method classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 74 The method of any one of embodiments 67 to 72, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 75 The method of any one of embodiments 67 to 74, wherein analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set.
- Embodiment 76 The method of embodiment 75, further comprising: a) receiving, as an output of the machine-learning model, the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference.
- Embodiment 77 The method of embodiment 75 or 76, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- Embodiment 78 The method of any one of embodiments 75 to 77, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer.
- Embodiment 80 The method of any one of embodiments 67 to 79, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer.
- Embodiment 81 The method of embodiment 80, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
- Embodiment 82 The method of any one of embodiments 58 to 81, wherein the cancer is a solid cancer.
- Embodiment 83 Embodiment 83.
- the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer.
- Embodiment 84 The method of any one of embodiments 58 to 81, wherein the cancer is a blood cancer.
- the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post- transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- Embodiment 86 A method for classifying whether a patient has cancer comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) of any one of embodiments 58-66 as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine-learning model, the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient.
- Embodiment 87.
- the patient dataset comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 11
- Embodiment 88 The method of embodiment 86, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c).
- Embodiment 89 The method of any one of embodiments 86 to 88, wherein the patient dataset is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof.
- Embodiment 91 The method of any one of embodiments 86 to 90, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
- Embodiment 92 The method of any one of embodiments 86 to 91, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 93 Embodiment 93.
- Embodiment 95 The method of any one of embodiments 86 to 93, wherein the method classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 96 The method of any one of embodiments 86 to 94, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 97 The method of any one of embodiments 86 to 95, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
- Embodiment 98 The method of any one of embodiments 86 to 97, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof.
- Embodiment 99 The method of any one of embodiments 86 to 98, wherein the cancer is a solid cancer.
- Embodiment 100 The method of embodiment 99, wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, meta
- Embodiment 101 The method of any one of embodiments 86 to 98, wherein the cancer is a blood cancer.
- Embodiment 102 The method of embodiment 101, wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
- Embodiment 103.
- Embodiment 104 The method of embodiment 103, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
- RNA-Seq data were analyzed. Of the samples in the biomarker dataset, 80 had a diagnosis of a benign lung nodule and 72 had a diagnosis of a malignant lung nodule. Gene expression in whole blood samples from the subjects was measured using RNA-Seq.
- a training dataset comprising lung nodule samples from 604 subjects was used to train a machine learning algorithm. Gene expression measurements of whole blood samples from the subjects were analyzed. Subsequently, a validation dataset comprising lung nodule samples from 487 subjects was used to validate the machine learning algorithm. The samples were analyzed using RNA-Seq techniques.
- GBM Gradient boosting machines
- LOG Logistic regression model
- SVM Support vector machines
- RF Random forest
- GLM Generalized linear model
- kNN k-nearest neighbors
- NB Naïve Bayes
- EN Elastic Networks
- the biomarker dataset comprised 80 lung nodule samples that had a diagnosis of a benign lung nodule and 72 samples that had a diagnosis of a malignant lung nodule.
- a total of 1,430 genes were initially identified to be differentially expressed between malignant lung nodule samples and benign lung nodule samples.
- a Log2 ratio of gene expression of the differentially expressed genes was used to determine the optimal set of genes. The Log2 ratio was defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
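As a worked illustration of the Log2 ratio, the sketch below reads it as the base-2 logarithm of T/R, which is an interpretation of the definition above. The function name is illustrative, and rejecting non-positive levels is a choice made for this sketch; the source does not say how zero counts were handled.

```python
import math


def log2_ratio(test_level, reference_level):
    """Base-2 logarithm of T/R for one gene.

    test_level (T): expression level in the testing sample.
    reference_level (R): expression level in the reference sample.
    """
    if test_level <= 0 or reference_level <= 0:
        raise ValueError("expression levels must be positive to take a log ratio")
    return math.log2(test_level / reference_level)


up_regulated = log2_ratio(8.0, 2.0)    # T is four times R
down_regulated = log2_ratio(1.0, 4.0)  # T is a quarter of R
unchanged = log2_ratio(3.0, 3.0)       # T equals R
```

A four-fold increase gives +2, a four-fold decrease gives -2, and equal levels give 0, so up- and down-regulation are symmetric around zero.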
- FIG.1A is a receiver operating characteristic (ROC) plot showing performance of eight machine learning classifiers using a set of 1,178 gene features generated from ribonucleic acid (RNA) sequencing (RNA-Seq) data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.1B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using a set of 1,178 gene features to distinguish malignant lung nodules versus benign lung nodules. The corresponding data from the ROC plot of FIG.1A are tabulated in FIG.1B. The GBM, SVM, and EN classifiers were the most effective classifiers. A similar validation was performed using 75% of the dataset for training the classifiers and 25% of the dataset for validation.
- FIGs.2A-2B show results of a cross validation experiment when 75% of the dataset was considered for training the classifiers while 25% of the dataset was used for validation.
- FIG.2A is a ROC plot for an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules based on an analysis of RNA-Seq data.
- the six machine learning classifiers include LOG, GLM, kNN, RF, SVM, and GBM.
- FIG.2B shows results of exemplary trained machine learning classifier algorithms in an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules. The corresponding data from the ROC plot of FIG.2A are tabulated in FIG.2B.
- the GBM, SVM, and kNN classifiers were the most effective classifiers.
- the top 50 predictive genes from the 7 classifiers that accurately predicted lung nodules (FIGs.1A-1B) were combined. Furthermore, overlapping genes were removed, thereby yielding a gene set of 182 gene features (as shown in Table 1).
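The pooling step described above (take the top-ranked genes from each classifier, then drop overlaps) can be sketched as an order-preserving union. The classifier names are from the source, but the gene identifiers and the tiny `top_n` are placeholders; each ranking is assumed to be sorted from most to least predictive.

```python
def combine_top_genes(rankings, top_n=50):
    """Pool the top_n genes of each classifier and remove overlapping genes.

    `rankings` maps a classifier name to its genes, ordered from most to
    least predictive. The first occurrence of each gene is kept.
    """
    combined = []
    seen = set()
    for genes in rankings.values():
        for gene in genes[:top_n]:
            if gene not in seen:
                seen.add(gene)
                combined.append(gene)
    return combined


# Placeholder rankings for two of the classifiers named in the source.
toy_rankings = {
    "GBM": ["GENE_A", "GENE_B", "GENE_C"],
    "SVM": ["GENE_B", "GENE_D", "GENE_E"],
}
pooled = combine_top_genes(toy_rankings, top_n=2)
```

With seven classifiers and `top_n=50`, the pooled set can hold at most 350 genes; the more the classifiers agree, the smaller it becomes.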
- Performance of the classifiers using only the 182 gene features as compared to the 1,178 gene features in predicting lung nodules was examined.
- FIG.3A is a ROC plot showing performance of seven machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- the corresponding data from the ROC plot of FIG.3A are tabulated in FIG.3B.
- FIG.3B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 182 gene features to distinguish malignant lung nodules versus benign lung nodules.
- Each cross validation dataset comprised 80% training data and 20% validation data. The results demonstrated that the 182 gene features effectively distinguished malignant lung nodules versus benign lung nodules. In general, use of the 182 genes was more effective than the entire set of 1,178 genes. Furthermore, the GBM and LOG machine learning classifiers achieved better predictive values when 182 gene features were used, as compared to the entire set of 1,178 gene features.
- the SVM model achieved a specificity decrease of about 0.05, yet overall performance of the SVM model improved, when the set of 182 gene features was used, as compared to the entire set of 1,178 gene features.
- the entire set of 1,178 genes was examined independently in male subjects and female subjects.
- the GBM machine learning classifier achieved the best predictive performance for male subjects, and the NB machine learning classifier achieved the best predictive performance for female subjects, compared to other classifiers.
- a gene importance was calculated for each gene feature based on the rank of the gene feature in the GBM classifier for males and the rank of the same gene feature in the NB classifier for females.
- FIG.4A shows the ROC plot of the performance of the classifiers using 175 genes over the entire dataset (males and females).
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.4B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.4A.
- FIG.5A shows the ROC plot of the eight classifiers’ performance using the 175 gene features with a 10-fold validation technique with an 80% training and 20% validation split.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- the corresponding data from the ROC plot of FIG.5A are tabulated in FIG.5B.
- the GBM and SVM classifiers achieved the highest predictive values using the 175 gene features.
- MAP2K 9 2 EIF2AK ACTN4 CCDC94 4 HABP4 MED28 PDIA4 SEPT11 TMEM218 - L 3 6
- the set of 175 gene features and the set of 182 gene features shared a total of 62 gene features.
- the 62 gene features were examined for their effectiveness in predicting lung nodules using the biomarkers dataset. 10-fold cross validation with a training to validation split of 75% and 25% was used. FIG.6B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.6A.
- FIG.6A is a ROC plot showing performance of machine learning classifiers using a set of the 62 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- the set of 62 gene features achieved high predictive value across all eight classifiers.
- FIG.7A is a ROC plot showing performance of machine learning classifiers using a set of 295 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.7B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.7A. All classifiers except GLM achieved high predictive values in classifying lung nodules using the biomarkers dataset.
- Feature selection was performed to reduce the set of features from 1,178 genes to one of (i) a set of 295 genes, (ii) a set of 182 genes, (iii) a set of 175 genes, or (iv) a set of 62 genes, which achieved positive results in distinguishing malignant lung nodules from benign lung nodules.
- larger datasets were investigated to compensate for heterogeneity in clinical data.
- the top 50 predictors from seven classifiers were selected and, after removing overlapping genes, a set of 142 gene features (Table 5) was obtained.
- the seven classifiers were the eight classifiers excluding GLM.
- the training dataset was obtained using Ampli-Seq targeting the 175 genes determined previously.
- the training dataset comprised 301 lung nodule samples that were known to be benign and 303 samples that were diagnosed as malignant. Normalized Ampli-Seq read counts (RPM) of the 175 genes were provided as input data to the classifiers.
- Results of the eight classifiers in a 10-fold validation using a data split of 80% training data to 20% validation data are shown in FIGs.8A-8B.
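The text combines "10-fold validation" with an 80%/20% split but does not spell out the resampling scheme. One plausible reading, assumed here, is ten independent random splits of the samples into 80% training and 20% validation. The sketch below generates such splits by index; the function name, the seed, and this interpretation are assumptions, while the sample count of 604 is the training dataset size given in the text.

```python
import random


def repeated_holdout_splits(n_samples, n_repeats=10, train_fraction=0.8, seed=0):
    """Return n_repeats random (train_indices, validation_indices) pairs."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = int(n_samples * train_fraction)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


# 604 samples, as in the training dataset described in the text.
splits = repeated_holdout_splits(604)
train_size = len(splits[0][0])
validation_size = len(splits[0][1])
```

Each classifier would then be fit on the training indices of a split and scored on its validation indices, with the reported figures summarizing the ten runs. A stratified variant that preserves the benign-to-malignant ratio in each split is a common refinement.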
- FIG.8A is a ROC plot showing performance of machine learning classifiers using a set of 175 gene features generated from Ampli-Seq data to distinguish malignant lung nodules versus benign lung nodules.
- the eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
- FIG.8B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.8A.
- a similar 10-fold validation was performed using a training to validation data split of 75% to 25%.
- Example 3 Machine Learning Classification and Validation using Ampli-Seq data
- [0578] The performance of the machine learning classifiers of Example 2 was validated using a dataset of lung nodule samples from 487 subjects. The validation dataset was obtained using Ampli-Seq targeting the set of 175 genes. The validation dataset comprised 142 lung nodule samples that were diagnosed as being malignant.
- FIG.9A is a cumulative fraction of lung nodules predicted by a logistic regression classifier using a set of 175 gene features.
- FIG.9B is a cumulative fraction of lung nodules predicted by a gradient boosting classifier using the set of 175 gene features.
- FIG.12 shows the correlation plot of the 8 clinical characteristics features (Table 6).
- Table 6: Clinical Ch… (only the entry SEX, sex of the subject, survives) [0583]
- Nine machine learning classifiers including Logistic regression model (LOG), Random forest (RF), Support vector machines (SVM), Decision tree learning (DTREE), Adaptive boosting (ADB), Naïve Bayes (NB), Linear discriminant analysis (LDA), k-nearest neighbors (kNN), and Gradient boosting machines (GBM), were trained to distinguish malignant lung nodules versus benign lung nodules based on clinical characteristics data of the 8 clinical characteristics features (Table 6).
- FIG.13A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features (Table 6), to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.803, 0.782, 0.393, 0.618, 0.792, 0.806, 0.804, 0.750 and 0.764 respectively.
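For readers less familiar with the metric, the area under the ROC curve equals the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it directly from that definition; the labels and scores are made up for illustration and are unrelated to the values reported for FIG.13A.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 1 = malignant, 0 = benign.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

Under this definition a value of 0.5 corresponds to random ranking, and a value below 0.5 means the classifier ranks benign samples above malignant ones more often than not.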
- FIG.13B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules.
- AUC of the Precision/Recall curve for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM, are 0.703, 0.688, 0.351, 0.656, 0.720, 0.710, 0.699, 0.766 and 0.646 respectively.
- FIG.13C presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.13A.
- FIG.13D presents feature importance of the 8 clinical characteristics features for the 9 machine learning classifiers.
- FIG.13E shows feature importance of the 8 clinical characteristics features for all the 9 classifiers.
- the three top most contributors or predictors or features were NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE, with the fourth being NCNMYN (Nodule Spiculated).
- NCNSZE nodule size
- NCNUPYN nodule in the upper lobe
- AGE
- NCNMYN Nodule Spiculated
- FIG.14A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, NCNSZE, NCNUPYN, AGE, and NCNMYN to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.858, 0.730, 0.840, 0.586, 0.736, 0.811, 0.862, 0.725 and 0.735 respectively.
- FIG.14B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, NCNSZE, NCNUPYN, AGE, and NCNMYN, to distinguish malignant lung nodules versus benign lung nodules.
- AUC of the Precision/Recall curve for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM, are 0.746, 0.703, 0.791, 0.626, 0.598, 0.695, 0.750, 0.653 and 0.689 respectively.
- FIG.14C presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.14A.
- FIG.14D presents feature importance of the 4 clinical characteristics features for the 9 machine learning classifiers.
- FIG.14E shows feature importance of the 4 clinical characteristics features for all the 9 classifiers.
- Performance of the classifiers when using the top 4 predictors (NCNSZE, NCNUPYN, AGE, and NCNMYN) was better than when using all 8 predictors (Table 6).
- FIG.15A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the larger dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.773, 0.745, 0.730, 0.661, 0.771, 0.786, 0.768, 0.654 and 0.757 respectively.
- FIG.15B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules.
- FIG.15C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.15A.
- FIG.15D shows feature importance of the 9 clinical characteristics features for the 9 machine learning classifiers.
- FIG.15E shows feature importance of the 9 clinical characteristics features for all the 9 models.
- Example 5 Machine Learning Classification using gene expression data and clinical characteristics data.
- NCNSZE nodule size
- NCNUPYN nodule in the upper lobe
- AGE
- FIG.16A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 142 gene features, and clinical characteristics data of the 3 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the combined dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.919, 0.819, 0.829, 0.660, 0.690, 0.783, 0.905, 0.826 and 0.795 respectively.
- FIG.16B shows Precision/Recall curve of the 9 machine learning classifiers using gene expression data of the 142 gene features, and clinical characteristics data of the 3 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules.
- AUC of the Precision/Recall curve for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM, are 0.854, 0.780, 0.756, 0.632, 0.619, 0.663, 0.754, 0.764 and 0.687 respectively.
- FIG.16C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A.
- FIG.16D presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A, with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules).
- As can be seen from FIGs. 16C and 16D, relatively high predictive value can be achieved using the set of 142 gene features (Table 5) and a set of 3 clinical characteristics (NCNSZE, NCNUPYN, and AGE) as features.
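The oversampling correction referred to for FIG.16D balances the two classes at 80 samples each. One simple way to reach such a balance is random oversampling with replacement of the smaller class, sketched below. The source does not say which oversampling method was used, so this choice, the function name, the seed, and the subject identifiers are all assumptions; the class sizes (80 benign, 72 malignant) follow the 152-subject dataset of Example 5.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


samples = [f"subject_{i}" for i in range(152)]  # hypothetical identifiers
labels = ["benign"] * 80 + ["malignant"] * 72
balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(balanced_labels.count("benign"), balanced_labels.count("malignant"))
```

When oversampling is combined with cross validation, it is generally applied to the training portion only, so that duplicated samples cannot appear on both sides of a split.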
- the top two contributors or predictors or features were nodule size and BCAT1 gene.
- Table 7 shows the top 34 predictors obtained from the machine learning classifier using the combined dataset of Example 5.
- Table 7 contains 31 lung-disease-associated genes and 3 clinical characteristics (e.g., NCNSZE, NCNUPYN, and AGE).
- Table 7: Top 34 predictors from … (only the entries VPS37C and AGE survive)
- [0591] Next, the top 34 predictors were examined for their effectiveness in predicting lung nodules. A biomarker data set for the top 34 predictors was obtained from the 152 subjects. As described above, among those, 80 subjects had a diagnosis of a benign lung nodule, and 72 subjects had a diagnosis of a malignant lung nodule. The top 34 predictors contain 31 genes and NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE, as predictors.
- NCNSZE nodule size
- NCNUPYN nodule in the upper lobe
- AGE
- FIG.17A shows ROC plots showing performance of the 9 machine learning classifiers using measurement data (e.g. gene expression data or clinical characteristics data as appropriate) of the 34 predictors to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.992, 0.867, 0.950, 0.675, 0.800, 0.854, 0.963, 0.835 and 0.842 respectively.
- FIG.17B shows Precision/Recall curve of the 9 machine learning classifiers using measurement data of the 34 predictors to distinguish malignant lung nodules versus benign lung nodules.
- AUC of the Precision/Recall curve for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM, are 0.988, 0.807, 0.931, 0.687, 0.747, 0.815, 0.943, 0.814 and 0.811 respectively.
- FIG.17C presents the tabulated results of the machine learning classifiers LOG and RF corresponding to FIG.17A.
- FIG.17D presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.17A, with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules).
- FIG.17E shows feature importance of the 34 features for all the 9 classifiers. As can be seen from FIGs. 17C and 17D, relatively high predictive value can be achieved using the 34 predictors containing the set of genes and clinical characteristics of Table 7.
- Example 6 Machine Learning Classification using gene expression data and clinical characteristics data.
- [0593] A combination of a set of 175 gene features (Table 2) and a set of 4 clinical characteristics features was examined for its effectiveness in predicting lung nodules. The 175 gene features were selected based on results of Examples 1, 2 and 3.
- NCNSZE nodule size
- NCNUPYN nodule in the upper lobe
- AGE
- NCNMYN Nodule Spiculated
- FIG.18A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 175 gene features, and clinical characteristics data of the 4 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the combined biomarkers dataset was used.
- AUC of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.674, 0.698, 0.669, 0.702, 0.723, 0.657, 0.630, 0.560 and 0.784 respectively.
- FIG.18B shows Precision/Recall curve of the 9 machine learning classifiers using gene expression data of the 175 gene features, and clinical characteristics data of the 4 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules.
- AUC of the Precision/Recall curve for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM, are 0.635, 0.724, 0.664, 0.727, 0.663, 0.630, 0.544, 0.550 and 0.729 respectively.
- FIG.18C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.18A.
- Table 8 shows the top 22 predictors obtained from the machine learning classifier using the combined dataset of Example 6.
- Table 8: Top 22 predictors from Exa… (only the entry NCNSZE survives)
- Example 7: Machine Learning Classification of Pancreatic tumor
- Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects can be analyzed.
- the gene expression measurements of the whole blood samples can be analyzed using RNA-Seq technique.
- some of the subjects can have a diagnosis of a benign pancreatic tumor, and some other of the subjects can have a diagnosis of a malignant pancreatic tumor.
- One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression, can be trained to distinguish malignant pancreatic tumors versus benign pancreatic tumors based on analysis of the RNA-Seq data, and clinical characteristics data.
- GBM Gradient boosting machines
- LOG Logistic regression model
- SVM Support vector machines
- RF Random forest
- a first group of genes were initially identified to be differentially expressed between samples from subjects containing malignant pancreatic tumors and samples from subjects containing benign pancreatic tumors.
- A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a first gene set containing a group of genes related to pancreatic cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
- the first gene set can be obtained from the first group of genes after removing a subset of genes that exhibit collinear expression (for example correlation or r > 0.8).
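The two steps above (a log2 ratio per gene, then removal of genes whose expression is collinear with a gene already kept) can be sketched in plain Python. Reading the Log2 ratio as the base-2 logarithm of T/R, the use of a pseudocount to avoid division by zero, the greedy order in which genes are kept, and all names and values below are assumptions made for illustration; only the r > 0.8 cut-off comes from the text.

```python
import math


def log2_ratio(t, r, pseudocount=1.0):
    """log2 of test over reference expression; the pseudocount is an assumption."""
    return math.log2((t + pseudocount) / (r + pseudocount))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return cov / (sx * sy)


def drop_collinear(expression, threshold=0.8):
    """Keep a gene only if its r with every already-kept gene is <= threshold."""
    kept = []
    for gene, profile in expression.items():
        if all(pearson(profile, expression[k]) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical expression profiles across four samples.
toy_expression = {
    "GENE_1": [1, 2, 3, 4],
    "GENE_2": [2, 4, 6, 8.5],  # nearly proportional to GENE_1
    "GENE_3": [4, 1, 3, 2],
}
print(log2_ratio(7, 1))
print(drop_collinear(toy_expression))
```

Whether strongly negative correlations should also count as collinear is not stated in the text; comparing the absolute value of r against the threshold would cover that case.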
- a first combined biomarker data set containing genes of the first gene set, and clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer, as features can be examined for their effectiveness in classifying pancreatic tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the first gene set, and clinical characteristics data of the clinical characteristics related to pancreatic cancer to distinguish malignant pancreatic tumors versus benign pancreatic tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used.
- the feature importance for the classifiers can be determined. Based on feature importance values a first optimal predictor set containing a first optimal gene set and a first optimal clinical characteristics set can be selected.
- the first optimal predictor set can be obtained by combining top 50 predictors of the machine learning classifier being used, and removing the duplicating features.
- A second combined biomarker data set (from the plurality of subjects), containing genes of the first optimal gene set, and clinical characteristics of the first optimal clinical characteristics set, as features can be examined for their effectiveness in classifying pancreatic tumors. Performance of the machine learning classifiers using gene expression data of the genes of the first optimal gene set, and clinical characteristics data of the clinical characteristics of the first optimal clinical characteristics set, to distinguish malignant pancreatic tumors versus benign pancreatic tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used.
- the Machine learning models can distinguish malignant pancreatic tumors versus benign pancreatic tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the first optimal gene set, and clinical characteristics data of the clinical characteristics of the first optimal clinical characteristics set.
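The five performance measures named above all derive from the four cells of a confusion matrix. The sketch below shows the standard definitions, treating "malignant" as the positive class; the function name and the example labels are hypothetical and are not results from any of the Examples.

```python
def binary_metrics(y_true, y_pred, positive="malignant"):
    """Accuracy, sensitivity, specificity, PPV and NPV from paired labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


M, B = "malignant", "benign"
toy_truth = [M, M, M, B, B, B, B, M, B]
toy_pred = [M, M, B, B, B, M, B, M, B]
metrics = binary_metrics(toy_truth, toy_pred)
print(metrics)
```

Unlike sensitivity and specificity, PPV and NPV depend on the proportion of malignant cases in the evaluated cohort, so they are not directly comparable between datasets with different class balances.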
- the first optimal gene set can be capable of classifying a pancreatic tumor as benign or malignant.
- Example 8 Machine Learning Classification of Ovarian tumor
- [0598] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign ovarian tumors and malignant ovarian tumors.
- Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects can be analyzed.
- the gene expression measurements of the whole blood samples can be analyzed using RNA-Seq technique.
- some of the subjects can have a diagnosis of a benign ovarian tumor, and some other of the subjects can have a diagnosis of a malignant ovarian tumor.
- One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression, can be trained to distinguish malignant ovarian tumors versus benign ovarian tumors based on analysis of the RNA-Seq data, and clinical characteristics data.
- GBM Gradient boosting machines
- LOG Logistic regression model
- SVM Support vector machines
- RF Random forest
- GLM Generalized linear model
- kNN k-nearest neighbors
- NB Naïve Bayes
- EN Elastic Networks
- LDA linear discriminant analysis
- DTREE decision tree learning
- ADB adaptive boosting
- Ridge regression and Lasso regression
- a second group of genes were initially identified to be differentially expressed between samples from subjects containing malignant ovarian tumors and samples from subjects containing benign ovarian tumors.
- A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a second gene set, containing a group of genes related to ovarian cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
- The second gene set can be obtained from the second group of genes after removing a subset of genes that exhibit collinear expression (for example correlation or r > 0.8).
- a first combined biomarker data set containing genes of the second gene set, and clinical characteristics selected from a group of clinical characteristics related to ovarian cancer, as features can be examined for their effectiveness in classifying ovarian tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the second gene set, and clinical characteristics data of the clinical characteristics related to ovarian cancer to distinguish malignant ovarian tumors versus benign ovarian tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used.
- The feature importance for the classifiers can be determined. Based on feature importance values a second optimal predictor set containing a second optimal gene set and a second optimal clinical characteristics set can be selected.
- the second optimal predictor set can be obtained by combining top 50 predictors of the machine learning classifier being used, and removing the duplicating features.
- a second combined biomarker data set (from the plurality of subjects), containing genes of the second optimal gene set, and clinical characteristics of the second optimal clinical characteristics set, as features can be examined for their effectiveness in classifying ovarian tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the second optimal gene set, and clinical characteristics data of the clinical characteristics of the second optimal clinical characteristics set, to distinguish malignant ovarian tumors versus benign ovarian tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used.
- the Machine learning models can distinguish malignant ovarian tumors versus benign ovarian tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the second optimal gene set, and clinical characteristics data of the clinical characteristics of the second optimal clinical characteristics set.
- the second optimal gene set can be capable of classifying an ovarian tumor as benign or malignant.
- Example 9 Machine Learning Classification of Brain tumor
- [0601] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign brain tumors and malignant brain tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects can be analyzed.
- the gene expression measurements of the whole blood samples can be analyzed using RNA-Seq technique.
- some of the subjects can have a diagnosis of a benign brain tumor, and some other of the subjects can have a diagnosis of a malignant brain tumor.
- One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression, can be trained to distinguish malignant brain tumors versus benign brain tumors based on analysis of the RNA-Seq data, and clinical characteristics data.
- GBM Gradient boosting machines
- LOG Logistic regression model
- SVM Support vector machines
- RF Random forest
- GLM Generalized linear model
- kNN k-nearest neighbors
- NB Naïve Bayes
- EN Elastic Networks
- LDA linear discriminant analysis
- DTREE decision tree learning
- ADB adaptive boosting
- Ridge regression and Lasso regression
- A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a third gene set, containing a group of genes related to brain cancer.
- The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
- the third gene set can be obtained from the third group of genes after removing a subset of genes that exhibit collinear expression (for example correlation or r > 0.8).
- a first combined biomarker data set containing genes of the third gene set, and clinical characteristics selected from a group of clinical characteristics related to brain cancer, as features can be examined for their effectiveness in classifying brain tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the third gene set, and clinical characteristics data of the clinical characteristics related to brain cancer to distinguish malignant brain tumors versus benign brain tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used.
- the feature importance for the classifiers can be determined. Based on feature importance values a third optimal predictor set containing a third optimal gene set and a third optimal clinical characteristics set can be selected.
- The third optimal predictor set can be obtained by combining top 50 predictors of the machine learning classifier being used, and removing the duplicating features.
- a second combined biomarker data set (from the plurality of subjects), containing genes of the third optimal gene set, and clinical characteristics of the third optimal clinical characteristics set, as features can be examined for their effectiveness in classifying brain tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the third optimal gene set, and clinical characteristics data of the clinical characteristics of the third optimal clinical characteristics set, to distinguish malignant brain tumors versus benign brain tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used.
- the Machine learning models can distinguish malignant brain tumors versus benign brain tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the third optimal gene set, and clinical characteristics data of the clinical characteristics of the third optimal clinical characteristics set.
- the third optimal gene set can be capable of classifying a brain tumor as benign or malignant.
- Example 10 Machine Learning Classification of Kidney tumor
- [0604] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign kidney tumors and malignant kidney tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects can be analyzed.
- the gene expression measurements of the whole blood samples can be analyzed using RNA-Seq technique.
- some of the subjects can have a diagnosis of a benign kidney tumor, and some other of the subjects can have a diagnosis of a malignant kidney tumor.
- One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression, can be trained to distinguish malignant kidney tumors versus benign kidney tumors based on analysis of the RNA-Seq data, and clinical characteristics data.
- GBM Gradient boosting machines
- LOG Logistic regression model
- SVM Support vector machines
- RF Random forest
- GLM Generalized linear model
- kNN k-nearest neighbors
- NB Naïve Bayes
- EN Elastic Networks
- LDA linear discriminant analysis
- DTREE decision tree learning
- ADB adaptive boosting
- Ridge regression and Lasso regression
- A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a fourth gene set, containing a group of genes related to kidney cancer.
- The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
- the fourth gene set can be obtained from the fourth group of genes after removing a subset of genes that exhibit collinear expression (for example correlation or r > 0.8).
- A first combined biomarker data set containing genes of the fourth gene set, and clinical characteristics selected from a group of clinical characteristics related to kidney cancer, as features can be examined for their effectiveness in classifying kidney tumors.
- Performance of the machine learning classifiers using gene expression data of the genes of the fourth gene set, and clinical characteristics data of the clinical characteristics related to kidney cancer to distinguish malignant kidney tumors versus benign kidney tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used. The feature importance for the classifiers can be determined. Based on feature importance values a fourth optimal predictor set containing a fourth optimal gene set and a fourth optimal clinical characteristics set can be selected. The fourth optimal predictor set can be obtained by combining top 50 predictors of the machine learning classifier being used, and removing the duplicating features.
- A second combined biomarker data set (from the plurality of subjects), containing genes of the fourth optimal gene set, and clinical characteristics of the fourth optimal clinical characteristics set, as features can be examined for their effectiveness in classifying kidney tumors. Performance of the machine learning classifiers using gene expression data of the genes of the fourth optimal gene set, and clinical characteristics data of the clinical characteristics of the fourth optimal clinical characteristics set, to distinguish malignant kidney tumors versus benign kidney tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used.
- the Machine learning models can distinguish malignant kidney tumors versus benign kidney tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the fourth optimal gene set, and clinical characteristics data of the clinical characteristics of the fourth optimal clinical characteristics set.
- the fourth optimal gene set can be capable of classifying a kidney tumor as benign or malignant.
- Example 11 Obtaining biomarker gene sets for classifying whether a patient has cancer
- [0607] An initial dataset containing gene expression measurements of genes of an initial gene set from a plurality of reference samples is obtained.
- The plurality of reference samples contains a first plurality of reference samples obtained or derived from subjects having cancer, and a second plurality of reference samples obtained or derived from subjects not having cancer. The 2000 most variably expressed genes of the initial dataset are selected.
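Selecting the most variably expressed genes can be sketched as a ranking by per-gene variance across samples. The source does not state which variability statistic was used, so the population variance here is an assumption, as are the function name and the toy profiles; `n_top` is reduced to 2 for the example, whereas the text uses 2000.

```python
from statistics import pvariance


def most_variable_genes(expression, n_top=2000):
    """Return the n_top genes with the largest variance across samples."""
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:n_top]


# Hypothetical expression profiles across four reference samples.
toy_profiles = {
    "GENE_1": [5, 5, 5, 5],  # variance 0
    "GENE_2": [1, 9, 1, 9],  # variance 16
    "GENE_3": [4, 6, 4, 6],  # variance 1
    "GENE_4": [0, 1, 0, 1],  # variance 0.25
}
top_genes = most_variable_genes(toy_profiles, n_top=2)
print(top_genes)
```

In practice the variance is often computed on log-transformed expression values so that highly expressed genes do not dominate the ranking; whether that was done here is not stated.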
- the 2000 genes are clustered using PPI-based MCODE.
- The PPI-based MCODE gene clusters are used as feature inputs for the SGFI algorithm. Multiple subsample iterations are run, and the cluster sets that best distinguish subjects having cancer from subjects not having cancer are selected. For each cluster set (e.g., feature set), a radial SVM model is created and hyperparameters are tuned. 10-fold CV is performed and the feature set having the highest F1-score is selected. Differentially expressed genes from the selected feature set are selected to obtain the gene set capable of classifying whether a patient has cancer.
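The final selection step above, choosing the feature set with the highest F1-score, can be illustrated as follows. The sketch assumes that cross-validated predictions are already available for each candidate cluster set; the SVM fitting and hyperparameter tuning are not shown, and the function names, cluster-set names, and predictions are hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_feature_set(y_true, predictions_by_set):
    """Return the name of the feature set whose predictions score the highest F1."""
    return max(predictions_by_set, key=lambda name: f1_score(y_true, predictions_by_set[name]))


# Hypothetical cross-validated predictions (1 = cancer, 0 = no cancer).
truth = [1, 1, 1, 0, 0, 0]
cv_predictions = {
    "cluster_set_A": [1, 1, 0, 0, 0, 1],  # F1 = 2/3
    "cluster_set_B": [1, 1, 1, 0, 1, 0],  # F1 = 6/7
}
best_set = select_best_feature_set(truth, cv_predictions)
print(best_set)
```

The F1-score ignores true negatives, which makes it a stricter criterion than accuracy when subjects without cancer greatly outnumber those with cancer.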
- the first plurality of reference samples are obtained or derived from subjects having kidney cancer, and second plurality of reference samples obtained or derived from subjects not having kidney cancer, and the gene set obtained is capable of classifying whether a patient has kidney cancer.
- [0609]
- the first plurality of reference samples are obtained or derived from subjects having brain cancer, and the second plurality of reference samples are obtained or derived from subjects not having brain cancer, and the gene set obtained is capable of classifying whether a patient has brain cancer.
- the first plurality of reference samples are obtained or derived from subjects having ovarian cancer, and the second plurality of reference samples are obtained or derived from subjects not having ovarian cancer, and the gene set obtained is capable of classifying whether a patient has ovarian cancer.
- the first plurality of reference samples are obtained or derived from subjects having pancreatic cancer, and the second plurality of reference samples are obtained or derived from subjects not having pancreatic cancer, and the gene set obtained is capable of classifying whether a patient has pancreatic cancer.
- the first plurality of reference samples are obtained or derived from subjects having lung cancer, and the second plurality of reference samples are obtained or derived from subjects not having lung cancer, and the gene set obtained is capable of classifying whether a patient has lung cancer.
- Example 12 Obtaining biomarker gene sets for classifying whether a patient has lung cancer.
- Method 1: Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) is carried out on transcriptomic profiles of lung cancer subjects. MEGENA-generated co-expression modules that are significantly correlated with the clinical feature “Diagnosis” are used as features for the sequential grouped feature importance (SGFI) algorithm. SGFI identifies the best combination of features that can distinguish the malignant from benign lung cancer samples.
- SGFI sequential grouped feature importance
- the model starts with a null model and then sequentially adds the next best feature group in a leave-one-group-in fashion until no further improvement in the model metrics is observed.
- MEGENA modules that are significantly correlated with the diagnosis clinical variable are identified.
- the SGFI algorithm identifies the combination of feature groups among the identified MEGENA modules that best classifies the malignant from benign lung cancer nodules.
- the best feature groups identified by SGFI are used as the final features, and machine learning classifiers are built to distinguish the malignant from benign cancer samples.
- MEGENA and SGFI are implemented in R.
- Method 2: Differential Gene Expression (DEG) analysis is performed between the malignant and benign lung cancer nodules using the limma package in R.
- DEG Differential Gene Expression
- the significant DE genes (FDR p-value < 0.05) are used as features for the SGFI algorithm to identify the best combination of features that can classify the malignant from benign cancer samples with high accuracy.
- Method 3: Differential Gene Expression (DEG) analysis is performed between the malignant and benign lung cancer nodules using the limma package in R. MCODE is performed on the significant DE genes (FDR p-value < 0.05) to identify protein-protein interaction (PPI) networks.
- the PPI-based MCODE clusters are used as features for the SGFI algorithm to identify the best combination of feature groups that classifies the malignant from benign lung cancer samples with high accuracy.
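Methods 2 and 3 above keep only genes that pass an FDR cutoff of 0.05. The following is a minimal sketch of that filtering step, assuming the FDR values are Benjamini-Hochberg adjusted p-values. The source performs the analysis with limma in R, so this Python version is a stand-in for illustration only. The gene names and raw p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_significant(gene_pvalues, fdr=0.05):
    """Keep genes whose adjusted p-value is below the FDR cutoff."""
    genes = list(gene_pvalues)
    adjusted = benjamini_hochberg([gene_pvalues[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q < fdr]


# Hypothetical raw p-values, for illustration only.
raw = {"GENE_A": 0.001, "GENE_B": 0.012, "GENE_C": 0.04, "GENE_D": 0.30}
significant = select_significant(raw)
```

In this toy input GENE_C has a raw p-value below 0.05 but an adjusted value above it, so it is dropped; this is the practical difference between filtering on raw and on FDR-adjusted p-values.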
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Molecular Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Genetics & Genomics (AREA)
- Biomedical Technology (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Pure & Applied Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Algebra (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present disclosure provides systems and methods for machine learning classification of solid tumor based on gene expression data and optionally clinical characteristics data. The method can include: a) obtaining a data set containing gene expression measurements of a biological sample from a patient of at least two lung disease-associated genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and optionally clinical characteristics data of one or more clinical characteristics of the patient; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
Description
Attorney Docket No.225234-718601/PCT MACHINE LEARNING CLASSIFICATION OF SOLID TUMORS BASED ON GENE EXPRESSION CROSS-REFERENCE [0001] This application claims the benefit of priority to U.S. Provisional Application No.63/539,073, filed on September 18, 2023, which is incorporated herein by reference in its entirety. BACKGROUND [0002] Early diagnosis of cancer can be critical for successful treatment. Often cancer diagnosis requires biopsy of a solid tumor to determine whether it is malignant or benign. Certain tumors including lung, pancreatic, ovarian, kidney and brain, require collecting biopsy samples from the patients through invasive and/or painful surgery. The risks of such procedures include tissue and organ damage. For example, brain tumor biopsy can cause brain injury due to the removal of brain tissue, nerve damage from surgical tools used during tissue removal, and seizures due to the resulting scars left on the brain. Lung nodules are common, often detected in screenings of patients experiencing no symptoms of lung disease. Among subjects having lung nodules, only a fraction are eventually diagnosed with a cancer. Noncancerous causes of lung nodules can include e.g., mycobacterial or fungal infection, autoimmune diseases, air pollutants, and scarring from previous insult. Large lung nodules typically warrant an invasive biopsy or removal by thoracic surgery. The percentage of lung nodules eventually identified as cancerous has been estimated to be as low as 40%. Given the potential harm of biopsy or thoracic surgery, less invasive testing for lung cancer is needed. A simple noninvasive test for such solid tumors, e.g., a blood test, would greatly reduce both the potential for patient harm and medical costs. SUMMARY [0003] The present disclosure provides systems and methods for assessing a solid tumor of a patient using machine learning and uses thereof. 
In some embodiments, provided are methods and systems for assessing a solid tumor of a patient by classifying gene expression with machine learning. Certain Embodiments [0004] The present disclosure provides systems and methods for assessing a solid tumor of a patient using machine learning and uses thereof. [0005] Provided herein are methods for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the methods comprising: a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii)
data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics; c) determining feature importance values of the plurality of genes; and d) determining the gene set based at least in part on the feature importance values. [0006] In some embodiments, the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor; (ii) a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer; (iii) an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer; (iv) a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer; (v) a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer; (vi) a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer; (vii) an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer; (viii) a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain 
cancer; or (ix) a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0007] Provided herein are methods for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant,
based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics; (c) determining feature importance values of the one or more predictors of the first machine learning model; (d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and (e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors. 
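The two-stage procedure of steps (a) to (e) can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: a univariate separation score stands in for the feature importance values of the first model, a nearest-centroid classifier stands in for the second model, and the function names and toy data are hypothetical. The toy example uses A = 2 only to keep the data small; the text above specifies A from 5 to 2000.

```python
from statistics import mean, pstdev


def importance_scores(X, y):
    """Stand-in importance: |difference of class means| / summed class spread."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = (pstdev(pos) + pstdev(neg)) or 1.0  # avoid division by zero
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return scores


def top_predictors(scores, a):
    """Indices of the A predictors with the highest importance scores."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:a]


def fit_centroids(X, y, cols):
    """Second-stage model: one centroid per class over the selected predictors."""
    centroids = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in cols]
    return centroids


def predict(centroids, row, cols):
    """Return the label (1 = malignant, 0 = benign) of the nearest centroid."""
    def dist(label):
        return sum((row[j] - c) ** 2 for j, c in zip(cols, centroids[label]))
    return min(centroids, key=dist)


# Hypothetical reference data: rows are subjects, columns are predictors.
X = [[5.0, 1.0, 9.0, 3.0], [5.2, 3.0, 8.8, 1.0], [4.9, 2.0, 9.1, 2.0],
     [1.0, 2.0, 2.0, 2.0], [1.1, 1.0, 2.2, 3.0], [0.9, 3.0, 1.9, 1.0]]
y = [1, 1, 1, 0, 0, 0]

selected = top_predictors(importance_scores(X, y), a=2)
model = fit_centroids(X, y, selected)
label_new = predict(model, [4.8, 2.5, 8.5, 2.0], selected)
```

Retraining on only the selected predictors keeps the final model small, which matters when the measurement panel for a patient sample is limited to those A predictors.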
[0008] In some embodiments, the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor; (ii) a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer; (iii) an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer; (iv) a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer; (v) a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer; (vi) a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer; (vii) an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer; (viii) a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer; (ix) a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0009] In some embodiments, the A predictors have top 5 to 200 feature importance values. 
[0010] In some embodiments, the trained machine learning model has: (i) an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) a positive predictive
value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (v) a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (vi) a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. [0011] In some embodiments, the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof. 
[0012] Provided herein are methods for assessing a solid tumor of a patient, the method comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. [0013] In some embodiments, the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor. [0014] In some embodiments, the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
[0015] In some embodiments, the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. [0016] In some embodiments, the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the brain tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. [0017] In some embodiments, the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set described herein, wherein the gene set is capable of classifying the kidney tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0018] In some embodiments, the machine-learning model is trained according to any of the methods described herein. [0019] In some embodiments, the method described herein, wherein: (i) the patient has cancer; (ii) the patient does not have cancer; (iii) the patient is at an elevated risk of having cancer; or (iv) the patient is asymptomatic for cancer; optionally wherein the cancer is a pancreatic cancer, an ovarian cancer, or a brain cancer. [0020] In some embodiments, the method described herein, further comprising administering a treatment based on a solid tumor of a patient being classified as malignant. [0021] In some embodiments, the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. 
[0022] In some embodiments, the inference includes a confidence value between 0 and 1 that the solid tumor is malignant. [0023] In some embodiments, the method described herein, comprising: (i) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) classifying the solid tumor of the
patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (v) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0024] In some embodiments, a machine learning model is trained and has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. 
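For reference, the performance measures named above (accuracy, sensitivity, specificity, positive and negative predictive value, and ROC AUC) can be computed from labels and model outputs as in the sketch below. The labels, scores, and the 0.5 decision threshold are hypothetical, and the functions assume both classes are present (1 = malignant, 0 = benign).

```python
def classification_summary(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # malignant tumors correctly flagged
        "specificity": tn / (tn + fp),   # benign tumors correctly cleared
        "ppv": tp / (tp + fp),           # flagged tumors that are malignant
        "npv": tn / (tn + fn),           # cleared tumors that are benign
    }


def roc_auc(y_true, scores):
    """AUC as the chance a malignant sample outscores a benign one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and model confidence values.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

summary = classification_summary(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The first five measures depend on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds.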
[0025] Provided herein are methods for treating cancer in a patient having a solid tumor, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on a solid tumor of a patient being classified as a malignant tumor. [0026] In some embodiments, the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. [0027] Provided herein are systems for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to:
obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a dataset as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the inference indicating whether a composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. [0028] Provided herein are non-transitory computer-readable media storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set described herein, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a data set as an input to a machine learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the inference indicating whether a composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the 
solid tumor of the patient as the malignant solid tumor or the benign solid tumor. [0029] Provided herein are methods for obtaining a gene set capable of classifying whether a patient has cancer, the method comprising: (a) providing a dataset as an input to a machine learning classifier, said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having
cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of gene modules form features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer. [0030] In some embodiments, the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm. [0031] In some embodiments, feature selection comprises starting from a featureless model, and sequentially adding the next best feature using leave-one-group-in importance (LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved. [0032] In some embodiments, the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001. [0033] In some embodiments, the dataset is a batch corrected dataset. [0034] In some embodiments, the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurement from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules. [0035] In some embodiments, the M genes are clustered based on protein-protein interaction of proteins encoded by the M genes. [0036] In some embodiments, the M genes are M most variably expressed genes of the initial data set. [0037] In some embodiments, M is 500 to 10000. 
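The forward selection described in [0031] and [0032] can be sketched as below: begin with a featureless model, try each remaining gene module in turn, keep the one that lowers the mean misclassification error (MMCE) the most, and stop once the gain no longer exceeds the improvement threshold. This is an illustrative reading rather than the SGFI implementation used in the disclosure (which elsewhere is stated to be in R). The nearest-centroid classifier, the leave-one-out error estimate, the function names, and the toy data are all assumptions.

```python
from statistics import mean


def loo_mmce(X, y, cols):
    """Leave-one-out misclassification rate of a nearest-centroid classifier."""
    if not cols:
        # Featureless model: always predict the most frequent class.
        return 1 - max(y.count(label) for label in set(y)) / len(y)
    errors = 0
    for i in range(len(X)):
        train = [(row, lab) for k, (row, lab) in enumerate(zip(X, y)) if k != i]
        best_label, best_dist = None, float("inf")
        for label in sorted(set(y)):
            rows = [row for row, lab in train if lab == label]
            centroid = [mean(r[j] for r in rows) for j in cols]
            dist = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
            if dist < best_dist:
                best_label, best_dist = label, dist
        errors += best_label != y[i]
    return errors / len(X)


def forward_group_selection(X, y, modules, threshold=0.0001):
    """Greedily add gene modules while MMCE improves by more than threshold."""
    selected, cols = [], []
    best_error = loo_mmce(X, y, cols)
    remaining = dict(modules)
    while remaining:
        # Leave-one-group-in step: score the current set plus each candidate.
        trial = {name: loo_mmce(X, y, cols + idx) for name, idx in remaining.items()}
        name = min(trial, key=trial.get)
        if best_error - trial[name] <= threshold:
            break
        selected.append(name)
        cols = cols + remaining.pop(name)
        best_error = trial[name]
    return selected, best_error


# Hypothetical data: 4 cancer (1) and 4 non-cancer (0) samples, 4 genes.
X = [[5.0, 6.0, 1.0, 2.0], [5.5, 6.2, 3.0, 2.5], [4.8, 5.9, 2.0, 1.0], [5.2, 6.1, 4.0, 3.0],
     [1.0, 2.0, 1.5, 1.2], [1.2, 2.1, 3.5, 2.8], [0.9, 1.8, 2.5, 2.2], [1.1, 2.2, 4.5, 0.8]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
modules = {"M1": [0, 1], "M2": [2], "M3": [3]}  # module name -> gene column indices

selected_modules, final_error = forward_group_selection(X, y, modules)
```

Selecting whole modules rather than single genes keeps co-regulated or interacting genes together, which is the reason the features here are groups.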
[0038] In some embodiments, the method described herein, further comprising analyzing a patient data set comprising or derived from gene expression measurement of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurement is obtained from a biological sample obtained or derived from the patient. [0039] In some embodiments, the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172,
173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, or all genes selected from the genes within the gene set obtained in step (c). [0040] In some embodiments, the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). [0041] In some embodiments, the method described herein, wherein the method: (i) classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) classifies whether the patient has cancer with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (v) classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, 
at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0042] In some embodiments, analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set. [0043] In some embodiments, the method described herein, further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference. [0044] In some embodiments, the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof.
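The inference and reporting steps of [0042] and [0043] might look like the sketch below, assuming a logistic model. The gene names, coefficients, intercept, decision threshold of 0.5, and report fields are all hypothetical; a real model would be fit to reference data and could be any of the model types listed in [0044].

```python
import json
import math

# Hypothetical coefficients, for illustration only.
WEIGHTS = {"GENE_A": 1.2, "GENE_B": -0.8}
INTERCEPT = -0.1


def infer_confidence(expression):
    """Logistic inference: confidence between 0 and 1 that the patient has cancer."""
    z = INTERCEPT + sum(w * expression[gene] for gene, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def build_report(patient_id, expression, threshold=0.5):
    """Serialize the classification as a JSON report string."""
    confidence = infer_confidence(expression)
    return json.dumps({
        "patient_id": patient_id,
        "classification": "cancer" if confidence >= threshold else "no cancer",
        "confidence": round(confidence, 3),
        "threshold": threshold,
    }, sort_keys=True)


# Hypothetical patient data set: gene -> expression measurement.
report = json.loads(build_report("patient-001", {"GENE_A": 2.0, "GENE_B": 0.5}))
```

Reporting the confidence value and the threshold alongside the label lets a reader of the report see how close the call was.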
[0045] In some embodiments, the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. [0046] In some embodiments, the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. [0047] In some embodiments, the method described herein, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. [0048] In some embodiments, the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. [0049] In some embodiments, the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 
[0050] In some embodiments, the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. [0051] Provided herein are methods for classifying whether a patient has cancer, the method comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) described herein as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine learning model the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient. [0052] In some embodiments, the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200 or all genes selected from the genes within the gene set obtained in step (c) described herein. [0053] In some embodiments, the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). [0054] In some embodiments, the patient data set is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof. [0055] In some embodiments, the patient data set is derived from the gene expression measurements using GSVA. 
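Two of the simpler derivations named in paragraph [0054], log2 expression and Z-score, can be sketched as follows. This is only an illustrative sketch: the pseudocount of 1, the use of a separate reference cohort for the mean and standard deviation, and the function names are assumptions not stated in the disclosure. GSVA and the network-based methods are not shown.

```python
import math
from statistics import mean, pstdev


def log2_expression(counts, pseudocount=1.0):
    """Log2-transform expression values; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def z_scores(values, reference):
    """Standardize values for one gene against a reference cohort for that gene.

    The choice of reference cohort (e.g., training samples) is an assumption.
    """
    mu = mean(reference)
    sigma = pstdev(reference)  # population standard deviation of the reference
    if sigma == 0:
        raise ValueError("reference values have zero variance")
    return [(v - mu) / sigma for v in values]


# Hypothetical counts for one gene across three samples.
logged = log2_expression([0, 1, 3])
standardized = z_scores([4.0], reference=[0.0, 4.0])
```

Standardizing each gene against the same reference cohort used in training keeps a new patient's values on the scale the model was fitted to.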
[0056] In some embodiments, the machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. [0057] In some embodiments, the method classifies whether the patient has cancer: (i) with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (v) with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about
91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0058] In some embodiments, the machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. [0059] In some embodiments, the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. [0060] In some embodiments, the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 
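The five performance measures recited in paragraph [0057] (accuracy, sensitivity, specificity, positive predictive value, and negative predictive value) all follow from the counts of a two-by-two confusion matrix. The sketch below uses their standard definitions; the function names and the example labels are hypothetical, with 1 standing for "has cancer" and 0 for "does not have cancer".

```python
def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    """Compute standard binary classification measures from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "ppv": _ratio(tp, tp + fp),           # positive predictive value
        "npv": _ratio(tn, tn + fn),           # negative predictive value
    }


# Hypothetical labels for eight patients: 3 with cancer, 5 without.
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 0, 0, 0, 0, 1],
)
```

Unlike sensitivity and specificity, the two predictive values depend on how common cancer is in the evaluated cohort, so they are only comparable across cohorts with similar prevalence.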
[0061] In some embodiments, the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. [0062] In some embodiments, the method described herein, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. [0063] In some embodiments, the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. INCORPORATION BY REFERENCE [0064] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material. BRIEF DESCRIPTION OF THE DRAWINGS [0065] The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by
reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which: [0066] FIG.1A is a receiver operating characteristic (ROC) plot showing performance of eight machine learning classifiers using a set of 1,178 gene features generated from ribonucleic acid (RNA) sequencing (RNA-Seq) data to distinguish malignant lung nodules versus benign lung nodules. The 1,178 genes were differentially expressed in blood samples of patients with malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0067] FIG.1B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 1,178 gene features to distinguish malignant lung nodules versus benign lung nodules. [0068] FIG.2A is a ROC plot for an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules based on an analysis of RNA-Seq data. The six machine learning classifiers include LOG, GLM, kNN, RF, SVM, and GBM. [0069] FIG.2B shows results of exemplary trained machine learning classifier algorithms in the FIG.2A optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules. [0070] FIG.3A is a ROC plot showing performance of eight machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0071] FIG.3B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 182 gene features to distinguish malignant lung nodules versus benign lung nodules. 
[0072] FIG.4A is a ROC plot showing performance of machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0073] FIG.4B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.4A. [0074] FIG.5A is a ROC plot showing performance of eight machine learning classifiers using a set of 175 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0075] FIG.5B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.5A.
[0076] FIG.6A is a ROC plot showing performance of machine learning classifiers using a set of 62 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0077] FIG.6B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.6A. [0078] FIG.7A is a ROC plot showing performance of machine learning classifiers using a set of 295 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0079] FIG.7B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.7A. [0080] FIG.8A is a ROC plot showing performance of machine learning classifiers using a set of 175 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. [0081] FIG.8B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.8A. [0082] FIG.9A is a plot of the cumulative fraction of lung nodules predicted by a logistic regression classifier using a set of 175 gene features. [0083] FIG.9B is a plot of the cumulative fraction of lung nodules predicted by a gradient boosting classifier using a set of 175 gene features. [0084] FIG.10 illustrates an overview of an example method 1000 for assessing a solid tumor of a subject. [0085] FIG.11 shows a computer system 1101 that is programmed or otherwise configured to implement methods provided herein. [0086] FIG.12 shows the correlation plot of the 8 clinical characteristics features listed in Table 6. 
[0087] FIG.13A-E: FIG.13A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features listed in Table 6, to distinguish malignant lung nodules versus benign lung nodules (in 152 patients). FIG.13B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features (Table 6), to distinguish malignant lung nodules versus benign lung nodules. FIG.13C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.13A. FIG.13D shows feature importance of the 8 clinical characteristics features (Table 6) for the 9 machine learning classifiers. FIG.13E shows feature importance of the 8 clinical characteristics features for all the 9 classifiers. The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM.
[0088] FIG.14A-E: FIG.14A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of 4 clinical characteristics features, NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (Nodule Spiculated), to distinguish malignant lung nodules versus benign lung nodules. FIG.14B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules. FIG.14C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.14A. FIG.14D shows feature importance of the 4 clinical characteristics features for the 9 machine learning classifiers. FIG.14E shows feature importance of the 4 clinical characteristics features for all the 9 classifiers. The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM. [0089] FIG.15A-E: FIG.15A shows ROC plots showing performance of the 9 machine learning classifiers using clinical characteristics data of 9 clinical characteristics features (8 features in Table 6 and cancer history) to distinguish malignant lung nodules versus benign lung nodules. FIG.15B shows Precision/Recall curve of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features, to distinguish malignant lung nodules versus benign lung nodules. FIG. 15C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.15A. FIG. 15D shows feature importance of the 9 clinical characteristics features for the 9 machine learning classifiers. FIG.15E shows feature importance of the 9 clinical characteristics features for all the 9 classifiers. The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM. 
[0090] FIG.16A-D: FIG.16A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 142 gene features (Table 5), and clinical characteristics data of 3 clinical features (NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE), to distinguish malignant lung nodules versus benign lung nodules. FIG.16B shows Precision/Recall curves of the 9 machine learning classifiers using gene expression data of the 142 gene features, and clinical characteristics data of 3 clinical features, to distinguish malignant lung nodules versus benign lung nodules. FIG.16C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A. FIG.16D shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules). The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM. [0091] FIG.17A-E: FIG.17A shows ROC plots showing performance of the 9 machine learning classifiers using measurement data of the 34 predictors (Table 7), to distinguish malignant lung nodules versus benign lung nodules. FIG.17B shows Precision/Recall curves of the 9 machine learning classifiers using measurement data of the 34 predictors, to distinguish malignant lung nodules versus benign lung nodules. FIG.17C shows the tabulated results of the machine learning classifiers LOG and RF corresponding to FIG.17A. FIG.17D shows the tabulated results of the 9 machine learning classifiers
corresponding to FIG.17A, with oversampling correction applied (e.g., 80 samples with benign lung nodules and 80 samples with malignant lung nodules). FIG.17E shows feature importance of the 34 predictors for all the 9 classifiers. The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM. FIG.18A-C: FIG.18A shows ROC plots showing performance of the 9 machine learning classifiers using gene expression data of the 175 gene features (Table 2), and clinical characteristics data of 4 clinical features (NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (Nodule Spiculated)), to distinguish malignant lung nodules versus benign lung nodules. FIG.18B shows Precision/Recall curves of the 9 machine learning classifiers using gene expression data of the 175 gene features, and clinical characteristics data of 4 clinical features, to distinguish malignant lung nodules versus benign lung nodules. FIG.18C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.18A. The 9 machine learning classifiers are LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM. DETAILED DESCRIPTION Definitions [0092] Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: “or” as used throughout is inclusive, as though written “and/or”; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; “exemplary” should be understood as “illustrative” or “exemplifying” and not necessarily as “preferred” over other embodiments. 
[0093] Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description. [0094] Whenever the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “at least,” “greater than” or “greater than or equal to” applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3. [0095] Whenever the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “no more than,” “less than,” or “less than or equal to” applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.
[0096] The terms “subject,” “reference subject,” or “test subject” as used herein, generally refer to a human such as a patient. The subject may be a person (e.g., a patient) with a cancer, a benign solid tumor, or a malignant solid tumor; or a person that has been treated for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is being monitored for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is suspected of having a cancer, a benign solid tumor, or a malignant solid tumor; or a person that does not have or is not suspected of having a cancer, a benign solid tumor, or a malignant solid tumor. [0097] The term “patient,” as used herein, generally refers to a human patient. The patient may be a person with a cancer, a benign solid tumor, or a malignant solid tumor; or a person that has been treated for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is being monitored for a cancer, a benign solid tumor, or a malignant solid tumor; or a person that is suspected of having a cancer, a benign solid tumor, or a malignant solid tumor; or a person that does not have or is not suspected of having a cancer, a benign solid tumor, or a malignant solid tumor. The cancer can be pancreatic cancer, ovarian cancer, or brain cancer, and the solid tumor can be a pancreatic tumor, ovarian tumor, or brain tumor, respectively. [0098] In some embodiments, the term “composite data set” or “composite” refers to or is associated with a data set comprising different data sets. In some embodiments, the composite data set comprises one or more data sets, each from different sources. Methods [0099] In an aspect, the present disclosure provides a method for assessing a solid tumor of a patient. The method can include any one of, any combination of, or all of steps a, b, c, and d. 
Step a can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally, clinical characteristics data of one or more clinical characteristics of the patient. Step b can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor. Step c can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor. Step d can include electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, and the like. [0100] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. In certain
embodiments, the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant. 
In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. 
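Steps a through d of paragraph [0099] can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the logistic model, its coefficients, the 0.5 decision threshold, the gene names `GENE_A` and `GENE_B`, and the report fields are all hypothetical placeholders. In practice the model would be one of the trained classifiers described in the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class LogisticModel:
    """Stand-in for a trained machine-learning model (placeholder coefficients)."""
    weights: dict  # feature name -> coefficient
    bias: float

    def predict_proba(self, features):
        z = self.bias + sum(w * features[name] for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))  # probability of "malignant"


def assess_solid_tumor(data_set, model, threshold=0.5):
    # Step a: the data set holds gene expression measurements of at least two
    # genes and, optionally, clinical characteristics of the patient.
    genes = data_set["gene_expression"]
    if len(genes) < 2:
        raise ValueError("at least two gene expression measurements are required")
    features = {**genes, **data_set.get("clinical", {})}

    # Step b: provide the data set as input to the model.
    # Step c: receive the inference as the model output.
    probability = model.predict_proba(features)
    label = "malignant" if probability >= threshold else "benign"

    # Step d: output a report (here, a plain dictionary) classifying the tumor.
    return {
        "patient_id": data_set["patient_id"],
        "probability_malignant": round(probability, 3),
        "classification": label,
    }


# Hypothetical model and patient data.
model = LogisticModel(weights={"GENE_A": 2.0, "GENE_B": -1.0}, bias=0.0)
report = assess_solid_tumor(
    {"patient_id": "P1", "gene_expression": {"GENE_A": 1.0, "GENE_B": 0.5}},
    model,
)
```

The report is returned as a dictionary so that a caller can serialize or render it; the disclosure only requires that the report be output electronically.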
[0101] In some embodiments, the solid tumor is a lung tumor, and the data set of step a, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 4. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 1. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 2. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 3. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 5. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 7. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 8. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the at least two lung disease-associated genes, e.g. 
as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. 
In some embodiments, the at least two lung disease-associated genes of step a, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the at least two lung disease-associated genes of step a, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. These genes and those described herein are known to those of skill in the art, and described in the literature. Table A provides examples of Gene ID numbers for genes listed herein in the Tables, including Tables 7 and 8, as described in OMIM® - Online Mendelian Inheritance in Man (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD) and in the National Center for Biotechnology Information gene database (NCBI, U.S. National Library of Medicine 8600 Rockville Pike, Bethesda MD, 20894 USA), each incorporated by reference herein in their entirety.
[0102] Table A. Selected Genes Example Gene ID Numbers
Predictor    OMIM No.    Entrez Gene ID (NCBI)
SLC35B3      610845      51000
TDRD9        617963      122402
[0103] In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics includes 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics includes size of the tumor (e.g. lung nodule). In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the patient includes age of the patient. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics includes presence of the nodule in the lung upper lobe. In some embodiments, the solid
tumor is a lung tumor, and the one or more clinical characteristics includes size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the at least two lung disease-associated genes of step a, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the at least two lung disease-associated genes of step a, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the solid tumor is a lung tumor, and the data set of step a, contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the data set of step a, contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
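As a minimal sketch of how the data set of step a might be laid out, the following Python snippet concatenates gene expression measurements with the three clinical characteristics named above into a single feature vector. The gene symbols are taken from the Table 7 gene list; the expression values, the field names (`nodule_size_mm`, `age_years`, `upper_lobe`), the units, and the helper `build_feature_vector` are hypothetical and are not specified in this disclosure.

```python
from typing import Dict, List, NamedTuple, Sequence


class ClinicalCharacteristics(NamedTuple):
    # Field names and units are assumptions made for illustration only.
    nodule_size_mm: float  # size of the nodule
    age_years: float       # age of the patient
    upper_lobe: bool       # presence of the nodule in the lung upper lobe


def build_feature_vector(
    expression: Dict[str, float],
    clinical: ClinicalCharacteristics,
    gene_order: Sequence[str],
) -> List[float]:
    """Concatenate expression values (in a fixed gene order) with clinical values."""
    missing = [g for g in gene_order if g not in expression]
    if missing:
        raise KeyError(f"missing expression measurements for: {missing}")
    gene_values = [float(expression[g]) for g in gene_order]
    clinical_values = [
        float(clinical.nodule_size_mm),
        float(clinical.age_years),
        1.0 if clinical.upper_lobe else 0.0,  # encode the boolean as 0/1
    ]
    return gene_values + clinical_values


# Illustrative subset of the Table 7 gene symbols; all values are made up.
GENE_ORDER = ["BCAT1", "CRCP", "COA4"]
sample_expression = {"BCAT1": 5.2, "CRCP": 3.1, "COA4": 7.4}
sample_clinical = ClinicalCharacteristics(nodule_size_mm=12.0, age_years=64.0, upper_lobe=True)

features = build_feature_vector(sample_expression, sample_clinical, GENE_ORDER)
print(features)
```

Fixing the gene order up front keeps every patient's vector aligned column by column, which is what a downstream model would need regardless of how many of the listed genes are used.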
[0104] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
[0105] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than
about 99%. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, 
about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0106] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least
about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, 
about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0107] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%,
at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 
%, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0108] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a positive predictive value of at least about 50%, at
least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 
99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0109] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about
99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, 
about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0110] The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about
0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8 to about 1. The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about 0.85 to about 1, about 0.9 to about 0.92, about 0.9 to about 0.94, about 0.9 to about 0.95, about 0.9 to about 0.96, about 0.9 to about 0.97, about 0.9 to about 0.98, about 0.9 to about 0.99, about 0.9 to about 0.995, about 0.9 to about 1, about 0.92 to about 0.94, about 0.92 to about 0.95, about 0.92 to about 0.96, about 0.92 to about 0.97, about 0.92 to about 0.98, about 0.92 to about 0.99, about 0.92 to about 0.995, about 0.92 to about 1, about 0.94 to about 0.95, about 0.94 to about 0.96, about 0.94 to about 0.97, about 0.94 to about 0.98, about 0.94 to about 0.99, about 0.94 to about 0.995, about 0.94 to about 1, about 0.95 to about 0.96, about 0.95 to about 0.97, about 0.95 to about 0.98, about 0.95 to about 0.99, about 0.95 to about 0.995, about 0.95 to about 1, 
about 0.96 to about 0.97, about 0.96 to about 0.98, about 0.96 to about 0.99, about 0.96 to about 0.995, about 0.96 to about 1, about 0.97 to about 0.98, about 0.97 to about 0.99, about 0.97 to about 0.995, about 0.97 to about 1, about 0.98 to about 0.99, about 0.98 to about 0.995, about 0.98 to about 1, about 0.99 to about 0.995, about 0.99 to about 1, or about 0.995 to about 1. The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995. The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
[0111] The inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the solid tumor is malignant. Higher confidence values may be correlated with a higher likelihood that the solid tumor is malignant. A malignant tumor may be characterized by having the ability to metastasize or grow invasively, which may be in contrast to a benign tumor.
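The performance measures recited above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) follow standard definitions. The Python sketch below computes them from known labels and model confidence values, treating malignant as the positive class. The 0.5 decision threshold, the function names `confusion_metrics` and `roc_auc`, and the toy labels and confidence values are assumptions made for illustration and are not taken from this disclosure.

```python
from typing import Dict, List


def confusion_metrics(y_true: List[int], scores: List[float], threshold: float = 0.5) -> Dict[str, float]:
    """Binary metrics with malignant = 1 (positive) and benign = 0 (negative)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (malignant, benign) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = malignant, 0 = benign; confidence values are made up.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
confidences = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]

metrics = confusion_metrics(labels, confidences)
auc = roc_auc(labels, confidences)
print(metrics, auc)
```

Unlike the other five measures, the AUC does not depend on the chosen threshold, since it only uses how the confidence values rank malignant cases relative to benign ones.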
[0112] In some embodiments, the patient has a cancer. In some embodiments, the patient does not have cancer. In some embodiments, the patient is suspected of having a cancer. In some embodiments, the patient is at an elevated risk of having a cancer. In some embodiments, the patient is asymptomatic for a cancer. Cancer can be lung cancer, pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. In some embodiments, the patient has pancreatic cancer. In some embodiments, the patient does not have pancreatic cancer. In some embodiments, the patient is suspected of having pancreatic cancer. In some embodiments, the patient is at an elevated risk of having a pancreatic cancer. In some embodiments, the patient is asymptomatic for pancreatic cancer. In some embodiments, the patient has ovarian cancer. In some embodiments, the patient does not have ovarian cancer. In some embodiments, the patient is suspected of having ovarian cancer. In some embodiments, the patient is at an elevated risk of having ovarian cancer. In some embodiments, the patient is asymptomatic for ovarian cancer. In some embodiments, the patient has kidney cancer. In some embodiments, the patient does not have kidney cancer. In some embodiments, the patient is suspected of having kidney cancer. In some embodiments, the patient is at an elevated risk of having a kidney cancer. In some embodiments, the patient is asymptomatic for kidney cancer. In some embodiments, the patient has brain cancer. In some embodiments, the patient does not have brain cancer. In some embodiments, the patient is suspected of having brain cancer. In some embodiments, the patient is at an elevated risk of having brain cancer. In some embodiments, the patient is asymptomatic for brain cancer. In some embodiments, the patient has a lung cancer. In some embodiments, the patient does not have lung cancer. In some embodiments, the patient is suspected of having a lung cancer. 
In some embodiments, the patient is at an elevated risk of having a lung cancer. In some embodiments, the patient is asymptomatic for a lung cancer. [0113] In certain embodiments, the method includes optionally performing biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, a biopsy is performed. In some embodiments, a biopsy is not performed. The decision to perform a biopsy may be made by one of skill in the art, based on knowledge and experience, in view of the classification of the lung nodule of the patient as the malignant lung nodule or the benign lung nodule. The decision to perform a biopsy may depend in part on the confidence value of the inference. In certain embodiments, biopsy of the solid tumor of the patient is not performed. [0114] In some embodiments, the method further contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In some embodiments, the method contains administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient. The treatment can include one
or more treatments of cancer. The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the treatment can be treatment for the lung, pancreatic, ovarian, kidney, or brain cancer, respectively. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of pancreatic cancer to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of ovarian cancer to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of kidney cancer to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of brain cancer to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor. 
In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of lung cancer to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor. The treatment can include one or more treatments of cancer. In some embodiments, the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. [0115] The machine learning model of step b can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant. The machine learning model, e.g., of step b, can generate the inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor by comparing the data set to a reference data set. The machine learning model can be trained using the reference data set according to the methods described herein. In some embodiments, the reference data set can contain gene expression measurements, in a plurality of reference biological samples from a plurality of reference subjects having a solid tumor, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, data regarding whether the solid tumors of the reference subjects are benign or malignant, and optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor. In some embodiments, the reference data set contains a plurality of individual reference data sets. 
A respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of a reference biological sample from a reference
subject having a reference solid tumor of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject. The plurality of individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different individual reference data sets are obtained from different reference subjects. In some embodiments, each of the individual reference data sets contains i) gene expression measurements of a reference biological sample from one reference subject of the at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects. In certain embodiments, oversampling or undersampling correction can be made during training of the machine learning model. For example, if a reference data set includes a greater number of samples identified as benign and a relatively smaller number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples. The reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors. 
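The oversampling correction described above can be sketched as follows. This is a minimal illustration only, assuming plain random duplication of minority-class samples; the disclosure does not name a particular oversampling technique, and the function name `oversample_minority`, the toy expression values, and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes are equal in size.

    Assumption: plain random oversampling with replacement.
    """
    rng = random.Random(seed)
    # Group the samples by class label (e.g. "benign" / "malignant").
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    # Every class is brought up to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical reference data: 6 benign and 2 malignant expression profiles.
reference_samples = [[0.9, 2.1], [1.0, 2.0], [1.1, 1.9], [0.8, 2.2],
                     [1.2, 2.0], [1.0, 1.8], [3.0, 2.6], [3.2, 2.4]]
reference_labels = ["benign"] * 6 + ["malignant"] * 2

balanced_samples, balanced_labels = oversample_minority(reference_samples, reference_labels)
print(Counter(balanced_labels))
```

In practice, oversampling is usually applied only to the training portion of the reference data set, so that duplicated samples do not also appear in the portion used for validation.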
In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. 
In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. 
In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In some embodiments, the genes of the data set and genes of the reference data set can at least partially overlap. In some embodiments, clinical characteristics of the data set and clinical characteristics of the reference data set can at least partially overlap. The gene set capable of classifying a solid tumor, e.g., a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant can be obtained or determined according to the methods (e.g., the method of steps a’, b’, c’ and/or d’) described herein. [0116] In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. 
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. 
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the solid tumor is a lung tumor, the at least two genes of the reference data set comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the at least two genes of the reference data set consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. 
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set
include the age of the patient. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the presence of the nodule in the lung upper lobe. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule, the age of the patient, the presence of the nodule in the lung upper lobe, or any combination thereof. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include the size of the nodule, the age of the patient, the presence of the nodule in the lung upper lobe, or any combination thereof. The genes of the data set and genes of the reference data set can at least partially overlap and/or the optional clinical characteristics of the data set and optional clinical characteristics of the reference data set can at least partially overlap. 
[0117] The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof. The reference subjects can be human. [0118] Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope). [0119] In some embodiments, the trained machine learning model, e.g., of step b, is a supervised machine learning algorithm or an unsupervised machine learning algorithm. In some embodiments, the trained machine learning model is trained using linear regression, logistic regression (LOG), Ridge
regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the trained machine learning model is trained using LOG. In some embodiments, the trained machine learning model is trained using Ridge regression. In some embodiments, the trained machine learning model is trained using Lasso regression. In some embodiments, the trained machine learning model is trained using GLM. In some embodiments, the trained machine learning model is trained using kNN. In some embodiments, the trained machine learning model is trained using SVM. In some embodiments, the trained machine learning model is trained using GBM. In some embodiments, the trained machine learning model is trained using RF. In some embodiments, the trained machine learning model is trained using NB. In some embodiments, the trained machine learning model is trained using EN regression. In some embodiments, the trained machine learning model is trained using a neural network. In some embodiments, the trained machine learning model is trained using a deep learning algorithm. In some embodiments, the trained machine learning model is trained using LDA. In some embodiments, the trained machine learning model is trained using DTREE. In some embodiments, the trained machine learning model is trained using ADB. [0120] In some embodiments, the method comprises determining a likelihood of the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. 
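As an illustration of how a likelihood can accompany a classification, the sketch below applies a logistic function to a weighted sum of gene expression values, which is the form a logistic regression (LOG) model takes at inference time. It is a hedged example rather than the disclosed model: the gene names `GENE_A` and `GENE_D`, the coefficients, the intercept, and the 0.5 decision threshold are all hypothetical placeholders.

```python
import math


def malignancy_probability(expression, weights, intercept):
    """Logistic-regression style probability that a tumor is malignant."""
    score = intercept + sum(weights[gene] * expression[gene] for gene in weights)
    return 1.0 / (1.0 + math.exp(-score))


def classify_with_likelihood(expression, weights, intercept, threshold=0.5):
    """Return the predicted class and the likelihood assigned to that class."""
    p_malignant = malignancy_probability(expression, weights, intercept)
    if p_malignant >= threshold:
        return "malignant", p_malignant
    return "benign", 1.0 - p_malignant


# Hypothetical coefficients standing in for a trained model.
weights = {"GENE_A": 1.5, "GENE_D": 0.8}
intercept = -4.0

label_1, likelihood_1 = classify_with_likelihood(
    {"GENE_A": 3.0, "GENE_D": 2.5}, weights, intercept)
label_2, likelihood_2 = classify_with_likelihood(
    {"GENE_A": 1.0, "GENE_D": 2.0}, weights, intercept)
print(label_1, round(likelihood_1, 3))
print(label_2, round(likelihood_2, 3))
```

Other algorithms listed above expose a comparable class probability or score, so the same reporting pattern (predicted class plus its likelihood) could be used with them.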
In some embodiments, the likelihood is about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 100%. In some embodiments, the likelihood is at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0121] In some embodiments, the method further comprises monitoring the solid tumor of the patient, wherein the monitoring comprises assessing the solid tumor of the patient at a plurality of time points. In some embodiments, a difference in the assessment of the solid tumor of the patient among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the solid tumor of the patient, (ii) a prognosis of the solid tumor of the patient, and (iii) an efficacy or non-efficacy of a course of treatment for treating the solid tumor of the patient. In some embodiments, the plurality of time points comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 different time points. [0122] In an aspect, the present disclosure provides a method for determining a gene set capable of classifying a solid tumor as benign or malignant. Gene expression measurements of one or more genes
of the gene set of a biological sample (e.g., blood) from a patient can be used to classify a solid tumor of the patient as benign or malignant without performing a biopsy of the solid tumor. In some embodiments, a biopsy of the solid tumor can be performed to confirm and/or follow up the classification results obtained by using the gene expression measurement data. In some embodiments, a biopsy of the solid tumor is not performed. The method can include any one of, any combination of, or all of steps a’, b’, c’ and d’. In step a’, a reference data set can be obtained and/or provided. [0123] The reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. The reference data set can contain a plurality of individual reference data sets. A respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference tumor is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the reference subject. The plurality of individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different individual reference data sets can be obtained from different reference subjects. 
In some embodiments, each of the individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor. In step b’, a machine learning model can be trained using the reference data set to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics. In some embodiments, the machine learning model can be trained using a training data set containing a first portion of the reference data set, and a validation data set containing a second portion of the reference data set. In certain embodiments, oversampling or undersampling correction can be made during training of the machine learning model. For example, if a data set includes a greater number of samples identified as benign and a relatively smaller number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples. In step c’, feature importance values of the plurality of genes can be determined. In step d’, the gene set can be selected. The gene set can be selected as predictors that are used to train the machine learning model. The gene set may be selected based at least in part on feature importance values. 
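Steps c’ and d’ can be sketched as a rank-and-select procedure. The example below is a simplified illustration under stated assumptions: it does not train a machine learning model as in step b’, and instead substitutes a simple univariate score (absolute difference of class means divided by the average within-class standard deviation) for model-derived feature importance values. The gene names, expression values, and function names are hypothetical.

```python
from statistics import mean, stdev


def importance_score(benign_values, malignant_values):
    """Stand-in importance: absolute difference of class means over average spread."""
    diff = abs(mean(malignant_values) - mean(benign_values))
    spread = (stdev(benign_values) + stdev(malignant_values)) / 2
    return diff / (spread + 1e-9)  # small constant avoids division by zero


def select_gene_set(genes, expression_rows, labels, top_k):
    """Score every gene (step c') and keep the top_k highest-scoring genes (step d')."""
    scores = {}
    for column, gene in enumerate(genes):
        benign = [row[column] for row, lab in zip(expression_rows, labels)
                  if lab == "benign"]
        malignant = [row[column] for row, lab in zip(expression_rows, labels)
                     if lab == "malignant"]
        scores[gene] = importance_score(benign, malignant)
    ranked = sorted(genes, key=lambda gene: scores[gene], reverse=True)
    return ranked[:top_k], scores


# Hypothetical reference data set: one row per reference subject.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
expression_rows = [
    [1.0, 5.0, 3.0, 2.0],
    [1.2, 5.1, 2.0, 2.1],
    [0.8, 4.9, 4.0, 1.9],
    [3.0, 5.0, 3.1, 2.6],
    [3.2, 5.2, 2.1, 2.4],
    [2.8, 4.8, 3.9, 2.5],
]
labels = ["benign"] * 3 + ["malignant"] * 3

gene_set, scores = select_gene_set(genes, expression_rows, labels, top_k=2)
print(gene_set)
```

With a trained model, the `scores` dictionary would instead hold that model's feature importance values (for example, coefficient magnitudes or tree-based importances), while the ranking and top-k selection would stay the same.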
In some embodiments, the feature importance values of the genes of the gene set, are within top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100,
105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, feature importance values of the plurality of genes. In some embodiments, the feature importance of the genes of the gene set can have an accuracy greater than 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, or 90%. In some embodiments, the feature importance of the genes of the gene set can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, or 90. In certain embodiments, the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the machine learning model include the genes of the gene set. [0124] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. The reference tumor and the solid tumor can be of the same type of tumor; for example, both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 9. 
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from the group of genes listed in Table 9. 
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100,
105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. 
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the reference data set of step a’, include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, the plurality of genes of the reference data set of step a’, include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the reference data set of step a’, include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer. 
In some particular embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid
tumor is an ovarian tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer. In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to brain cancer. 
In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression, with correlation coefficients above a threshold (e.g., in non-limiting aspects, > 0.7 to > 0.9), are removed from the reference data set. Collinear gene expression can be measured by any suitable technique, e.g., the Pearson correlation coefficient. While determining feature importance values for the plurality of genes has been described, this is merely a non-limiting illustrative example of a technique that can be practiced. In various embodiments, one or more feature selection techniques are used to determine the gene set that can classify a solid tumor as benign or malignant. Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof. In some of these cases, feature importance values need not be calculated for each of the genes of the plurality of genes. The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. 
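The removal of collinear genes described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the gene names, the expression values, the greedy keep-first ordering, and the use of the absolute value of the Pearson coefficient are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene is treated as uncorrelated (assumption)
    return cov / sqrt(var_x * var_y)


def drop_collinear(expression, threshold=0.9):
    """Keep a gene only if |r| with every already-kept gene is <= threshold.

    `expression` maps a gene name to its expression values across samples.
    The greedy, first-come ordering is an illustrative assumption.
    """
    kept = []
    for gene, values in expression.items():
        if all(abs(pearson(values, expression[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical expression values for three placeholder genes over five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly collinear with GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly correlated with GENE_A
}
kept = drop_collinear(expression, threshold=0.9)
print(kept)
```

A lower threshold such as 0.7 removes more genes. Which member of a collinear pair is retained depends on the iteration order, which a real pipeline would need to fix deliberately.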
In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative
thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof. The reference subjects can be human. [0125] The machine learning model, e.g. of step b’, can be trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm. In some embodiments, the machine learning model, e.g. of step b’, is trained using linear regression, logistic regression, Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the machine learning model is trained using logistic regression. In some embodiments, the machine learning model is trained using Ridge regression. In some embodiments, the machine learning model is trained using Lasso regression. In some embodiments, the machine learning model is trained using GLM. In some embodiments, the machine learning model is trained using kNN. In some embodiments, the machine learning model is trained using SVM. In some embodiments, the machine learning model is trained using GBM. In some embodiments, the machine learning model is trained using RF. In some embodiments, the machine learning model is trained using NB. In some embodiments, the machine learning model is trained using EN regression. In some embodiments, the machine learning model is trained using a neural network. In some embodiments, the machine learning model is trained using a deep learning algorithm. 
In some embodiments, the machine learning model is trained using LDA. In some embodiments, the machine learning model is trained using DTREE. In some embodiments, the machine learning model is trained using ADB. [0126] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about
99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. 
[0127] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 
97 %, about 95 % to about
98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0128] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 100 %. 
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The
gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0129] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 100 %. 
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %,
about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0130] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 100 %. 
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
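For orientation, the five performance measures recited in paragraphs [0126] to [0130] (accuracy, sensitivity, specificity, positive predictive value, and negative predictive value) can all be computed from the counts of a two-class confusion matrix. The sketch below uses their standard definitions. Treating "malignant" as the positive class and the example labels are assumptions, not data from the disclosure.

```python
def diagnostic_metrics(y_true, y_pred, positive="malignant"):
    """Compute standard binary classification metrics from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # An undefined ratio (empty denominator) is reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical labels: 4 malignant and 6 benign reference tumors.
y_true = ["malignant"] * 4 + ["benign"] * 6
y_pred = ["malignant", "malignant", "malignant", "benign",
          "benign", "benign", "benign", "benign", "benign", "malignant"]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity and specificity depend only on the classifier, whereas the two predictive values also depend on the proportion of malignant tumors in the evaluated population, so the same model can show different predictive values in cohorts with different prevalence.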
[0131] In another aspect, the present disclosure provides a method for developing a trained machine learning model capable of classifying a solid tumor of a patient, as benign or malignant. The method can include any one of, any combination of, or all of steps a”, b”, c”, d” and e”. Step a”, can include obtaining and/or providing a first reference data set. In some embodiments, the first reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. The first reference data set can contain a plurality of first individual reference data sets. A respective first individual reference data set of the plurality of first individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of the one or more clinical characteristics of the reference subject. The plurality of first individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different first individual reference data sets are obtained from different reference subjects. 
In some embodiments, each of the first individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the reference solid tumor of the one reference subject is benign or malignant and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different first individual reference data sets are obtained from different reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor. In step b”, a first machine learning model can be trained using the first reference data set to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics. The first machine learning model can be trained to infer whether the solid tumor is benign or malignant, based at least in part on the measurement data of the plurality of genes, and optionally the clinical characteristics data of the one or more clinical characteristics. In some embodiments, the first machine learning model can be trained using a training data set containing a first portion of the first reference data set, and a validation data set containing a second portion of the first reference data set. In step c”, feature importance values of one or more predictors of the first machine learning model can be determined. 
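The division of the first reference data set into a training portion and a validation portion can be sketched as follows. The disclosure does not specify how the portions are chosen; the stratified split below, the 25 % validation fraction, and the fixed random seed are assumptions made only so that both benign and malignant reference subjects appear in each portion.

```python
import random


def stratified_split(labels, validation_fraction=0.25, seed=0):
    """Return (train_indices, validation_indices), split separately per class."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # At least one subject of each class goes to validation (assumption).
        n_val = max(1, round(len(idx) * validation_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical cohort: 8 benign and 4 malignant reference subjects.
labels = ["benign"] * 8 + ["malignant"] * 4
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting by subject index rather than by sample keeps each reference subject in exactly one portion, which matches the statement that different individual reference data sets come from different reference subjects.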
In step d”, A predictors of the first machine learning model can be selected based at least in part on the feature importance values, where A can be an integer from 3 to 2000, such as 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, or 2000, or any integer value therein. In certain embodiments, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model can be selected. In some embodiments, the A predictors can have the top A feature importance values; for example, in a non-limiting aspect, A can be 10, and the 10 predictors having the 10 highest feature importance values can be selected. In some embodiments, the feature importance of the A predictors can have an accuracy greater than 30 %, 35 %, 40 %, 45 %, 50 %, 55 %, 60 %, 65 %, 70 %, 75 %, 80 % or 90%. In some embodiments, the feature importance of the A predictors can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80 or 90. In some embodiments, the A predictors form top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model. The A predictors can include one or more genes, and/or optionally one or more clinical characteristics. While determining feature importance values for the one or more predictors has been described, this is merely a non-limiting illustrative example of a technique that can be practiced. In various embodiments, one or more feature selection techniques are used to determine the A predictors. 
Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof. In some of these cases, in step c”, feature importance values need not be calculated for each of the predictors of first machine learning model. Step e”, can include training a second machine learning model based at least in part on a second reference data set to obtain the trained machine learning model. The second reference data set can contain i) measurement data of the A predictors of the reference subjects, and ii) data regarding whether the solid tumors of the reference subjects are benign or malignant. The second reference data set can contain a plurality of second individual reference data sets. A respective second individual reference data set of the plurality of second individual reference data sets can include i) measurement data of the A predictors of a reference subject, and ii) data regarding whether the solid tumor of the reference subject is benign or malignant. The plurality of second individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different second individual reference data sets are obtained from different reference subjects. In some embodiments, each of the second individual reference data set contains i) measurement data of the A predictors of one reference subject, and ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, wherein different second individual reference data sets are obtained from different reference subjects. 
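The overall flow of steps c” through e” (score the predictors, keep the top A, then train a second model on only those A predictors) can be sketched in a few lines. This is a toy stand-in, not the disclosed method: the importance score is a simple standardized difference of class means rather than feature importance values from a trained first model, the second model is a nearest-centroid classifier, and the gene names and expression values are hypothetical.

```python
from statistics import mean, pstdev


def importance_scores(rows, labels):
    """Stand-in importance: |mean(malignant) - mean(benign)| / overall std."""
    scores = {}
    for name in rows[0]:
        mal = [r[name] for r, y in zip(rows, labels) if y == "malignant"]
        ben = [r[name] for r, y in zip(rows, labels) if y == "benign"]
        spread = pstdev(mal + ben) or 1.0  # guard against constant predictors
        scores[name] = abs(mean(mal) - mean(ben)) / spread
    return scores


def select_top(scores, a):
    """Select the A predictors with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:a]


def train_centroids(rows, labels, predictors):
    """Second model (stand-in): one centroid per class over the A predictors."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [mean(r[p] for r in members) for p in predictors]
    return centroids


def predict(centroids, predictors, row):
    """Assign the class whose centroid is nearest in squared distance."""
    def dist(center):
        return sum((row[p] - v) ** 2 for p, v in zip(predictors, center))
    return min(centroids, key=lambda cls: dist(centroids[cls]))


# Hypothetical first reference data set: three placeholder genes, four subjects.
rows = [
    {"GENE_A": 9.0, "GENE_B": 3.0, "GENE_C": 5.0},
    {"GENE_A": 8.0, "GENE_B": 4.0, "GENE_C": 4.0},
    {"GENE_A": 2.0, "GENE_B": 7.0, "GENE_C": 5.0},
    {"GENE_A": 1.0, "GENE_B": 8.0, "GENE_C": 4.0},
]
labels = ["malignant", "malignant", "benign", "benign"]

top = select_top(importance_scores(rows, labels), a=2)
model = train_centroids(rows, labels, top)
call = predict(model, top, {"GENE_A": 7.5, "GENE_B": 3.0, "GENE_C": 9.0})
print(top, call)
```

Because the second model sees only the selected predictors, the uninformative placeholder gene has no influence on the final call, which is the practical point of retraining on a reduced predictor set.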
Measurement data of the A predictors can include, gene expression measurements in the reference sample of the one or more genes features of the A predictors, and/or optionally clinical characteristics data of one or more clinical characteristics features of the A predictors. The trained machine learning model can infer whether a solid tumor is benign or
malignant, based at least in part on measurement data of the A predictors. In some embodiments, the one or more gene features of the A predictors can form the gene set capable of classifying a solid tumor as benign or malignant. In certain embodiments, oversampling or undersampling correction can be made during training of the first and/or second machine learning model. [0132] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. The reference tumor and the solid tumor can be of the same type of tumor, such as both can be lung tumor, both can be pancreatic tumor, both can be ovarian tumor, both can be kidney tumor, or both can be brain tumor. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in Table 9. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from the group of genes listed in Table 9. 
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer. 
In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer. In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer. 
In certain embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression, with correlation coefficients above a threshold (e.g., in non-limiting aspects, > 0.7 to > 0.9), were removed from the reference data set. Collinear gene expression can be measured by any suitable technique, e.g., the Pearson correlation coefficient. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1, as predictors. 
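The collinearity filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used in the disclosure: the greedy keep-first ordering, the use of the absolute value of the coefficient, the 0.8 threshold (chosen from within the 0.7 to 0.9 range mentioned above), the function names, and the toy gene names and values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant expression profile is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_collinear_genes(expression, threshold=0.8):
    """Greedy filter (an assumed strategy): keep a gene only if its absolute
    correlation with every previously kept gene does not exceed the threshold."""
    kept = []
    for gene, values in expression.items():
        if all(abs(pearson(values, expression[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical expression values for three placeholder genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly proportional to GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly correlated with GENE_A
}

kept_genes = drop_collinear_genes(expression, threshold=0.8)
print(kept_genes)
```

In this toy input, GENE_B is discarded because it tracks GENE_A almost exactly, while GENE_C is retained. Which member of a collinear pair to keep is a design choice the text above does not specify; keeping the first one encountered is only one option.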
In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group listed in Table 7, as predictors. 
In some embodiments, the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include i) at least 2 genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33 or 34 predictors selected from the group listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors comprise the 34 predictors listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors consist of the 34 predictors listed in Table 7.
[0133] The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. 
In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof.
Attorney Docket No.225234-718601/PCT [0134] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, 
about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least
Attorney Docket No.225234-718601/PCT about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 
99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0136] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is
Attorney Docket No.225234-718601/PCT benign or malignant with a specificity of about 80 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0137] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of about 80 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about
Attorney Docket No.225234-718601/PCT 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The trained machine learning model, e.g. 
obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0138] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of about 80 % to about 100 %. The trained machine learning model, e.g. 
obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to
Attorney Docket No.225234-718601/PCT about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The trained machine learning model, e.g. 
obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0139] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a ROC curve with an AUC of about 0.8 to about 1. The trained machine learning model, e.g. 
obtained in step e”, can infer whether a solid tumor is benign or malignant with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about 0.85 to about 1, about 0.9 to about 0.92, about 0.9 to about 0.94, about 0.9 to about 0.95, about 0.9 to about 0.96, about 0.9 to about 0.97,
Attorney Docket No.225234-718601/PCT about 0.9 to about 0.98, about 0.9 to about 0.99, about 0.9 to about 0.995, about 0.9 to about 1, about 0.92 to about 0.94, about 0.92 to about 0.95, about 0.92 to about 0.96, about 0.92 to about 0.97, about 0.92 to about 0.98, about 0.92 to about 0.99, about 0.92 to about 0.995, about 0.92 to about 1, about 0.94 to about 0.95, about 0.94 to about 0.96, about 0.94 to about 0.97, about 0.94 to about 0.98, about 0.94 to about 0.99, about 0.94 to about 0.995, about 0.94 to about 1, about 0.95 to about 0.96, about 0.95 to about 0.97, about 0.95 to about 0.98, about 0.95 to about 0.99, about 0.95 to about 0.995, about 0.95 to about 1, about 0.96 to about 0.97, about 0.96 to about 0.98, about 0.96 to about 0.99, about 0.96 to about 0.995, about 0.96 to about 1, about 0.97 to about 0.98, about 0.97 to about 0.99, about 0.97 to about 0.995, about 0.97 to about 1, about 0.98 to about 0.99, about 0.98 to about 0.995, about 0.98 to about 1, about 0.99 to about 0.995, about 0.99 to about 1, or about 0.995 to about 1. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. 
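The performance measures recited above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) can be computed from labeled predictions as sketched below. This is an illustrative sketch only: the labels, scores, the 0.5 decision threshold, and all function names are hypothetical and are not taken from the disclosure. Malignant is encoded as 1 and benign as 0, and the AUC is computed with the standard pairwise-ranking formulation, in which tied scores count one half.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true, y_pred):
    """Threshold-dependent metrics; label 1 = malignant, label 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant called malignant
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign called benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign called malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant called benign
    return {
        "accuracy": safe_ratio(tp + tn, len(pairs)),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (malignant, benign) pairs in which the malignant
    case receives the higher score; ties contribute one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return safe_ratio(wins, len(pos) * len(neg))


# Hypothetical ground truth and model scores for ten tumors.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, sensitivity, specificity, PPV, and NPV all depend on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds. PPV and NPV additionally depend on the prevalence of malignancy in the evaluated cohort.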
[0140] Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
[0141] In some embodiments, the trained machine learning model is a supervised machine learning algorithm or an unsupervised machine learning algorithm. In some embodiments, the first and/or second machine-learning model is independently trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the first and/or second machine-learning model is independently trained using LOG. In some embodiments, the first and/or second machine-learning model is independently trained using Ridge regression. In some embodiments, the first and/or second machine-learning model is independently trained using Lasso regression. In some embodiments, the first and/or second machine-learning model is independently trained using GLM. In some embodiments, the first and/or second machine-learning model is independently trained using kNN. In some embodiments, the first and/or second machine-learning model is independently trained using SVM. In some embodiments, the first and/or second machine-learning model is independently trained using GBM. In some embodiments, the first and/or second
machine-learning model is independently trained using RF. In some embodiments, the first and/or second machine-learning model is independently trained using NB. In some embodiments, the first and/or second machine-learning model is independently trained using EN regression. In some embodiments, the first and/or second machine-learning model is independently trained using a neural network. In some embodiments, the first and/or second machine-learning model is independently trained using a deep learning algorithm. In some embodiments, the first and/or second machine-learning model is independently trained using LDA. In some embodiments, the first and/or second machine-learning model is independently trained using DTREE. In some embodiments, the first and/or second machine-learning model is independently trained using ADB.
[0142] In an aspect, the present disclosure provides a method for treating cancer in a patient. In some embodiments, the patient has a solid tumor. The method can include any one of, any combination of, or all of steps a”’, b”’, c”’ and d”’. Step a”’, can include obtaining a data set containing i) gene expression measurements of a biological sample from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The gene expression measurements can be obtained by assaying the biological sample. Step b”’, can include providing the data set as input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer. In some embodiments, the inference indicates whether the data set is indicative of the solid tumor of the patient being malignant or benign. 
Step c”’, can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer. In some embodiments, the inference received as an output indicates whether the solid tumor of the patient is malignant or benign. Step d”’, can include administering a treatment based on the determination that the patient has cancer. In some embodiments, the treatment is administered based on the patient’s solid tumor being classified as malignant.
[0143] The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor, respectively. In certain embodiments, the cancer is lung cancer, and the solid tumor is a lung tumor. In certain embodiments, the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor. In certain embodiments, the cancer is ovarian cancer, and the solid tumor is an ovarian tumor. In certain embodiments, the cancer is kidney cancer, and the solid tumor is a kidney tumor. In certain embodiments, the cancer is brain cancer, and the solid tumor is a brain tumor. In some embodiments, the cancer is lung cancer, the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the optional one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, the data set of step a”’, contains i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the
cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step a”’, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8,
RNF114, and DCTN4. In some embodiments, the cancer is lung cancer, the one or more clinical characteristics of the data set of step a”’, include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a”’, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step a”’, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”’ comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. 
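The data set of step a”’ for the lung cancer embodiment can be pictured with a minimal Python sketch. It assumes that the 31 gene symbols recited above (BCAT1 through MKRN3) are the gene set in use, and that expression values arrive as a mapping from gene symbol to a numeric measurement. The function name `build_data_set`, the clinical field names, and their units are hypothetical and are not specified by the disclosure.

```python
# The 31 gene symbols recited above for the lung cancer embodiment.
LUNG_GENES = [
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C",
    "MGST2", "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2",
    "C6orf120", "TMEM8A", "ASAP1-IT2", "C15orf54", "CD101", "FNBP1",
    "TECR", "PROK2", "SLC35B3", "TDRD9", "CLHC1", "LPL", "IFITM3",
    "OGFOD3", "EIF2B3", "TMEM65", "MKRN3",
]

# Hypothetical field names for the three clinical characteristics named above.
CLINICAL_FIELDS = ["nodule_size", "patient_age", "upper_lobe"]


def build_data_set(expression, clinical=None):
    """Assemble a step a''' data set from raw measurements.

    expression: dict mapping gene symbol -> expression measurement
    clinical:   optional dict of clinical characteristics
    """
    # Keep only measurements for genes in the gene set, in a fixed order.
    genes = {g: float(expression[g]) for g in LUNG_GENES if g in expression}
    if len(genes) < 2:
        raise ValueError("at least two genes from the gene set are required")
    data_set = {"genes": genes}
    if clinical:
        # Clinical characteristics are optional; unknown keys are ignored.
        data_set["clinical"] = {
            k: clinical[k] for k in CLINICAL_FIELDS if k in clinical
        }
    return data_set


# Hypothetical measurements, for illustration only.
example = build_data_set(
    {"BCAT1": 7.2, "LPL": 3.1, "NOT_IN_SET": 9.9},
    {"nodule_size": 12.0, "patient_age": 64, "upper_lobe": True},
)
```

Genes outside the gene set are dropped, and the check for at least two genes mirrors the minimum recited for step a”’.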
In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step a”’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”’ consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
[0144] In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. 
In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains gene expression measurements of at least two genes
selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. 
[0145] The inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the tumor is malignant, where higher confidence values may be correlated with a higher likelihood that the tumor is malignant. A malignant tumor may be characterized by the ability to metastasize or grow invasively, which may be in contrast to benign nodules.
[0146] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
[0147] In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the
solid tumor of the patient as the malignant solid tumor. The decision to perform a biopsy may depend on the confidence value of the inference. In certain embodiments, a biopsy of the solid tumor of the patient is not performed. The machine-learning model, e.g. of step b”’, can generate the inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate the patient has cancer, and the patient having a benign solid tumor may indicate the patient does not have cancer. The machine-learning model of step b”’, can be trained according to a method described herein, e.g. according to the methods for training the machine-learning model of step b.
[0148] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 % to about 100 %. 
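The relationship described in paragraphs [0145] and [0147] between the confidence value of the inference, the benign or malignant classification, and the optional biopsy can be sketched in Python as follows. The disclosure does not state any cutoff values, so the names `triage_from_confidence`, `malignant_cutoff`, and `biopsy_cutoff`, and their default values of 0.5, are placeholders chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    classification: str     # "malignant" or "benign"
    recommend_biopsy: bool  # whether a biopsy is suggested by this sketch


def triage_from_confidence(confidence, malignant_cutoff=0.5, biopsy_cutoff=0.5):
    """Map a confidence value in [0, 1] to a classification and biopsy flag.

    Both cutoffs are hypothetical; higher confidence means the model
    considers malignancy more likely.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    classification = "malignant" if confidence >= malignant_cutoff else "benign"
    # A biopsy is only suggested for tumors classified as malignant, and
    # only when the confidence also reaches the biopsy cutoff.
    recommend = classification == "malignant" and confidence >= biopsy_cutoff
    return Triage(classification, recommend)


# Hypothetical confidence values, for illustration only.
high = triage_from_confidence(0.9)
low = triage_from_confidence(0.2)
borderline = triage_from_confidence(0.6, biopsy_cutoff=0.8)
```

Keeping the two cutoffs separate lets the biopsy decision be stricter than the classification itself, which is one way the decision could depend on the confidence value.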
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer
with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0149] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of about 80 % to about 100 %. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having
cancer with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0150] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 % to about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to
about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 
[0151] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 % to about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % 
to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about
97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. 
In some embodiments, the machine learning model of step b”’, can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0152] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 % to about 100 %. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 %
to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. 
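As an illustration only, the positive and negative predictive values recited above can be computed from a set of reference labels and model calls as sketched below. The function name, the label encoding (1 for cancer, 0 for no cancer), and the example labels are assumptions made for this sketch and are not taken from the disclosure.

```python
def predictive_values(y_true, y_pred):
    """Return (PPV, NPV) for binary labels, where 1 = cancer and 0 = no cancer.

    PPV = TP / (TP + FP): fraction of positive calls that are truly positive.
    NPV = TN / (TN + FN): fraction of negative calls that are truly negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Report NaN when the model made no calls of the relevant kind.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical reference labels and model calls, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
ppv, npv = predictive_values(y_true, y_pred)
```

Both values depend on the prevalence of cancer in the evaluated cohort, so a figure obtained on one cohort need not carry over to another.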
In some embodiments, the machine learning model of step b”’, can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0153] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 1. 
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about
0.85 to about 1, about 0.9 to about 0.92, about 0.9 to about 0.94, about 0.9 to about 0.95, about 0.9 to about 0.96, about 0.9 to about 0.97, about 0.9 to about 0.98, about 0.9 to about 0.99, about 0.9 to about 0.995, about 0.9 to about 1, about 0.92 to about 0.94, about 0.92 to about 0.95, about 0.92 to about 0.96, about 0.92 to about 0.97, about 0.92 to about 0.98, about 0.92 to about 0.99, about 0.92 to about 0.995, about 0.92 to about 1, about 0.94 to about 0.95, about 0.94 to about 0.96, about 0.94 to about 0.97, about 0.94 to about 0.98, about 0.94 to about 0.99, about 0.94 to about 0.995, about 0.94 to about 1, about 0.95 to about 0.96, about 0.95 to about 0.97, about 0.95 to about 0.98, about 0.95 to about 0.99, about 0.95 to about 0.995, about 0.95 to about 1, about 0.96 to about 0.97, about 0.96 to about 0.98, about 0.96 to about 0.99, about 0.96 to about 0.995, about 0.96 to about 1, about 0.97 to about 0.98, about 0.97 to about 0.99, about 0.97 to about 0.995, about 0.97 to about 1, about 0.98 to about 0.99, about 0.98 to about 0.995, about 0.98 to about 1, about 0.99 to about 0.995, about 0.99 to about 1, or about 0.995 to about 1. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995. 
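For orientation, the AUC of a ROC curve equals the probability that a randomly chosen positive sample receives a higher model score than a randomly chosen negative sample, with ties counted as one half. The following minimal sketch computes the AUC on that basis; the function name, the label encoding, and the example scores are assumptions made for illustration and do not represent results of the disclosed model.

```python
def roc_auc(y_true, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    y_true holds 1 for cancer and 0 for no cancer; scores holds the model's
    score for the positive class. Tied pairs contribute one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores, invented for illustration.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

An AUC of 0.5 corresponds to chance-level ranking and an AUC of 1 to perfect separation of the two classes.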
The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. In some embodiments, the machine learning model of step b”’, can infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. In some embodiments, the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient. The treatment can include one or more treatments of cancer. The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer. In certain embodiments, the cancer is lung cancer. In certain embodiments, the cancer is pancreatic cancer. In certain embodiments, the cancer is ovarian cancer. In certain embodiments, the cancer is kidney cancer. In certain embodiments, the cancer is brain cancer. In some embodiments, the data set is indicative of the patient having lung cancer, and step d”’ can include administering to the patient a treatment for lung cancer. In some embodiments, the data set is indicative of the patient having pancreatic cancer, and step d”’ can include administering to the patient a treatment for pancreatic cancer. 
In some embodiments, the data set is indicative of the patient having ovarian cancer, and step d”’ can include administering to the patient a treatment for ovarian cancer. In some embodiments, the data set is
indicative of the patient having kidney cancer, and step d”’ can include administering to the patient a treatment for kidney cancer. In some embodiments, the data set is indicative of the patient having brain cancer, and step d”’ can include administering to the patient a treatment for brain cancer. In some embodiments, the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. [0154] In an aspect, the present disclosure provides a method for assessing a solid tumor of a patient, for biopsy. The method can include, any one of, any combination of, or all of steps w, x, y and z. Step w, can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The data set can be obtained from assaying the biological sample. Step x, can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor. Step y, can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor. Step z, can include performing biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor or benign tumor. In certain embodiments, step z, can include performing biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor. The decision to perform biopsy may depend on confidence value of the inference. In certain embodiments, biopsy of the lung nodule of the patient is not performed. 
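The flow of steps w, x, y and z can be pictured with the following minimal sketch. It is not the trained model of the disclosure: the scoring function, its weights, the placeholder gene names GENE_A and GENE_B, and the confidence threshold of 0.8 are all hypothetical values chosen only to show how an inference and its confidence value might feed the biopsy decision.

```python
import math
from dataclasses import dataclass


@dataclass
class Inference:
    label: str         # "malignant" or "benign"
    confidence: float  # probability the stand-in model assigns to that label


def toy_model(features: dict) -> Inference:
    """Stand-in for the trained machine-learning model of step x (hypothetical)."""
    # Invented weights; a real model would be trained as described elsewhere.
    weights = {"GENE_A": 1.2, "GENE_B": -0.8}
    z = sum(weights[g] * features[g] for g in weights)
    p_malignant = 1.0 / (1.0 + math.exp(-z))
    if p_malignant >= 0.5:
        return Inference("malignant", p_malignant)
    return Inference("benign", 1.0 - p_malignant)


def recommend_biopsy(inference: Inference, min_confidence: float = 0.8) -> bool:
    """Step z: recommend biopsy only for a sufficiently confident malignant call."""
    return inference.label == "malignant" and inference.confidence >= min_confidence


# Step w: a data set of gene expression measurements (hypothetical values).
data_set = {"GENE_A": 2.0, "GENE_B": 0.5}
# Steps x and y: provide the data set to the model and receive the inference.
inference = toy_model(data_set)
# Step z: decide on biopsy from the classification and its confidence value.
biopsy = recommend_biopsy(inference)
```

Gating the decision on a confidence threshold is one possible reading of the statement that the decision may depend on the confidence value of the inference; other decision rules are equally compatible with the text.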
The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like. [0155] The solid tumor can be a lung tumor, pancreatic tumor, kidney tumor, ovarian tumor, or a brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. [0156] In some embodiments, the solid tumor is a lung tumor, and the data set of step w, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. [0157] In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the at least
two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. 
In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. 
In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the data set of step w, includes 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and one or more clinical characteristics of the data set of step w, includes size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the solid tumor is a
lung tumor, and the data set of step w, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the data set of step w, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. [0158] In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. 
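For the lung tumor embodiments described above, the data set of step w can be pictured as a mapping from feature names to values, as in the minimal sketch below. The gene symbols are the 31 symbols enumerated in the text in connection with Table 7; the clinical field names (nodule_size_mm, age_years, upper_lobe), the function name, and the example values are assumptions introduced for this sketch.

```python
# The 31 gene symbols enumerated in the text in connection with Table 7.
TABLE_7_GENES = [
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C", "MGST2",
    "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2", "C6orf120",
    "TMEM8A", "ASAP1-IT2", "C15orf54", "CD101", "FNBP1", "TECR", "PROK2",
    "SLC35B3", "TDRD9", "CLHC1", "LPL", "IFITM3", "OGFOD3", "EIF2B3",
    "TMEM65", "MKRN3",
]

# Hypothetical field names for nodule size, patient age, and upper-lobe location.
CLINICAL_FIELDS = ["nodule_size_mm", "age_years", "upper_lobe"]


def build_data_set(expression: dict, clinical: dict) -> dict:
    """Assemble a step-w style data set from expression and clinical inputs.

    Keeps only genes on the panel, requires at least two of them, and appends
    whichever of the optional clinical characteristics are available.
    """
    genes = {g: expression[g] for g in TABLE_7_GENES if g in expression}
    if len(genes) < 2:
        raise ValueError("data set needs measurements of at least two panel genes")
    chars = {c: clinical[c] for c in CLINICAL_FIELDS if c in clinical}
    return {**genes, **chars}


# Hypothetical measurements, invented for illustration.
expression = {"BCAT1": 5.1, "CRCP": 3.3, "NOT_IN_PANEL": 9.9}
clinical = {"nodule_size_mm": 12.0, "age_years": 64, "upper_lobe": 1}
data_set = build_data_set(expression, clinical)
```

A fixed feature order, as used here, keeps the input layout consistent between training and inference; a model trained on all 31 genes would additionally need a policy for genes that were not measured.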
In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the solid tumor is an ovarian tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. 
In certain embodiments, the solid tumor is a kidney tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a
kidney tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. [0159] The biological sample can be blood sample, isolated peripheral blood mononuclear cells (PBMCs), solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. 
In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is CSF sample, or any derivative thereof. [0160] The machine learning model of step x, can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant. The machine-learning model, e.g. of step x, can be trained according to the methods described herein, e.g. as of the machine learning model of step b. [0161] Certain aspects are directed to a method for determining cancer in a patient. The method can include, any one of, any combination of, or all of steps w’, x’, y’ and z’. Step w’ can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor of the patient as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The gene expression measurements can be obtained by assaying the biological sample. Step x’ can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer. Step y’ can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer. Step z’ can
include electronically outputting a report indicating the patient has, or does not have cancer. The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like. [0162] The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor, respectively. In certain embodiments, the cancer is lung cancer, and the solid tumor is a lung tumor. In certain embodiments, the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor. In certain embodiments, the cancer is ovarian cancer, and the solid tumor is an ovarian tumor. In certain embodiments, the cancer is kidney cancer, and the solid tumor is a kidney tumor. In certain embodiments, the cancer is brain cancer, and the solid tumor is a brain tumor. In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from
the group of genes listed in Table 5. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step w’, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step w’, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group listed in Table 6. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes size of the nodule. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes age of the patient. 
In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, includes presence of the nodule in the lung upper lobe. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the dataset of step w’, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the at least two lung disease- associated genes of step w’, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer.
In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is brain cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the cancer is brain cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the cancer is brain cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. 
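As an illustration of the data set described in the preceding embodiments, the following minimal Python sketch assembles one feature vector from gene expression measurements and optional clinical characteristics. It is not taken from the disclosure: the function name `build_feature_vector`, the clinical field names, and the millimetre unit are hypothetical, and the gene list is the 31-gene list recited above in connection with Table 7 (assumed here to be the Table 7 genes).

```python
from typing import Dict, List, Sequence, Tuple

# The 31 gene symbols recited in the text in connection with Table 7
# (assumption: this recited list is the content of Table 7).
TABLE_7_GENES: List[str] = [
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C", "MGST2",
    "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2", "C6orf120", "TMEM8A",
    "ASAP1-IT2", "C15orf54", "CD101", "FNBP1", "TECR", "PROK2", "SLC35B3",
    "TDRD9", "CLHC1", "LPL", "IFITM3", "OGFOD3", "EIF2B3", "TMEM65", "MKRN3",
]

# Hypothetical field names for the three clinical characteristics named in the
# text: nodule size, patient age, and nodule presence in the lung upper lobe.
CLINICAL_FEATURES: List[str] = ["nodule_size_mm", "age_years", "upper_lobe"]


def build_feature_vector(
    expression: Dict[str, float],
    clinical: Dict[str, float],
    genes: Sequence[str] = TABLE_7_GENES,
    clinical_features: Sequence[str] = CLINICAL_FEATURES,
    min_genes: int = 2,
) -> Tuple[List[str], List[float]]:
    """Return (feature names, feature values) for one patient.

    Genes outside the gene set are ignored. At least `min_genes` genes from
    the set must be measured, mirroring the "at least two genes" language.
    Clinical characteristics are optional and included only when supplied.
    """
    present = [g for g in genes if g in expression]
    if len(present) < min_genes:
        raise ValueError(
            f"need at least {min_genes} genes from the gene set, got {len(present)}"
        )
    supplied = [c for c in clinical_features if c in clinical]
    names = present + supplied
    values = [float(expression[g]) for g in present]
    values += [float(clinical[c]) for c in supplied]
    return names, values
```

For example, a call with expression values for BCAT1 and LPL plus the patient's age and upper-lobe flag yields a four-element vector; a sample with only one measured gene from the set is rejected.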
[0163] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
[0164] The method can determine whether the patient has or does not have cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with an accuracy of about 80 % to about 100 %.
The method can determine whether the patient has or does not have cancer with an accuracy of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. 
The method can determine whether the patient has or does not have cancer with an accuracy of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can determine whether the patient has or does not have cancer with an accuracy of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can determine whether the patient has or does not have cancer with an accuracy of at most
about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0165] The method can determine whether the patient has or does not have cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a sensitivity of about 80 % to about 100 %. The method can determine whether the patient has or does not have cancer with a sensitivity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to
about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can determine whether the patient has or does not have cancer with a sensitivity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can determine whether the patient has or does not have cancer with a sensitivity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can determine whether the patient has or does not have cancer with a sensitivity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0166] The method can determine whether the patient has or does not have cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%,
at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a specificity of about 80 % to about 100 %. The method can determine whether the patient has or does not have cancer with a specificity of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to
about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can determine whether the patient has or does not have cancer with a specificity of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can determine whether the patient has or does not have cancer with a specificity of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can determine whether the patient has or does not have cancer with a specificity of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0167] The method can determine whether the patient has or does not have cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a positive predictive value of about 80 % to about 100 %. The method can determine whether the patient has or does not have cancer with a positive predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to
about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can determine whether the patient has or does not have cancer with a positive predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
The method can determine whether the patient has or does not have cancer with a positive predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can determine whether the patient has or does not have cancer with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0168] The method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 % to about 100 %.
The method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about
90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %. The method can determine whether the patient has or does not have cancer with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. The method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. The method can determine whether the patient has or does not have cancer with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0169] The machine learning model, e.g.
of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 1. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about 0.85 to about 1, about 0.9 to about 0.92, about 0.9 to about 0.94, about 0.9 to about 0.95, about 0.9 to about 0.96, about 0.9 to about 0.97, about 0.9 to about 0.98, about 0.9 to about 0.99, about 0.9 to about 0.995, about 0.9 to about 1, about 0.92 to about 0.94, about 0.92 to about 0.95, about 0.92 to about 0.96, about 0.92 to about 0.97, about 0.92 to about 0.98, about 0.92 to about 0.99, about 0.92 to about 0.995, about 0.92 to about 1, about 0.94 to about 0.95, about 0.94 to about 0.96, about 0.94 to about 0.97, about 0.94 to about 0.98, about 0.94 to 
about 0.99, about 0.94 to about 0.995, about 0.94 to about 1, about 0.95 to about 0.96,
about 0.95 to about 0.97, about 0.95 to about 0.98, about 0.95 to about 0.99, about 0.95 to about 0.995, about 0.95 to about 1, about 0.96 to about 0.97, about 0.96 to about 0.98, about 0.96 to about 0.99, about 0.96 to about 0.995, about 0.96 to about 1, about 0.97 to about 0.98, about 0.97 to about 0.99, about 0.97 to about 0.995, about 0.97 to about 1, about 0.98 to about 0.99, about 0.98 to about 0.995, about 0.98 to about 1, about 0.99 to about 0.995, about 0.99 to about 1, or about 0.995 to about 1. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1.
[0170] The inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range therebetween, that the patient has cancer. Higher confidence values may be correlated with a higher likelihood that the patient has cancer.
[0171] The machine-learning model, e.g.
of step x’, can generate an inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate that the patient has cancer, and the patient having a benign solid tumor may indicate that the patient does not have cancer. The machine-learning model can be trained according to a method described herein, e.g. according to the methods of training of the machine learning model of step b.
[0172] In another aspect, the present disclosure provides a computer system for assessing a solid tumor of a subject, containing: a database or other suitable data storage system that is configured to store a dataset containing i) gene expression measurements of a biological sample obtained or derived from the subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (ii) electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. Computer-implemented methods as described herein may be executed on computer systems such as those described above. For example, a computer system may comprise one or more processors and one or more memory units that collectively store computer-readable executable instructions that, as a result of execution, cause the one or more processors to
collectively perform the programmed steps described above. A computer system as described herein may comprise an assay device communicatively coupled to a personal computer. The data set can be a data set (e.g. of step a) described herein. The biological sample can be a biological sample described herein.
[0173] In some embodiments, the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.
[0174] In another aspect, the present disclosure provides one or more non-transitory computer readable media collectively containing machine-executable code that, upon execution by one or more computer processors, causes the one or more computer processors to perform a method for assessing a solid tumor of a subject, the method containing: (a) obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from a subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; (b) analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (c) electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. The data set can be a data set (e.g. of step a) described herein. The biological sample can be a biological sample described herein.
[0175] The disclosure includes the use of any inventive method, system, or other composition described herein, including a gene set determined using the inventive methods, for diagnosing a cancer, or for determining and/or administering a treatment of a patient or subject having a cancer.
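The classification, reporting, and performance measures described in the preceding paragraphs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the 0.5 decision threshold, the function names (`classify`, `build_report`, `confusion_metrics`, `roc_auc`), and the report fields are hypothetical, and labels are assumed to be coded 1 for malignant and 0 for benign. The AUC is computed with the standard pairwise-ranking definition (the fraction of malignant/benign pairs in which the malignant case receives the higher confidence value, with ties counted as one half).

```python
from typing import Dict, Sequence


def classify(confidence: float, threshold: float = 0.5) -> str:
    """Map a confidence value in [0, 1] to a class label.

    The 0.5 threshold is an assumption; the text does not fix one.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    return "malignant" if confidence >= threshold else "benign"


def build_report(subject_id: str, confidence: float,
                 threshold: float = 0.5) -> Dict[str, object]:
    """Assemble a minimal report record (field names are hypothetical)."""
    return {
        "subject_id": subject_id,
        "confidence": confidence,
        "classification": classify(confidence, threshold),
    }


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, PPV and NPV from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In use, the confidence values produced by a trained model would be thresholded with `classify`, compared against known labels with `confusion_metrics`, and summarized with `roc_auc`; the threshold can be moved to trade sensitivity against specificity.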
[0176] The current disclosure includes the following aspects:
[0177] Aspect 1 is directed to a method for assessing a lung nodule of a subject, comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; (b) analyzing the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (c) electronically outputting a report indicative of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
[0178] Aspect 2 is directed to the method of aspect 1, wherein the plurality of disease-associated genomic loci comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295 genes selected from the group listed in Table 4.
[0179] Aspect 3 is directed to the method of aspect 1 or 2, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0180] Aspect 4 is directed to the method of any one of aspects 1 to 3, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0181] Aspect 5 is directed to the method of any one of aspects 1 to 4, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0182] Aspect 6 is directed to the method of any one of aspects 1 to 5, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0183] Aspect 7 is directed to the method of any one of aspects 1 to 6, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0184] Aspect 8 is directed to the method of any one of aspects 1 to 7, further comprising classifying the lung nodule of the subject as the malignant lung nodule or the benign lung nodule with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0185] Aspect 9 is directed to the method of any one of aspects 1 to 8, wherein the subject has a lung cancer.
[0186] Aspect 10 is directed to the method of any one of aspects 1 to 8, wherein the subject is suspected of having a lung cancer.
[0187] Aspect 11 is directed to the method of any one of aspects 1 to 8, wherein the subject is at elevated risk of having a lung cancer.
[0188] Aspect 12 is directed to the method of any one of aspects 1 to 8, wherein the subject is asymptomatic for a lung cancer.
[0189] Aspect 13 is directed to the method of any one of aspects 1 to 12 further comprising administering a treatment to the subject based at least in part on the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
[0190] Aspect 14 is directed to the method of aspect 13, wherein the treatment is configured to treat a lung cancer of the subject.
[0191] Aspect 15 is directed to the method of aspect 13, wherein the treatment is configured to reduce a severity of a lung cancer of the subject.
[0192] Aspect 16 is directed to the method of aspect 13, wherein the treatment is configured to reduce a risk of having a lung cancer of the subject.
[0193] Aspect 17 is directed to the method of aspect 13, wherein the treatment is selected from the group consisting of: surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, and any combination thereof.
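As an illustration of the Area-Under-Curve measure recited in aspect 8, the following minimal sketch computes the AUC of a receiver operating characteristic curve as the fraction of (malignant, benign) pairs in which the malignant case receives the higher score, with ties counted as one half. The labels and scores are hypothetical toy values, not data from this disclosure, and the function name `roc_auc` is an assumption introduced only for this example.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC for binary labels (1 = malignant, 0 = benign).

    Uses the pairwise-ranking definition: the probability that a randomly
    chosen malignant case scores higher than a randomly chosen benign case.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six nodules (illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.50 corresponds to chance-level ranking, which is why the lowest threshold listed in aspect 8 is about 0.50.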
[0194] Aspect 18 is directed to the method of aspect 1, wherein (b) comprises using a trained machine learning classifier to analyze the data set to classify the lung nodule of the subject as the malignant lung nodule or the benign lung nodule. [0195] Aspect 19 is directed to the method of aspect 18, wherein the trained machine learning classifier is trained using gene expression data obtained by a data analysis tool selected from the group consisting of: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a CellScan big data analysis tool, an MS (Molecular Signature) Scoring ™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope). [0196] Aspect 20 is directed to the method of aspect 18, wherein the trained machine learning classifier is selected from the group consisting of a linear regression, a logistic regression, a Ridge regression, a
Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, and a combination thereof.
[0197] Aspect 21 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the logistic regression.
[0198] Aspect 22 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the GLM.
[0199] Aspect 23 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the kNN.
[0200] Aspect 24 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the SVM.
[0201] Aspect 25 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the GBM.
[0202] Aspect 26 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the RF.
[0203] Aspect 27 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the NB.
[0204] Aspect 28 is directed to the method of aspect 20, wherein the trained machine learning classifier comprises the EN regression.
[0205] Aspect 29 is directed to the method of aspect 1, wherein (b) comprises comparing the data set to a reference data set.
[0206] Aspect 30 is directed to the method of aspect 29, wherein the reference data set comprises gene expression measurements of reference biological samples at each of the plurality of lung disease-associated genomic loci.
[0207] Aspect 31 is directed to the method of aspect 29, wherein the reference biological samples comprise a first plurality of biological samples obtained or derived from subjects having a malignant lung nodule and a second plurality of biological samples obtained or derived from subjects having a benign lung nodule. [0208] Aspect 32 is directed to the method of any one of aspects 1 to 31, wherein the biological sample is selected from the group consisting of: a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a lung biopsy sample, or any derivative thereof. [0209] Aspect 33 is directed to the method of any one of aspects 1 to 32, further comprising determining a likelihood of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
[0210] Aspect 34 is directed to the method of any one of aspects 1 to 33, further comprising monitoring the lung nodule of the subject, wherein the monitoring comprises assessing the lung nodule of the subject at a plurality of time points.
[0211] Aspect 35 is directed to the method of aspect 34, wherein a difference in the assessment of the lung nodule of the subject among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the lung nodule of the subject, (ii) a prognosis of the lung nodule of the subject, and (iii) an efficacy or non-efficacy of a course of treatment for treating the lung nodule of the subject.
[0212] Aspect 36 is directed to a computer system for assessing a lung nodule of a subject, comprising: a database that is configured to store a data set comprising gene expression data, wherein the gene expression data is obtained by assaying a biological sample obtained or derived from the subject to produce gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (ii) electronically output a report indicative of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
[0213] Aspect 37 is directed to the computer system of aspect 36, further comprising an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report. [0214] Aspect 38 is directed to one or more non-transitory computer readable media collectively comprising machine-executable code that, upon execution by one or more computer processors, implements a method for assessing a lung nodule of a subject, the method comprising: (a) assaying a biological sample obtained or derived from the subject to produce a data set comprising gene expression measurements of the biological sample at each of a plurality of lung disease-associated genomic loci, wherein the plurality of disease-associated genomic loci comprises at least one gene selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8; (b) analyzing the data set to classify the lung nodule of the subject as a malignant lung nodule or a benign lung nodule; and (c) electronically outputting a report indicative of the classification of the lung nodule of the subject as the malignant lung nodule or the benign lung nodule.
[0215] Aspect 39 is directed to a method for assessing a lung nodule of a patient, the method comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung nodule; and d) electronically outputting a report classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
[0216] In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3.
In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4. In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 39, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical
characteristics listed in Table 6. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
In some embodiments, the data set of Aspect 39, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
[0217] Aspect 40 is directed to the method of aspect 39, wherein the at least two lung disease-associated genes are selected from the group of genes listed in Table 7.
[0218] Aspect 41 is directed to the method of aspect 39 or 40, wherein the one or more clinical characteristics comprise size of the nodule, age of the patient, and presence of the nodule in the lung upper lobe.
[0219] Aspect 42 is directed to the method of any one of aspects 39 to 41, wherein the machine-learning model is developed using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
[0220] Aspect 43 is directed to the method of any one of aspects 39 to 42, wherein the patient has lung cancer.
[0221] Aspect 44 is directed to the method of any one of aspects 39 to 42, wherein the patient does not have lung cancer.
[0222] Aspect 45 is directed to the method of any one of aspects 39 to 42, wherein the patient is at an elevated risk of having lung cancer.
[0223] Aspect 46 is directed to the method of any one of aspects 39 to 43 and 45, wherein the patient is asymptomatic for lung cancer.
[0224] Aspect 47 is directed to the method of any one of aspects 39 to 43, 45 and 46, further comprising administering a treatment based on the patient’s nodule being classified as a malignant nodule.
[0225] Aspect 48 is directed to the method of aspect 47, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
[0226] Aspect 49 is directed to the method of any one of aspects 39 to 48, wherein the inference includes a confidence value between 0 and 1 that the lung nodule is malignant.
[0227] Aspect 50 is directed to the method of any one of aspects 39 to 49, wherein the at least two lung disease-associated genes comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295 genes selected from the group of genes listed in Table 4.
[0228] Aspect 51 is directed to the method of any one of aspects 39 to 50, wherein the at least two lung disease-associated genes comprise at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7.
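Steps b) to d) of aspect 39, together with the confidence value of aspect 49, can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed model: the gene names `GENE_A` and `GENE_B`, the coefficients, the intercept, the 0.5 decision threshold, and the report fields are all hypothetical, and a logistic scoring function is used only because logistic regression is one of the model types listed in aspect 42.

```python
import math

# Hypothetical coefficients for illustration only; the source does not
# disclose model weights. Two placeholder genes plus one clinical
# characteristic (nodule size, as mentioned in aspect 41).
INTERCEPT = -1.0
WEIGHTS = {"GENE_A": 1.2, "GENE_B": -0.8, "nodule_size_mm": 0.05}


def malignancy_confidence(features):
    """Map a patient's data set to a confidence value between 0 and 1."""
    z = INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def make_report(patient_id, features, threshold=0.5):
    """Build a simple report classifying the nodule as malignant or benign."""
    confidence = malignancy_confidence(features)
    label = "malignant" if confidence >= threshold else "benign"
    return {
        "patient_id": patient_id,
        "classification": label,
        "confidence_malignant": round(confidence, 3),
    }


# Hypothetical input data set for one patient.
report = make_report(
    "patient-001",
    {"GENE_A": 1.5, "GENE_B": 0.2, "nodule_size_mm": 12.0},
)
print(report)
```

The threshold is kept as a parameter because the sensitivity and specificity obtained in practice depend on where it is set.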
[0229] Aspect 52 is directed to the method of any one of aspects 39 to 51, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0230] Aspect 53 is directed to the method of any one of aspects 39 to 52, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0231] Aspect 54 is directed to the method of any one of aspects 39 to 53, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about
92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0232] Aspect 55 is directed to the method of any one of aspects 39 to 54, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0233] Aspect 56 is directed to the method of any one of aspects 39 to 55, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
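The five performance measures recited in aspects 52 to 56 (accuracy, sensitivity, specificity, positive predictive value, and negative predictive value) all derive from the same confusion matrix. The sketch below shows the standard definitions on hypothetical labels (1 = malignant, 0 = benign); the function name and the toy data are assumptions made for this example and do not reflect results from this disclosure.

```python
def classification_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = malignant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # malignant nodules detected
        "specificity": ratio(tn, tn + fp),  # benign nodules cleared
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Hypothetical outcome for ten nodules: 4 malignant, 6 benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, the two predictive values depend on the proportion of malignant nodules in the evaluated population, so they are not directly comparable across cohorts with different prevalence.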
[0234] Aspect 57 is directed to the method of any one of aspects 39 to 56, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0235] Aspect 58 is directed to a system for assessing a lung nodule of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample of a patient of a plurality of lung disease-associated genes, selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant lung nodule or the benign lung nodule; and
generate a report classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
[0236] In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 58, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8.
In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung
disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 58, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
[0237] Aspect 59 is directed to a non-transitory computer-readable medium storing executable instructions for assessing a lung nodule of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample of a patient of a plurality of lung disease-associated genes, selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant lung nodule or the benign lung nodule; and generate a report classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
[0238] In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4. In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 59, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7. In some embodiments, the data set of Aspect 59,
comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 59, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease- associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. 
[0239] Aspect 60 is directed to a method for determining a gene set capable of classifying a lung nodule as benign or malignant without performing a biopsy, the method comprising: obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a lung nodule, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject, wherein the reference biological
sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics; determining feature importance values of the plurality of genes; and determining the gene set based at least in part on the feature importance values.
[0240] In some embodiments, the respective individual reference data set of Aspect 60, comprises i) gene expression measurements of the plurality of genes of the reference biological sample, and ii) data regarding whether the lung nodule of the reference subject is benign or malignant; and the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes. In some embodiments, the respective individual reference data set of Aspect 60, comprises i) gene expression measurements of the plurality of genes of the reference biological sample, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject; and the machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes and the one or more clinical characteristics.
[0241] Aspect 61 is directed to the method of aspect 60, wherein the plurality of genes comprises at least 2 genes selected from the group of genes listed in Table 9.
[0242] Aspect 62 is directed to a method for developing a trained machine learning model capable of inferring whether a lung nodule of a patient is benign or malignant, the method comprising:
(a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a lung nodule, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics listed in Table 6 of the reference subject, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a lung nodule is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics;
(c) determining feature importance values of the one or more predictors of the first machine learning model;
(d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and
(e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the lung nodule of the reference subject is benign or malignant, to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a lung nodule is benign or malignant, based at least in part on measurement data of the A predictors.
[0243] In some embodiments, the respective first individual reference data set of Aspect 62, comprises i) gene expression measurements of the plurality of genes of the reference biological sample, and ii) data regarding whether the lung nodule of the reference subject is benign or malignant; and the first machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes.
In some embodiments, the respective first individual reference data set of Aspect 62, comprises i) gene expression measurements of the plurality of genes of the reference biological sample, ii) data regarding whether the lung nodule of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the reference subject; and the first machine learning model is trained to infer whether a lung nodule is benign or malignant based at least in part on one or more predictors selected from the plurality of genes and the one or more clinical characteristics.
[0244] Aspect 63 is directed to the method of aspect 62, wherein the plurality of genes comprises at least 2 genes selected from the group of genes listed in Table 9.
[0245] Aspect 64 is directed to the method of any one of aspects 62 to 63, wherein the A predictors have top 5 to 200 feature importance values.
[0246] Aspect 65 is directed to the method of any one of aspects 62 to 64, wherein the trained machine learning model has an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0247] Aspect 66 is directed to the method of any one of aspects 62 to 65, wherein the trained machine learning model has a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about
90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0248] Aspect 67 is directed to the method of any one of aspects 62 to 66, wherein the trained machine learning model has a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0249] Aspect 68 is directed to the method of any one of aspects 62 to 67, wherein the trained machine learning model has a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0250] Aspect 69 is directed to the method of any one of aspects 62 to 68, wherein the trained machine learning model has a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0251] Aspect 70 is directed to the method of any one of aspects 62 to 69, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0252] Aspect 71 is directed to the method of any one of aspects 62 to 70, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
[0253] Aspect 72 is directed to a method for assessing a lung nodule of a patient, the method comprising:
(a) obtaining a data set comprising measurement data of the patient of one or more of the A predictors of any one of aspects 62 to 64;
(b) providing the data set as an input to a trained machine-learning model trained according to the methods of any one of aspects 62 to 71 to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule;
(c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung nodule; and
(d) electronically outputting a report classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule.
[0254] Aspect 73 is directed to the method of aspect 72, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof.
[0255] Aspect 74 is directed to the method of any one of aspects 72 to 73, wherein the patient has lung cancer.
[0256] Aspect 75 is directed to the method of any one of aspects 72 to 73, wherein the patient does not have lung cancer.
[0257] Aspect 76 is directed to the method of any one of aspects 72 to 73, wherein the patient is at elevated risk of having lung cancer.
[0258] Aspect 77 is directed to the method of any one of aspects 72 to 74 and 76, wherein the patient is asymptomatic for lung cancer.
[0259] Aspect 78 is directed to the method of any one of aspects 72 to 74, 76 and 77, further comprising administering a treatment based on the patient’s lung nodule being classified as a malignant nodule.
[0260] Aspect 79 is directed to the method of aspect 78, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
[0261] Aspect 80 is directed to a method for treating lung cancer in a patient having a lung nodule, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics selected from the group of clinical characteristics listed in Table 6 of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant lung nodule or a benign lung nodule;
(c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant lung nodule or the benign lung nodule; and
(d) administering a treatment based on the patient’s lung nodule being classified as the malignant lung nodule.
[0262] In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1. In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2. In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3. In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4. In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5. In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7.
In some embodiments, the data set of Aspect 80, comprises gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 1, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 2, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 3, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 4, and ii) clinical characteristics data of one or more clinical characteristics
of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 5, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6. In some embodiments, the data set of Aspect 80, comprises i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in Table 8, and ii) clinical characteristics data of one or more clinical characteristics of the patient, selected from the group of clinical characteristics listed in Table 6.
[0263] Aspect 81 is directed to the method of aspect 80, wherein the at least two lung disease-associated genes are selected from the group of genes listed in Table 7.
[0264] Aspect 82 is directed to the method of aspect 80 or 81, wherein the one or more clinical characteristics comprise size of the nodule, age of the patient, and presence of the nodule in the lung upper lobe.
[0265] Aspect 83 is directed to the method of any one of aspects 80 to 82, wherein the machine-learning model is developed using a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
[0266] Aspect 84 is directed to the method of any one of aspects 80 to 83, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
[0267] Aspect 85 is directed to the method of any one of aspects 80 to 84, wherein the inference includes a confidence value between 0 and 1 that the lung nodule is malignant.
[0268] Aspect 86 is directed to the method of any one of aspects 80 to 85, wherein the at least two lung disease-associated genes comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, genes selected from the group of genes listed in Table 4.
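By way of illustration only, the confidence value of Aspect 85 can be pictured as the output of a logistic function applied to a weighted sum of the predictors, which is one of the model families listed in Aspect 83. The Python sketch below is not the disclosed model. The predictor names (GENE_A, GENE_B, nodule_size_mm), the weights, the bias, and the 0.5 decision threshold are hypothetical placeholders chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Inference:
    confidence_malignant: float  # value between 0 and 1
    label: str                   # "malignant" or "benign"


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def infer(features: Mapping[str, float],
          weights: Mapping[str, float],
          bias: float,
          threshold: float = 0.5) -> Inference:
    """Map predictor measurements to a confidence value and a class label.

    The threshold is an assumption of this sketch, not a value from the text.
    """
    z = bias + sum(w * features[name] for name, w in weights.items())
    confidence = sigmoid(z)
    label = "malignant" if confidence >= threshold else "benign"
    return Inference(confidence, label)


# Hypothetical weights: two placeholder genes plus one clinical characteristic.
WEIGHTS = {"GENE_A": 1.0, "GENE_B": -1.0, "nodule_size_mm": 0.1}
BIAS = -1.0

patient = {"GENE_A": 2.0, "GENE_B": 0.5, "nodule_size_mm": 10.0}
result = infer(patient, WEIGHTS, BIAS)
print(f"{result.label} (confidence {result.confidence_malignant:.3f})")
```

A report as in step (d) of Aspect 72 could carry both fields of `Inference`; how the threshold is chosen in practice would depend on the sensitivity and specificity targeted.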
[0269] Aspect 87 is directed to the method of any one of aspects 80 to 86, wherein the at least two lung disease-associated genes comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31, genes selected from the group of genes listed in Table 7.
[0270] Aspect 88 is directed to the method of any one of aspects 80 to 87, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0271] Aspect 89 is directed to the method of any one of aspects 80 to 88, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0272] Aspect 90 is directed to the method of any one of aspects 80 to 89, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0273] Aspect 91 is directed to the method of any one of aspects 80 to 90, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0274] Aspect 92 is directed to the method of any one of aspects 80 to 91, comprising classifying the lung nodule of the patient as the malignant lung nodule or the benign lung nodule with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
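For reference, the performance measures recited in the aspects above (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and the area under the ROC curve) follow their standard definitions in terms of true and false positives and negatives. The Python sketch below assumes that malignant is encoded as 1 and benign as 0. The function names and the toy labels are illustrative assumptions and are not part of the disclosure.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute the five threshold-based measures (1 = malignant, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "ppv": _ratio(tp, tp + fp),           # positive predictive value
        "npv": _ratio(tn, tn + fn),           # negative predictive value
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a malignant case outscores a benign one.

    Ties between a malignant and a benign score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predictions, invented for the example.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(truth, predicted)
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

On the toy data this gives 3 true positives, 4 true negatives, 2 false positives, and 1 false negative, so sensitivity is 3/4 while positive predictive value is 3/5, which shows why the aspects recite the measures separately.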
[0275] Aspect 93 is directed to the method of any one of aspects 80 to 92, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least
about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0276] Aspect 94 is directed to a method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the method comprising:
(a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics;
(c) determining feature importance values of the plurality of genes; and
(d) determining the gene set based at least in part on the feature importance values.
[0277] In certain embodiments, the respective individual reference data set of Aspect 94, comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of one or more clinical characteristics of the reference subject, and the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and clinical characteristics data of the one or more clinical characteristics. In certain embodiments, the respective individual reference data set of Aspect 94, comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes.
[0278] Aspect 95 is directed to the method of aspect 94, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor or brain tumor.
[0279] Aspect 96 is directed to the method of aspect 94, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
[0280] Aspect 97 is directed to the method of aspect 94, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
[0281] Aspect 98 is directed to the method of aspect 94, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
[0282] Aspect 99 is directed to the method of aspect 94, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
[0283] Aspect 100 is directed to the method of aspect 94 or 96, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
[0284] Aspect 101 is directed to the method of aspect 94 or 97, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
[0285] Aspect 102 is directed to the method of aspect 94 or 98, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
[0286] Aspect 103 is directed to the method of aspect 94 or 99, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
[0287] Aspect 104 is directed to a method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising:
(a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics;
(c) determining feature importance values of the one or more predictors of the first machine learning model;
(d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and
(e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors.
[0288] In certain embodiments, the respective first individual reference data set of aspect 104, comprises i) gene expression measurements of the plurality of genes of the reference biological sample from the reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of one or more clinical characteristics of the reference subject, and the first machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and clinical characteristics data of the one or more clinical characteristics.
In certain embodiments, the respective first individual reference data set of aspect 104, comprises i) gene expression measurements of a plurality of genes of the reference biological sample from the reference subject having the reference solid tumor, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and the first machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes.
[0289] Aspect 105 is directed to the method of aspect 104, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor or brain tumor.
[0290] Aspect 106 is directed to the method of aspect 104, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
[0291] Aspect 107 is directed to the method of aspect 104, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
[0292] Aspect 108 is directed to the method of aspect 104, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
[0293] Aspect 109 is directed to the method of aspect 104, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
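Steps (a) to (e) of Aspect 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and is not the disclosed implementation. It uses a plain logistic regression (one of the model families named in the aspects) fitted by gradient descent, takes the absolute value of each fitted coefficient as the feature importance value, and assumes the measurements are already scaled to comparable ranges. The gene names GENE_0 to GENE_6, the measurement values, and the training hyperparameters are hypothetical, and A is set to 5, the smallest value in the recited range.

```python
import math

# (a) Hypothetical reference data: 6 subjects x 7 placeholder genes,
# label 1 = malignant, 0 = benign. GENE_2 and GENE_5 carry no signal.
GENES = [f"GENE_{i}" for i in range(7)]
ROWS = [
    [1.2, 0.8, 0.0, 1.5, 0.4, 0.0, 0.9],
    [0.9, 1.1, 0.0, 0.7, 0.6, 0.0, 1.3],
    [1.5, 0.5, 0.0, 1.0, 0.2, 0.0, 0.7],
    [-1.0, -0.9, 0.0, -1.2, -0.5, 0.0, -0.8],
    [-1.3, -0.6, 0.0, -0.9, -0.3, 0.0, -1.1],
    [-0.8, -1.2, 0.0, -1.4, -0.7, 0.0, -1.0],
]
LABELS = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """Probability that a sample is malignant under the fitted model."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(rows, labels, epochs=300, lr=0.5):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def top_predictors(names, weights, a):
    """Rank predictors by |coefficient| and keep the top a of them."""
    ranked = sorted(zip(names, weights), key=lambda nw: abs(nw[1]), reverse=True)
    return [name for name, _ in ranked[:a]]


# (b) Train the first model on all genes.
w1, b1 = train_logistic(ROWS, LABELS)
# (c) and (d) Importance values and selection of the A predictors.
A = 5
selected = top_predictors(GENES, w1, A)
# (e) Train the second model on the A selected predictors only.
cols = [GENES.index(name) for name in selected]
reduced = [[row[j] for j in cols] for row in ROWS]
w2, b2 = train_logistic(reduced, LABELS)
predictions = [1 if predict_proba(w2, b2, x) >= 0.5 else 0 for x in reduced]
```

Coefficient magnitude is only one possible importance measure; tree-based importances or permutation importance would slot into `top_predictors` in the same way. A real workflow would also evaluate the second model on subjects held out from both training stages, which this toy example omits.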
[0294] Aspect 110 is directed to the method of aspect 104 or 106, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
[0295] Aspect 111 is directed to the method of aspect 104 or 107, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
[0296] Aspect 112 is directed to the method of aspect 104 or 108, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
[0297] Aspect 113 is directed to the method of aspect 104 or 109, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
[0298] Aspect 114 is directed to the method of any one of aspects 104 to 113, wherein the A predictors have top 5 to 200 feature importance values.
[0299] Aspect 115 is directed to the method of any one of aspects 104 to 114, wherein the trained machine learning model has an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0300] Aspect 116 is directed to the method of any one of aspects 104 to 115, wherein the trained machine learning model has a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0301] Aspect 117 is directed to the method of any one of aspects 104 to 116, wherein the trained machine learning model has a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0302] Aspect 118 is directed to the method of any one of aspects 104 to 117, wherein the trained machine learning model has a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0303] Aspect 119 is directed to the method of any one of aspects 104 to 118, wherein the trained machine learning model has a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0304] Aspect 120 is directed to the method of any one of aspects 104 to 117, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0305] Aspect 121 is directed to the method of any one of aspects 104 to 120, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
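The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited in aspects 115 to 119 all derive from the same four confusion-matrix counts. The sketch below shows the standard definitions, assuming malignant is treated as the positive class (encoded as 1) and benign as the negative class (encoded as 0). The function name and the example labels are hypothetical and do not come from the disclosure.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, specificity, PPV and NPV.

    Labels are 1 for malignant (positive class) and 0 for benign.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant called malignant
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign called benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign called malignant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant called benign

    def ratio(num, den):
        # A metric is undefined when its denominator is zero.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Made-up labels for ten tumors: four malignant, six benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

For these labels the counts are 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, so accuracy is 0.8, sensitivity and PPV are 0.75, and specificity and NPV are 5/6.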
[0306] Aspect 122 is directed to a method for assessing a solid tumor of a patient, the method comprising:
(a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of aspect 94, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor;
(c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and
(d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor.
[0307] In certain embodiments, the data set of aspect 122 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of aspect 94. In certain embodiments, the data set of aspect 122 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of aspect 94, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
[0308] Aspect 123 is directed to the method of aspect 122, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
[0309] Aspect 124 is directed to the method of aspect 122, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 96, or 100, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant.
[0310] Aspect 125 is directed to the method of aspect 122, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 97, or 101, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant.
[0311] Aspect 126 is directed to the method of aspect 122, wherein the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 98, or 102, wherein the gene set is capable of classifying the brain tumor as benign or malignant.
[0312] Aspect 127 is directed to the method of aspect 122, wherein the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set of any one of aspects 94, 99, or 103, wherein the gene set is capable of classifying the kidney tumor as benign or malignant.
[0313] Aspect 128 is directed to the method of aspect 122 or 124, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
[0314] Aspect 129 is directed to the method of aspect 122 or 125, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
[0315] Aspect 130 is directed to the method of aspect 122 or 126, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
[0316] Aspect 131 is directed to the method of aspect 122 or 127, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
[0317] Aspect 132 is directed to the method of any one of aspects 122 to 131, wherein the machine-learning model is trained according to the method of any one of aspects 104 to 121.
[0318] Aspect 133 is directed to the method of any one of aspects 122 to 132, wherein the patient has cancer.
[0319] Aspect 134 is directed to the method of any one of aspects 122 to 132, wherein the patient does not have cancer.
[0320] Aspect 135 is directed to the method of any one of aspects 122 to 132, wherein the patient is at an elevated risk of having cancer.
[0321] Aspect 136 is directed to the method of any one of aspects 122 to 133, and 135, wherein the patient is asymptomatic for cancer.
[0322] Aspect 137 is directed to the method of any one of aspects 133 to 136, wherein the cancer is pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
[0323] Aspect 138 is directed to the method of any one of aspects 122 to 137, further comprising administering a treatment based on the patient’s solid tumor being classified as malignant.
[0324] Aspect 139 is directed to the method of aspect 138, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
[0325] Aspect 140 is directed to the method of any one of aspects 122 to 139, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant.
[0326] Aspect 141 is directed to the method of any one of aspects 122 to 140, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
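Steps (a) to (d) of aspect 122, together with the confidence value of aspect 140, can be sketched as a small inference routine. This is an illustration under stated assumptions, not the model of the disclosure: the gene identifiers `GENE_A` and `GENE_B`, the weights, the patient identifier, the logistic form of the model, and the 0.5 decision threshold are all hypothetical.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class Inference:
    label: str         # "malignant" or "benign"
    confidence: float  # value between 0 and 1 that the tumor is malignant


def infer(data_set, weights, bias, threshold=0.5):
    """Steps (b) and (c): feed the data set to the model and receive the inference."""
    z = bias + sum(weights[gene] * data_set[gene] for gene in weights)
    z = max(-30.0, min(30.0, z))  # keep math.exp inside its safe range
    confidence = 1.0 / (1.0 + math.exp(-z))
    label = "malignant" if confidence >= threshold else "benign"
    return Inference(label, confidence)


def build_report(patient_id, inference):
    """Step (d): produce a report that could be output electronically."""
    return json.dumps(
        {
            "patient_id": patient_id,
            "classification": inference.label,
            "confidence_malignant": round(inference.confidence, 3),
        },
        indent=2,
    )


# Hypothetical model parameters and step (a) input; none of these values are from the source.
weights = {"GENE_A": 1.0, "GENE_B": -1.0}
data_set = {"GENE_A": 2.0, "GENE_B": 0.0}

result = infer(data_set, weights, bias=0.0)
report = build_report("patient-001", result)
print(report)
```

The source does not specify how the confidence value maps to a class, so the threshold is left as a parameter. Lowering it trades specificity for sensitivity.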
[0327] Aspect 142 is directed to the method of any one of aspects 122 to 141, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0328] Aspect 143 is directed to the method of any one of aspects 122 to 142, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0329] Aspect 144 is directed to the method of any one of aspects 122 to 143, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0330] Aspect 145 is directed to the method of any one of aspects 122 to 144, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0331] Aspect 146 is directed to the method of any one of aspects 122 to 145, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
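The ROC AUC recited in aspect 146 equals the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the example scores are hypothetical and are not results from the disclosure.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly.

    y_true uses 1 for malignant and 0 for benign; scores are the model's
    confidence values that each tumor is malignant.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up scores for three malignant and three benign tumors.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

Eight of the nine malignant-benign pairs are ordered correctly here, giving an AUC of 8/9. The pairwise loop is quadratic in the number of samples; a rank-based computation is preferable for large cohorts.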
[0332] Aspect 147 is directed to a method for treating cancer in a patient having a solid tumor, the method comprising:
(a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor;
(c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and
(d) administering a treatment based on the patient’s tumor being classified as a malignant tumor.
[0333] In certain embodiments, the data set of aspect 147 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103. In certain embodiments, the data set of aspect 147 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
[0334] Aspect 148 is directed to the method of aspect 147, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
[0335] Aspect 149 is directed to a system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to:
obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor;
receive, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and
generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
[0336] In certain embodiments, the data set of aspect 149 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103. In certain embodiments, the data set of aspect 149 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
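The instruction sequence of the system in aspect 149 (obtain from a database, provide to a model, receive the inference, generate a report) might be wired together as in the following sketch. It makes several assumptions that are not in the source: an in-memory SQLite database, a table named `expression` with columns `patient_id`, `gene`, and `value`, placeholder gene identifiers, and a trivial stand-in `toy_model` in place of a trained machine-learning model.

```python
import sqlite3


def obtain_data_set(conn, patient_id, gene_set):
    """Read the patient's expression values for the requested genes."""
    placeholders = ",".join("?" for _ in gene_set)
    rows = conn.execute(
        "SELECT gene, value FROM expression "
        f"WHERE patient_id = ? AND gene IN ({placeholders})",
        [patient_id, *gene_set],
    ).fetchall()
    data_set = dict(rows)
    missing = [g for g in gene_set if g not in data_set]
    if missing:
        raise KeyError(f"missing expression values for: {missing}")
    return data_set


def assess(conn, patient_id, gene_set, model):
    """Obtain the data set, run the model, and generate a one-line report."""
    data_set = obtain_data_set(conn, patient_id, gene_set)
    is_malignant = model(data_set)  # inference received from the model
    label = "malignant" if is_malignant else "benign"
    return f"Patient {patient_id}: solid tumor classified as {label}."


def toy_model(data_set):
    # Stand-in rule for illustration only; a real system would call a trained model.
    return data_set["GENE_A"] > data_set["GENE_B"]


# Hypothetical schema and values, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (patient_id TEXT, gene TEXT, value REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("p1", "GENE_A", 2.0), ("p1", "GENE_B", 0.5), ("p1", "GENE_C", 9.9)],
)

report = assess(conn, "p1", ["GENE_A", "GENE_B"], toy_model)
print(report)
```

Passing the model in as a callable keeps data retrieval separate from inference, so the same retrieval code could serve any of the model types listed in aspect 121.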
[0337] Aspect 150 is directed to a non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to:
obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor;
receive, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and
generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
[0338] In certain embodiments, the data set of aspect 150 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103. In certain embodiments, the data set of aspect 150 comprises i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of aspects 94 to 103, and ii) clinical characteristics data of one or more clinical characteristics of the patient.
Numbered embodiments
1. A method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the method comprising:
a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics;
c) determining feature importance values of the plurality of genes; and
d) determining the gene set based at least in part on the feature importance values.
2. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
3. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
4. The method of embodiment 1, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
5. The method of embodiment 1, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
6. The method of embodiment 1, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
7. The method of embodiment 1 or 3, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
8. The method of embodiment 1 or 4, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
9. The method of embodiment 1 or 5, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
10. The method of embodiment 1 or 6, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
11. A method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising:
(a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
(b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics;
(c) determining feature importance values of the one or more predictors of the first machine learning model;
(d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and
(e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant,
to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors.
12. The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor.
13. The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer.
14. The method of embodiment 11, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer.
15. The method of embodiment 11, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer.
16. The method of embodiment 11, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer.
17. The method of embodiment 11 or 13, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
18. The method of embodiment 11 or 14, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer.
19. The method of embodiment 11 or 15, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer.
20. The method of embodiment 11 or 16, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer.
21. The method of any one of embodiments 11 to 20, wherein the A predictors have top 5 to 200 feature importance values.
22. The method of any one of embodiments 11 to 21, wherein the trained machine learning model has an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
23. The method of any one of embodiments 11 to 22, wherein the trained machine learning model has a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
24. The method of any one of embodiments 11 to 23, wherein the trained machine learning model has a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least
about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
25. The method of any one of embodiments 11 to 24, wherein the trained machine learning model has a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
26. The method of any one of embodiments 11 to 25, wherein the trained machine learning model has a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
27. The method of any one of embodiments 11 to 26, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
28. The method of any one of embodiments 11 to 27, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof.
29. A method for assessing a solid tumor of a patient, the method comprising:
a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of embodiment 1, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor;
c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and
Attorney Docket No.225234-718601/PCT d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor. 30. The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney or brain tumor. 31. The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set any one of embodiments 1, 3, or 7, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant. 32. The method of embodiment 29, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of embodiments 1, 4, or 8, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant. 33. The method of embodiment 29, wherein the solid tumor is a brain tumor, and the at least 2 genes selected are from the gene set of embodiments 1, 5, or 9, wherein the gene set is capable of classifying the brain tumor as benign or malignant. 34. The method of embodiment 29, wherein the solid tumor is a kidney tumor, and the at least 2 genes selected are from the gene set of embodiments 1, 6, or 10, wherein the gene set is capable of classifying the kidney tumor as benign or malignant. 35. The method of embodiment 29 or 31, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer. 36. The method of embodiment 29 or 32, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. 37. The method of embodiment 29 or 33, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. 38. 
The method of embodiment 29 or 34, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. 39. The method of any one of embodiments 29 to 38, wherein the machine-learning model is trained according to the method of any one of embodiments 11 to 28. 40. The method of any one of embodiments 29 to 39, wherein the patient has cancer. 41. The method of any one of embodiments 29 to 39, wherein the patient does not have cancer. 42. The method of any one of embodiments 29 to 39, wherein the patient is at an elevated risk of having cancer. 43. The method of any one of embodiments 29 to 39, and 42, wherein the patient is asymptomatic for cancer.
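As a non-limiting illustration, steps a) to d) of embodiment 29 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed model: the gene names (GENE_A, GENE_B), the clinical characteristic (age_over_60), the coefficients, and the 0.5 decision threshold are all hypothetical, and a simple logistic function stands in for whichever trained machine-learning model is actually used.

```python
import math

# Hypothetical coefficients for illustration only; a real model would be
# trained, e.g., according to embodiments 11 to 28.
WEIGHTS = {"GENE_A": 1.2, "GENE_B": -0.8, "age_over_60": 0.5}
INTERCEPT = -0.1


def infer_malignancy(data_set):
    """Steps b) and c): return a score in (0, 1) that the tumor is malignant."""
    score = INTERCEPT + sum(w * data_set.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-score))


def build_report(patient_id, data_set, threshold=0.5):
    """Step d): turn the inference into a report record (threshold is assumed)."""
    confidence = infer_malignancy(data_set)
    label = "malignant" if confidence >= threshold else "benign"
    return {
        "patient_id": patient_id,
        "classification": label,
        "confidence_malignant": round(confidence, 3),
    }


# Step a): data sets holding gene expression values and an optional
# clinical characteristic (all values are made up for the example).
report = build_report("patient-001", {"GENE_A": 2.0, "GENE_B": 0.5, "age_over_60": 1.0})
benign_report = build_report("patient-002", {"GENE_A": 0.0, "GENE_B": 2.0})
print(report)
print(benign_report)
```

In practice the report record would be rendered or transmitted electronically; the dictionary here only shows which fields such a report might carry.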
44. The method of any one of embodiments 40 to 43, wherein the cancer is pancreatic cancer, ovarian cancer or brain cancer. 45. The method of any one of embodiments 29 to 44, further comprising administering a treatment based on the patient’s solid tumor being classified as malignant. 46. The method of embodiment 45, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. 47. The method of any one of embodiments 29 to 46, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant. 48. The method of any one of embodiments 29 to 47, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 49. The method of any one of embodiments 29 to 48, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 50. The method of any one of embodiments 29 to 49, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 51. 
The method of any one of embodiments 29 to 50, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 52. The method of any one of embodiments 29 to 51, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 53. The method of any one of embodiments 29 to 52, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about
0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. 54. A method for treating cancer in a patient having a solid tumor, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on the patient’s tumor being classified as a malignant tumor. 55. The method of embodiment 54, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. 56. 
A system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign solid tumor; and
generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. 57. A non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. 58. 
A method for obtaining a gene set capable of classifying whether a patient has cancer, the method comprising: (a) providing a dataset as an input to a machine learning classifier, wherein said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of gene modules form the features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer. 59. The method of embodiment 58, wherein the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm. 60. The method of embodiment 58 or 59, wherein the feature selection comprises starting from a featureless model, and sequentially adding the next best feature using leave-one-group-in importance
(LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved. 61. The method of embodiment 60, wherein the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001. 62. The method of any one of embodiments 58 to 61, wherein the dataset is a batch corrected dataset. 63. The method of any one of embodiments 58 to 62, wherein the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurements from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules. 64. The method of embodiment 63, wherein the M genes are clustered based on protein-protein interactions of the proteins encoded by the M genes. 65. The method of embodiment 63 or 64, wherein the M genes are the M most variably expressed genes of the initial data set. 66. The method of any one of embodiments 63 to 65, wherein M is 500 to 10000. 67. The method of any one of embodiments 58 to 66, further comprising analyzing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient. 68. 
The method of embodiment 67, wherein the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, or all genes selected from the genes within the gene set obtained in step (c). 69. The method of embodiment 67, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). 70. The method of any one of embodiments 67 to 69, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
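As a non-limiting illustration, the sequential feature selection recited in embodiments 60 and 61 (start from a featureless model, add the next best gene module, stop when the reduction in MMCE no longer exceeds the improvement threshold) can be sketched in Python as follows. This is a generic forward-selection sketch under stated assumptions rather than the SGFI implementation itself: the function name, the `mmce_of` callback, the module names, and the toy error values are hypothetical, and in practice `mmce_of` would train and cross-validate a classifier on the listed modules.

```python
def sequential_group_selection(groups, mmce_of, threshold=0.001):
    """Greedy forward selection over feature groups (e.g., gene modules).

    groups    : iterable of group names
    mmce_of   : callable taking a list of group names and returning the
                mean misclassification error of a model using those groups
    threshold : minimum MMCE improvement required to keep adding groups
    """
    selected = []
    best_error = mmce_of(selected)  # featureless model as the starting point
    remaining = list(groups)
    while remaining:
        # Score each candidate group when added to the current selection.
        candidate_errors = {g: mmce_of(selected + [g]) for g in remaining}
        best_group = min(candidate_errors, key=candidate_errors.get)
        improvement = best_error - candidate_errors[best_group]
        if improvement <= threshold:
            break  # no further improvement over the threshold
        selected.append(best_group)
        remaining.remove(best_group)
        best_error = candidate_errors[best_group]
    return selected, best_error


# Toy stand-in for model evaluation: each module lowers the error by a
# fixed, made-up amount from a featureless error of 0.5.
TOY_GAINS = {"module_1": 0.20, "module_2": 0.10, "module_3": 0.0005}


def toy_mmce(group_names):
    return 0.5 - sum(TOY_GAINS[g] for g in group_names)


selected_modules, final_error = sequential_group_selection(TOY_GAINS, toy_mmce, threshold=0.001)
print(selected_modules, final_error)
```

With the toy values, the third module is rejected because its improvement (0.0005) does not exceed the 0.001 threshold, which is one of the threshold values listed in embodiment 61.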
71. The method of any one of embodiments 67 to 70, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 72. The method of any one of embodiments 67 to 71, wherein the method classifies whether the patient has cancer with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 73. The method of any one of embodiments 67 to 72, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 74. The method of any one of embodiments 67 to 73, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 75. The method of any one of embodiments 67 to 74, wherein analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set. 76. 
The method of embodiment 75, further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference. 77. The method of embodiment 75 or 76, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. 78. The method of any one of embodiments 75 to 77, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about
0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. 79. The method of any one of embodiments 67 to 78, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. 80. The method of any one of embodiments 67 to 79, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. 81. The method of embodiment 80, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. 82. The method of any one of embodiments 58 to 81, wherein the cancer is a solid cancer. 83. The method of embodiment 82, wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 84. The method of any one of embodiments 58 to 81, wherein the cancer is a blood cancer. 85. 
The method of embodiment 84, wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. 86. A method for classifying whether a patient has cancer, the method comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) of any one of embodiments 58-66 as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine-learning model, the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient. 87. The method of embodiment 86, wherein the patient dataset comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, or all genes selected from the genes within the gene set obtained in step (c) of any one of embodiments 58 to 66. 88. The method of embodiment 86, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). 89. The method of any one of embodiments 86 to 88, wherein the patient dataset is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), an enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof. 90. The method of any one of embodiments 86 to 88, wherein the patient dataset is derived from the gene expression measurements using GSVA. 91. 
The method of any one of embodiments 86 to 90, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. 92. The method of any one of embodiments 86 to 91, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 93. The method of any one of embodiments 86 to 92, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 94. The method of any one of embodiments 86 to 93, wherein the method classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at
least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 95. The method of any one of embodiments 86 to 94, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 96. The method of any one of embodiments 86 to 95, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 97. The method of any one of embodiments 86 to 96, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. 98. The method of any one of embodiments 86 to 97, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. 99. The method of any one of embodiments 86 to 98, wherein the cancer is a solid cancer. 100. 
The method of embodiment 99, wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 101. The method of any one of embodiments 86 to 98, wherein the cancer is a blood cancer. 102. The method of embodiment 101, wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia.
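As a non-limiting illustration, the performance measures recited in embodiments 92 to 97 (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) can be computed from held-out labels and model outputs as sketched below in Python. The labels, scores, and the 0.5 cutoff are made-up example values, and the function names are hypothetical. The AUC is computed with the standard pairwise-ranking definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, counting ties as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Compute the five threshold-based measures; 1 = cancer, 0 = no cancer."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else None  # undefined when the denominator is 0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: four samples with cancer, four without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

metrics = confusion_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The threshold-based measures depend on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs, which is why the embodiments recite them separately.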
103. The method of any one of embodiments 86 to 102, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. 104. The method of embodiment 103, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. 105. The method of embodiment 80 or 103, wherein the treatment comprises ABVD, AC, ATO, ATRA, Abemaciclib (Verzenios), Abiraterone (Zytiga), Abraxane, Abstral, Acalabrutinib, Actimorph, Actinomycin D, Actiq, Adriamycin, Afatinib (Giotrif), Afinitor, Aldara, Aldesleukin (IL-2, Proleukin or interleukin 2), Alectinib, Alectinib (Alecensa), Alemtuzumab (Campath, MabCampath), Alkeran, Amsacrine (Amsidine, m-AMSA), Amsidine, Anastrozole (Arimidex), Apalutamide, Ara C, Arimidex, Aromasin, Arsenic trioxide (Trisenox, ATO), Asparaginase (Spectrila, Erwinase, Oncaspar), Atezolizumab, Avelumab, Axitinib (Inlyta), Azacitidine (Vidaza, Onureg), BEACOPP, BEAM, Bendamustine (Levact), Besponsa, Bevacizumab (Avastin), Bexarotene (Targretin), Bicalutamide (Casodex), Bleomycin, Bleomycin, etoposide and platinum (BEP), Blinatumomab (Blincyto), Bortezomib (Velcade), Bortezomib thalidomide and dexamethasone (VTD), Bortezomib, cyclophosphamide and dexamethasone (VCD), Bortezomib, melphalan and prednisolone (VMP), Bosulif, Bosutinib (Bosulif), Brentuximab, Brigatinib (Alunbrig), Buserelin (Suprefact), Busulfan, CAPE-OX, CAPOX, CAV, CCNU, CHOP, Cabazitaxel (Jevtana), Cabometyx, Cabozantinib (Cometriq, Cabometyx), Caelyx, Calpol, Campto, Capecitabine (Xeloda), Caprelsa, CarboTaxol, Carboplatin, Carboplatin and etoposide, Carboplatin and paclitaxel, Carfilzomib and dexamethasone, Carmustine (BCNU), Casodex, Cemiplimab, Ceritinib (Zykadia), Cetuximab (Erbitux), Chlorambucil (Leukeran), Cisplatin, Cisplatin and capecitabine (CX), Cisplatin and fluorouracil (5FU), Cisplatin, etoposide and ifosfamide (VIP), Cisplatin, fluorouracil 
(5FU) and trastuzumab, Cladribine, Clasteon, Co- codamol (Kapake, Solpadol, Tylex), Cometriq, Cosmegen, Crisantaspase, Crizotinib (Xalkori), Cyclophosphamide, Cyclophosphamide, thalidomide and dexamethasone (CTD), Cyprostat, Cyproterone acetate, Cytarabine (Ara C, cytosine arabinoside), Cytarabine into spinal fluid (intrathecal cytarabine), Cytosine arabinoside, DHAP, DTIC, Dabrafenib (Tafinlar), Dabrafenib and trametinib, Dacarbazine (DTIC), Dacomitinib, Dactinomycin (actinomycin D), Daratumumab, Darolutamide (Nubeqa), Darzalex, Dasatinib (Sprycel), Daunorubicin, Daunorubicin, cytarabine and midostaurin, Decapeptyl SR, Degarelix (Firmagon), Denosumab (Prolia, Xgeva), Dexamethasone, Diamorphine, Disodium pamidronate, Disprol, Docetaxel (Taxotere), Docetaxel, cisplatin and fluorouracil (TPF), Doxifos, Doxorubicin (Adriamycin), Doxorubicin and ifosfamide, Durogesic, Durvalumab (Imfinzi), EC, ECF, EOF, EOX, EP (Etoposide and cisplatin), ESHAP, Effentora, Encorafenib and binimetinib, Encorafenib and cetuximab, Entrectinib (Rozlytrek), Enzalutamide, Epirubicin, Epirubicin cisplatin and capecitabine (ECX), Epirubicin, carboplatin and capecitabine (ECarboX), Erbitux, Eribulin (Halaven), Erlotinib (Tarceva), Erwinase, Etopophos, Etoposide (Etopophos), Everolimus
(Afinitor), Evoltra, Exemestane (Aromasin), FOLFIRINOX, FOLFOX, FOLFOXIRI, Faslodex, Femara, Fentanyl, Firmagon, Fludara, Fludarabine (Fludara), Fludarabine, cyclophosphamide and rituximab (FCR), Fluorouracil (5FU), Fluorouracil (5FU) and mitomycin C, Fluorouracil, Leucovorin, Oxaliplatin and Docetaxel (FLOT), Flutamide, Folinic acid, fluorouracil and irinotecan (FOLFIRI), Fotivda, Fulvestrant (Faslodex), G-CSF, Gefitinib (Iressa), GemCarbo (gemcitabine and carboplatin), GemTaxol, Gemcitabine (Gemzar), Gemcitabine and capecitabine (GemCap), Gemcitabine and cisplatin (GC), Gemcitabine and nab-paclitaxel, Gemcitabine and paclitaxel (GemTaxol), Gemtuzumab ozogamicin, daunorubicin and cytarabine, Gemzar, Giotrif, Gliadel (carmustine wafers), Glivec, Gonapeptyl Depot, Goserelin (Zoladex) for breast cancer, Goserelin for prostate cancer, Granulocyte colony stimulating factor (G-CSF), Halaven, Herceptin, Herzuma, Hycamtin, Hydrea, Hydrocortisone, Hydroxycarbamide (Hydrea), Hydroxyurea, ICE, IL-2, IPE, Ibandronic acid (Bondronat), Ibrance, Ibrutinib (Imbruvica), Ibuprofen, Iclusig, Idarubicin, Ifosfamide (Mitoxana), Imatinib (Glivec), Imiquimod cream (Aldara), Inotuzumab ozogamicin, Interleukin, Ipilimumab (Yervoy), Ipilimumab and nivolumab, IrCap, Iressa, Irinotecan (Campto), Irinotecan and capecitabine (Xeliri), Irinotecan de Gramont, Irinotecan modified de Gramont, Ixazomib lenalidomide and dexamethasone, Jevtana, Kadcyla, Kapake, Keytruda, Kisqali, Lanreotide (Somatuline), Larotrectinib (Vitrakvi), Lenalidomide (Revlimid), Lenvatinib, Letrozole (Femara), Leukeran, Leuprorelin, Leustat, Levact, Liposomal doxorubicin, Litak, Lomustine, Lonsurf, Lorlatinib (Lorviqua), Lutrate, Lynparza, Lysodren, MAP, MMM, MPT, MST Continus, MVAC, MXL, MabCampath, Mabthera, Medroxyprogesterone acetate (Provera), Megace, Megestrol acetate (Megace), Melphalan (Alkeran), Mepact, Mercaptopurine (Xaluprine), Methotrexate, Methylprednisolone, Mifamurtide, 
Mitomycin C, Mitotane (Lysodren), Mitoxana, Mitoxantrone (Mitozantrone), Mobocertinib, Mobocertinib (Exkivity), Modified de Gramont, Morphgesic SR, Morphine, m-AMSA, Nab paclitaxel (Abraxane), Navelbine, Nelarabine (Atriance), Neratinib (Nerlynx), Nerlynx, Nexavar, Nilotinib (Tasigna), Nintedanib, Nipent, Niraparib (Zejula), Nivolumab (Opdivo), Obinutuzumab (Gazyvaro), Octreotide, Olaparib (Lynparza), Ontruzant, Onureg, Opdivo, Oramorph, Osimertinib (Tagrisso), OxCap, Oxaliplatin, Oxaliplatin and capecitabine (Xelox), PC, PE, PMitCEBO, POMB/ACE, Paclitaxel (Taxol), Paclitaxel and carboplatin (PC, CarboTaxol), Palbociclib (Ibrance), Pamidronate, Panadol, Panitumumab (Vectibix), Panobinostat, bortezomib and dexamethasone, Paracetamol, Pazopanib (Votrient), Peginterferon alfa 2a, Pembrolizumab (Keytruda), Pemetrexed (Alimta), Pemetrexed and carboplatin, Pemetrexed and cisplatin, Pemigatinib, Pentostatin (Nipent), Perjeta, Pertuzumab (Perjeta), Polatuzumab vedotin, bendamustine and rituximab (Pola-BR), Pomalidomide and dexamethasone, Ponatinib, Prednisolone, Procarbazine, Procarbazine, lomustine and vincristine (PCV), Proleukin, Prolia, Prostap, Provera, R-CHOP, R-CVP, R-DHAP, R-ESHAP, R-GCVP, R-Idelalisib (Zydelig), RICE, Raloxifene, Raltitrexed (Tomudex), Regorafenib (Stivarga), Revlimid, Ribociclib (Kisqali), Rituximab, Rixathon, Rubraca, Rucaparib (Rubraca), Ruxience, Ruxolitinib, Sevredol,
Sodium clodronate (Clasteon, Loron), Solpadol, Sorafenib, Steroids (dexamethasone, prednisolone, methylprednisolone and hydrocortisone), Stivarga, Streptozocin (Zanosar), Sunitinib (Sutent), Sutent, TIP, Tafinlar, Tagrisso, Talimogene laherparepvec (T-VEC), Tamoxifen, Tarceva, Targretin, Tasigna, Taxol, Taxotere, Taxotere and cyclophosphamide (TC), Tecentriq, Temodal, Temozolomide (Temodal), Tepadina, Tepotinib, Thalidomide, Thiotepa (Tepadina), Tivozanib (Fotivda), Tomudex, Topotecan (Hycamtin), Trabectedin (Yondelis), Trastuzumab (Herceptin), Trastuzumab and pertuzumab, Trastuzumab emtansine (Kadcyla), Treosulfan, Tretinoin (Vesanoid, ATRA), Trifluridine and tipiracil (Lonsurf), Triptorelin, Trisenox, Truxima, Tucatinib, trastuzumab and capecitabine, Tylex, VDC/IE, VIDE, Vargatef, VeIP, Vectibix, Velcade, Vemurafenib (Zelboraf), Venetoclax (Venclyxto), Vesanoid, Vidaza, Vinblastine, Vincristine, Vincristine, actinomycin D and cyclophosphamide (VAC), Vincristine, actinomycin D and ifosfamide (VAI), Vinorelbine (Navelbine), Votrient, XELOX, Xalkori, Xeloda, Xgeva, Xtandi, Yervoy, Yondelis, Zanosar, Zelboraf, Zoladex (breast cancer), Zoladex (prostate cancer), Zoledronic acid (Zometa), Zometa, Zomorph, Zydelig, or Zytiga or any combination thereof.
[0339] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the present disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
[0340] In certain embodiments of the current disclosure, methods and systems for assessing a solid tumor of a patient using machine learning are disclosed. Biopsy of the solid tumor can be relatively difficult to perform. 
Non limiting examples of solid tumors for which biopsy is relatively difficult to perform can include tumors for which performing biopsy and/or collecting samples for biopsy require invasive and/or painful surgery. The methods and systems of the invention can be used to analyze solid tumors for which obtaining a biopsy is surgically difficult, clinically invasive, dangerous for the patient, or a combination thereof. The methods and systems of the invention, including gene sets identified by the inventive methods, optionally combined with clinical characteristics data as described herein, can be used as described to classify a solid tumor as malignant or benign, with a high accuracy, sensitivity, specificity, positive predictive value, negative predictive value, or a combination thereof, without the need for obtaining a biopsy. A solid tumor appropriate for analysis using the methods and systems of the present invention can be identified by one of skill in the art as desired. In some embodiments, the solid tumor is a sarcoma, carcinoma, or lymphoma. In certain embodiments, the solid tumor is a lung, pancreatic, ovarian, kidney or brain tumor. As shown in a non-limiting manner in the Examples, using gene expression measurements of a biological sample from the patient, and optionally clinical characteristics data of the patient, the machine learning (ML) methods of the current disclosure can classify the tumor. The biological sample can be a blood sample. The methods can have relatively high accuracy, specificity, selectivity, positive predictive value, and/or negative predictive value. Further, as shown in a non-limiting manner in Example 5, it was also found that, in some embodiments, using both gene expression data and clinical characteristics data compared to using gene expression data only,
predictive power (e.g. accuracy, specificity, selectivity, positive predictive value, and/or negative predictive value) of the machine learning models and the method can be improved. For example, as shown in FIG. 17D, accuracy, specificity, and selectivity above 0.9 can be obtained with certain machine learning models using relatively fewer predictors containing genes and clinical characteristics. In certain embodiments, a treatment of cancer can be administered based on the results from machine learning classification. One of the potential benefits of certain embodiments of the current disclosure is that a biopsy can be avoided in cases where the ML classification model outputs a high confidence that a solid tumor is benign or malignant. This is a benefit because, in conventional techniques, a biopsy is always performed, as it is the only way to determine whether the solid tumor is benign or malignant. However, a biopsy procedure carries inherent risks, and the risks of a biopsy may outweigh the benefits for some patients but not others, based on their individual circumstances. The ML model can be used to better inform the clinician of whether the benefits of getting the biopsy outweigh the risks of a biopsy procedure (e.g., a situation where a biopsy can be avoided can include one where (1) the patient is at heightened risk of complications of a biopsy due to some other health-related condition or the location of the tumor and (2) the blood sample indicates that the solid tumor has a high likelihood of being benign or malignant). The ability to avoid an unnecessary biopsy can also be considered a technical advantage and/or practical benefit. Methods for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, and developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, are also disclosed. 
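By way of non-limiting illustration, the two-part situation described above (heightened biopsy risk together with a high-confidence model output) can be sketched as a simple flag. The function name `biopsy_may_be_deferred` and the 0.9 cut-off are hypothetical and are not taken from the disclosure; the flag is at most one input to the clinician, who makes the decision.

```python
def biopsy_may_be_deferred(confidence_malignant: float,
                           heightened_biopsy_risk: bool,
                           high_confidence: float = 0.9) -> bool:
    """Return True when both conditions from the text hold.

    (1) the patient is at heightened risk of biopsy complications, and
    (2) the model output is high-confidence in either direction.
    The 0.9 default is a hypothetical placeholder, not a validated cut-off.
    """
    if not 0.0 <= confidence_malignant <= 1.0:
        raise ValueError("confidence_malignant must lie between 0 and 1")
    confident_malignant = confidence_malignant >= high_confidence
    confident_benign = confidence_malignant <= 1.0 - high_confidence
    return heightened_biopsy_risk and (confident_malignant or confident_benign)


# Synthetic inputs, for illustration only.
flag_low_score = biopsy_may_be_deferred(0.05, heightened_biopsy_risk=True)
flag_uncertain = biopsy_may_be_deferred(0.50, heightened_biopsy_risk=True)
flag_low_risk = biopsy_may_be_deferred(0.95, heightened_biopsy_risk=False)
```

An intermediate model output, or a patient without heightened biopsy risk, leaves the flag unset in this sketch, so the conventional pathway is unchanged in those cases.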
[0341] While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed. [0342] The blood sample can be a whole blood sample, blood cells, serum, plasma, or any combination thereof. [0343] Tables 1, 2, 3, 4, 5, and 9 list lung disease-associated genes. Table 7 lists 31 lung disease-associated genes and 3 clinical characteristics. Table 8 lists 21 lung disease-associated genes and 1 clinical characteristic. Table 6 lists 8 clinical characteristics. Tables 1, 2, 3, 4, 5, 6, 7, 8 and 9, and all of the contents of the Tables, are incorporated as part of the specification of the disclosure. [0344] In an aspect, the present disclosure provides a method for assessing a solid tumor of a patient. The method can include any one of, any combination of, or all of steps a, b, c and d. Step a can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. Step b can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or
a benign solid tumor. Step c can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor. Step d can include electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, and the like. [0345] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. In certain embodiments, the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the data set of step a are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the at least two genes of the data set of step a are selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a are selected from a group of clinical characteristics related to pancreatic cancer. 
In certain embodiments, the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a kidney tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer. 
In some particular embodiments, the solid tumor is a brain tumor, and the at least two genes of the data set of step a, are selected from the gene set capable of
classifying a brain tumor as benign or malignant, and the one or more clinical characteristics of the data set of step a, are selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. [0346] In some embodiments, the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 4. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 1. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 2. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 3. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 5. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 7. In some embodiments, the at least two lung disease-associated genes are selected from the group of genes listed in Table 8. In some embodiments, the at least two lung disease-associated genes, e.g. 
as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. 
In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the at least two lung disease-associated genes of step a are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the at least two lung disease-associated genes, e.g. as of step a, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the at least two lung disease-associated genes of step a are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. These genes and those described herein are known to those of skill in the art, and described in the literature. Table A provides examples of Gene ID numbers for genes listed herein in the Tables, including Tables 7 and 8, as described in OMIM® - Online Mendelian Inheritance in Man (McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD) and in the National Center for Biotechnology Information gene database (NCBI, U.S. 
National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894, USA), each incorporated by reference herein in their entirety. [0347] In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics include size of the tumor (e.g. lung nodule). In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the patient include age of the patient. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics include presence of the nodule in the lung upper lobe. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the at least two lung disease-associated genes of step a comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the at least two lung disease-associated genes of step a consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the data set of step a contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
[0348] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof. 
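By way of non-limiting illustration, steps a through d for a lung tumor can be sketched as follows. The sketch uses three of the genes named above as members of Table 7 (BCAT1, CRCP, COA4) and the three clinical characteristics named above. Everything else is an assumption made for illustration: the feature names and units, the `PlaceholderModel` class with its arbitrary weights (it stands in for a trained model and does not reproduce one), the 0.5 threshold, the JSON report layout, and all input values, which are synthetic.

```python
import json
import math

# Three Table 7 genes named in the text; a real data set could use up to 31.
GENE_FEATURES = ["BCAT1", "CRCP", "COA4"]
# Clinical characteristics named in the text; names and units are hypothetical.
CLINICAL_FEATURES = ["nodule_size_mm", "age_years", "upper_lobe"]


def build_data_set(expression: dict, clinical: dict) -> list:
    """Step a: assemble one ordered feature vector for one patient."""
    genes = [float(expression[g]) for g in GENE_FEATURES]
    clin = [float(clinical[c]) for c in CLINICAL_FEATURES]
    return genes + clin


class PlaceholderModel:
    """Stand-in for a trained model; the weights are arbitrary placeholders."""

    def __init__(self, weights: list, bias: float):
        self.weights = weights
        self.bias = bias

    def predict_confidence(self, features: list) -> float:
        # Logistic function of a weighted sum, giving a value between 0 and 1.
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))


def classify(model: PlaceholderModel, features: list, threshold: float = 0.5) -> dict:
    """Steps b and c: provide the data set to the model, receive the inference."""
    confidence = model.predict_confidence(features)
    label = "malignant" if confidence >= threshold else "benign"
    return {"classification": label, "confidence_malignant": round(confidence, 3)}


def make_report(patient_id: str, inference: dict) -> str:
    """Step d: electronically output a report, here as a JSON string."""
    return json.dumps({"patient_id": patient_id, **inference}, indent=2)


# Synthetic inputs, for illustration only.
model = PlaceholderModel(weights=[0.4, -0.2, 0.1, 0.05, 0.01, 0.3], bias=-2.0)
features = build_data_set(
    expression={"BCAT1": 2.1, "CRCP": 0.7, "COA4": 1.3},
    clinical={"nodule_size_mm": 18, "age_years": 67, "upper_lobe": 1},
)
inference = classify(model, features)
report = make_report("patient-001", inference)
print(report)
```

Keeping the feature order fixed in `GENE_FEATURES` and `CLINICAL_FEATURES` is a design choice of this sketch, so that the vector supplied at inference matches the order used when a model is trained.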
[0349] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0350] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%,
at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0351] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0352] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. 
of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0353] The method can classify the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step b”, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
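By way of non-limiting illustration, the accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited above can each be computed from the four counts of a confusion matrix. The sketch below assumes that "malignant" is treated as the positive class; the function name and the counts are hypothetical, and the counts are synthetic rather than results from the disclosure.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute threshold metrics, treating 'malignant' as the positive class.

    tp: malignant tumors classified as malignant
    fp: benign tumors classified as malignant
    tn: benign tumors classified as benign
    fn: malignant tumors classified as benign
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Synthetic counts, for illustration only.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity depend only on the classifier and the chosen threshold, whereas positive and negative predictive values also depend on the proportion of malignant tumors in the evaluated population.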
[0354] The machine learning model, e.g. of step b, can infer whether the data set is indicative of a malignant solid tumor or a benign solid tumor with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. [0355] The inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the solid tumor is malignant. Higher confidence values may be correlated with a higher likelihood that the solid tumor is malignant. A malignant tumor may be characterized by having the ability to metastasize or grow invasively, which may be in contrast to a benign tumor. [0356] In some embodiments, the patient has a cancer. In some embodiments, the patient does not have cancer. In some embodiments, the patient is suspected of having a cancer. In some embodiments, the patient is at an elevated risk of having a cancer. In some embodiments, the patient is asymptomatic for a cancer. The cancer can be lung cancer, pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. In some embodiments, the patient has pancreatic cancer. In some embodiments, the patient does not have pancreatic cancer. In some embodiments, the patient is suspected of having pancreatic cancer. In some embodiments, the patient is at an elevated risk of having pancreatic cancer. In some embodiments, the patient is asymptomatic for pancreatic cancer. In some embodiments, the patient has ovarian cancer. 
In some embodiments, the patient does not have ovarian cancer. In some embodiments, the patient is suspected of having ovarian cancer. In some embodiments, the patient is at an elevated risk of having ovarian cancer. In some embodiments, the patient is asymptomatic for ovarian cancer. In some embodiments, the patient has kidney cancer. In some embodiments, the patient does not have kidney cancer. In some embodiments, the patient is suspected of having kidney cancer. In some embodiments, the patient is at an elevated risk of having a kidney cancer. In some embodiments, the patient is asymptomatic for kidney cancer. In some embodiments, the patient has brain cancer. In some embodiments, the patient does not have brain cancer. In some embodiments, the patient is suspected of having brain cancer. In some embodiments, the patient is at an elevated risk of having brain cancer. In some embodiments, the patient is asymptomatic for brain cancer. In some embodiments, the patient has a lung cancer. In some embodiments, the patient does not have lung cancer. In some embodiments, the patient is suspected of having a lung cancer. In some embodiments, the patient is at an elevated risk of having a lung cancer. In some embodiments, the patient is asymptomatic for a lung cancer. [0357] In certain embodiments, the method includes optionally performing biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, a biopsy is performed. In
some embodiments, a biopsy is not performed. The decision to perform a biopsy may be made by one of skill in the art, based on knowledge and experience, in view of the classification of the lung nodule of the patient as the malignant lung nodule or the benign lung nodule. The decision to perform a biopsy may depend in part on the confidence value of the inference. In certain embodiments, biopsy of the solid tumor of the patient is not performed. [0358] In some embodiments, the method further includes administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. In some embodiments, the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient. The treatment can include one or more treatments of cancer. The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the treatment can be a treatment for the lung, pancreatic, ovarian, kidney, or brain cancer, respectively. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of pancreatic cancer to the patient based at least in part on the classification of the pancreatic tumor of the patient as the malignant tumor. 
In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of ovarian cancer to the patient based at least in part on the classification of the ovarian tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of kidney cancer to the patient based at least in part on the classification of the kidney tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method includes administering a treatment of brain cancer to the patient based at least in part on the classification of the brain tumor of the patient as the malignant tumor. In some embodiments, the method includes administering a treatment to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor or the benign tumor. In some embodiments, the method comprises administering a treatment of lung cancer to the patient based at least in part on the classification of the lung tumor of the patient as the malignant tumor. The treatment can include one or more treatments of cancer. In some embodiments, the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. [0359] The machine learning model of step b can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant. 
The machine-learning model, e.g. of step b,
can generate the inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor, by comparing the data set to a reference data set. The machine-learning model can be trained using the reference data set according to the methods described herein. In some embodiments, the reference data set can contain gene expression measurements of a plurality of reference biological samples from a plurality of reference subjects having a solid tumor, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, data regarding whether the solid tumors of the reference subjects are benign or malignant, and optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor. In some embodiments, the reference data set contains a plurality of individual reference data sets. A respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of a reference biological sample from a reference subject having a reference solid tumor of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject. The plurality of individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different individual reference data sets are obtained from different reference subjects.
In some embodiments, each of the individual reference data sets contains i) gene expression measurements of a reference biological sample from one reference subject of the at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, ii) data regarding whether the lung nodule of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, and wherein different individual reference data sets are obtained from different reference subjects. In certain embodiments, oversampling or undersampling correction can be made during training of the machine learning model. For example, if a reference data set includes a greater number of samples identified as benign and a relatively fewer number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples. The reference tumor and the solid tumor can be of the same type of tumor: both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors. In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
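By way of non-limiting illustration, the oversampling correction described above can be sketched as follows. The sketch assumes simple random duplication of the minority class; the function name, the toy expression values, and the label strings are hypothetical and are not taken from the disclosure. Other corrections (e.g. undersampling the majority class) are equally possible.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of the smaller classes until every class
    has as many samples as the largest class (assumed correction scheme)."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies only for classes that are short of the target.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: each sample holds expression values for two genes.
samples = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.8], [1.2, 2.2], [3.0, 0.5], [3.1, 0.4]]
labels = ["benign", "benign", "benign", "benign", "malignant", "malignant"]

balanced_samples, balanced_labels = oversample_minority(samples, labels)
counts = Counter(balanced_labels)
```

In practice such a correction is typically applied to the training portion only, so that duplicated samples do not also appear in the validation portion.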
In certain embodiments, the solid tumor is a pancreatic tumor, the method can classify a pancreatic tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a
pancreatic tumor as benign or malignant, and ii) clinical characteristic data of one or more characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, the method can classify an ovarian tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer.
In certain embodiments, the solid tumor is a kidney tumor, the method can classify a kidney tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristic data of one or more characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, the method can classify a brain tumor as malignant or benign, and the reference data set can contain i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristic data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In some embodiments, the genes of the data set and genes of the reference data set can at least partially overlap. In some embodiments, clinical characteristics of the data set and clinical characteristics of the reference data set can at least partially overlap. The gene set capable of classifying a solid tumor, e.g. pancreatic, ovarian, kidney, or brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g.
method of steps a’, b’, c’ and/or d’) described herein.
[0360] In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and one or more clinical characteristics of the reference data set is selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5.
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the solid tumor is a lung tumor, the at least two genes of the reference data set comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the at least two genes of the reference data set
consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of the reference data set consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the age of the patient. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the presence of the nodule in the lung upper lobe. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, and the one or more clinical characteristics of the reference data set include the size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a lung tumor, the method can classify a lung tumor as malignant or benign, the at least two genes of the reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and the one or more clinical characteristics of the reference data set include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. The genes of the data set and genes of the reference data set can at least partially overlap and/or the optional clinical characteristics of the data set and optional clinical characteristics of the reference data set can at least partially overlap.
[0361] The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof.
In some embodiments, the reference biological sample is a saliva sample, or any derivative
thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof. The reference subjects can be human.
[0362] Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
[0363] In some embodiments, the trained machine learning model, e.g. of step b, is a supervised machine learning algorithm or an unsupervised machine learning algorithm. In some embodiments, the trained machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the trained machine learning model is trained using LOG. In some embodiments, the trained machine learning model is trained using Ridge regression. In some embodiments, the trained machine learning model is trained using Lasso regression. In some embodiments, the trained machine learning model is trained using GLM. In some embodiments, the trained machine learning model is trained using kNN. In some embodiments, the trained machine learning model is trained using SVM. In some embodiments, the trained machine learning model is trained using GBM.
In some embodiments, the trained machine learning model is trained using RF. In some embodiments, the trained machine learning model is trained using NB. In some embodiments, the trained machine learning model is trained using EN regression. In some embodiments, the trained machine learning model is trained using a neural network. In some embodiments, the trained machine learning model is trained using a deep learning algorithm. In some embodiments, the trained machine learning model is trained using LDA. In some embodiments, the trained machine learning model is trained using DTREE. In some embodiments, the trained machine learning model is trained using ADB.
[0364] In some embodiments, the method comprises determining a likelihood of the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In some embodiments, the likelihood is about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 91%, about 92%, about 93%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 100%. In some embodiments, the likelihood is at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
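By way of non-limiting illustration, one way a likelihood can accompany a benign/malignant classification is sketched below for a logistic regression (LOG) style model. The weights, bias, decision threshold of 0.5, and feature values are hypothetical placeholders rather than parameters of any model described herein.

```python
import math


def malignancy_probability(features, weights, bias):
    """Logistic function applied to a weighted sum of the input features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def classify_with_likelihood(features, weights, bias, threshold=0.5):
    """Return the predicted class and the likelihood of that class."""
    p_malignant = malignancy_probability(features, weights, bias)
    if p_malignant >= threshold:
        return "malignant", p_malignant
    return "benign", 1.0 - p_malignant


# Hypothetical two-feature model and one hypothetical patient data set.
weights = [1.0, -1.0]
bias = 0.0
label, likelihood = classify_with_likelihood([1.0, 3.0], weights, bias)
```

With a threshold of 0.5 on a two-class model, the likelihood of the assigned class falls between 50% and 100%, which matches the range of values recited above.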
[0365] In some embodiments, the method further comprises monitoring the solid tumor of the patient, wherein the monitoring comprises assessing the solid tumor of the patient at a plurality of time points. In some embodiments, a difference in the assessment of the solid tumor of the patient among the plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (i) a diagnosis of the solid tumor of the patient, (ii) a prognosis of the solid tumor of the patient, and (iii) an efficacy or non-efficacy of a course of treatment for treating the solid tumor of the patient. In some embodiments, the plurality of time points comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 different time points.
[0366] In an aspect, the present disclosure provides a method for determining a gene set capable of classifying a solid tumor as benign or malignant. Gene expression measurements of one or more genes of the gene set of a biological sample (e.g. blood) from a patient can be used to classify a solid tumor of the patient as benign or malignant without performing a biopsy of the solid tumor. In some embodiments, a biopsy of the solid tumor can be performed to confirm and/or follow up the classification results obtained by using the gene expression measurement data. In some embodiments, a biopsy of the solid tumor is not performed. The method can include any one of, any combination of, or all of steps a’, b’, c’ and d’. In step a’, a reference data set can be obtained and/or provided.
[0367] The reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. The reference data set can contain a plurality of individual reference data sets. A respective individual reference data set of the plurality of individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference tumor is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the reference subject. The plurality of individual reference data sets can be obtained from a plurality of reference subjects. In some embodiments, different individual reference data sets can be obtained from different reference subjects. In some embodiments, each of the individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different individual reference data sets are obtained from different reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor.
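By way of non-limiting illustration, an individual reference data set with its three components (gene expression measurements, benign/malignant data, and optional clinical characteristics data) can be represented as sketched below. The class name, field names, gene identifiers, and values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class IndividualReferenceDataSet:
    subject_id: str                              # one reference subject per record
    gene_expression: Dict[str, float]            # i) gene name -> expression measurement
    is_malignant: bool                           # ii) benign (False) or malignant (True)
    clinical: Optional[Dict[str, float]] = None  # iii) optional clinical characteristics


def split_by_label(
    records: List[IndividualReferenceDataSet],
) -> Tuple[List[IndividualReferenceDataSet], List[IndividualReferenceDataSet]]:
    """Separate the benign portion from the malignant portion of the reference subjects."""
    benign = [r for r in records if not r.is_malignant]
    malignant = [r for r in records if r.is_malignant]
    return benign, malignant


# Hypothetical reference data set made of three individual reference data sets.
reference_data_set = [
    IndividualReferenceDataSet("S1", {"GENE_A": 1.0, "GENE_B": 2.0}, False),
    IndividualReferenceDataSet("S2", {"GENE_A": 3.1, "GENE_B": 0.4}, True,
                               clinical={"age": 67.0, "nodule_size_mm": 14.0}),
    IndividualReferenceDataSet("S3", {"GENE_A": 1.2, "GENE_B": 2.2}, False),
]
benign_portion, malignant_portion = split_by_label(reference_data_set)
```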
In step b’, a machine learning model can be trained using the reference data set to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics. In some embodiments, the machine learning model can be trained using a training
data set containing a first portion of the reference data set, and a validation data set containing a second portion of the reference data set. In certain embodiments, oversampling or undersampling correction can be made during training of the machine learning model. For example, if a data set includes a greater number of samples identified as benign and a relatively fewer number of samples identified as malignant, the malignant samples may be oversampled to produce a data set that has an equal number of benign and malignant samples. In step c’, feature importance values of the plurality of genes can be determined. In step d’, the gene set can be selected. The gene set can be selected as predictors that are used to train the machine learning model. The gene set may be selected based at least in part on feature importance values. In some embodiments, the feature importance values of the genes of the gene set are within the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, feature importance values of the plurality of genes. In some embodiments, the feature importance of the genes of the gene set can have an accuracy greater than 30 %, 35 %, 40 %, 45 %, 50 %, 55 %, 60 %, 65 %, 70 %, 75 %, 80 %, or 90 %. In some embodiments, the feature importance of the genes of the gene set can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80 or 90.
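By way of non-limiting illustration, steps c’ and d’ can be sketched as scoring each gene and keeping the genes whose scores rank within the top N. The score used here, the absolute difference of class means divided by the overall standard deviation, is only a simple stand-in for model-derived feature importance values; the gene names and expression values are hypothetical.

```python
from statistics import mean, pstdev


def importance_scores(expression, labels):
    """Stand-in importance: |mean(malignant) - mean(benign)| / overall std per gene."""
    scores = {}
    for gene, values in expression.items():
        benign = [v for v, y in zip(values, labels) if y == "benign"]
        malignant = [v for v, y in zip(values, labels) if y == "malignant"]
        spread = pstdev(values) or 1.0  # guard against constant genes
        scores[gene] = abs(mean(malignant) - mean(benign)) / spread
    return scores


def select_top_genes(scores, top_n):
    """Step d' sketch: keep the genes with the top_n highest importance values."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical expression values: one list per gene, one entry per reference subject.
labels = ["benign", "benign", "malignant", "malignant"]
expression = {
    "GENE_A": [1.0, 1.2, 3.0, 3.1],
    "GENE_B": [2.0, 2.1, 2.0, 2.1],
    "GENE_C": [5.0, 4.0, 5.5, 4.9],
}
scores = importance_scores(expression, labels)
gene_set = select_top_genes(scores, top_n=2)
```

A tree-based or permutation-based importance measure computed from the trained model could be substituted for the stand-in score without changing the selection step.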
In certain embodiments, the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the machine learning model include the genes of the gene set.
[0368] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. The reference tumor and the solid tumor can be of the same type of tumor: both can be lung tumors, both can be pancreatic tumors, both can be ovarian tumors, both can be kidney tumors, or both can be brain tumors. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 9. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 3.
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes
of the reference data set of step a’, include at least 2 genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from the group of genes listed in Table 9. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2.
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the reference data set of step a’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the reference data set of step a’, include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
In some embodiments, the solid tumor is a lung tumor, the plurality of genes of the reference data set of step a’, include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the reference data set of step a’, include 1, 2, 3, 4, 5, 6, 7, or 8 clinical
characteristics selected from the group of clinical characteristics listed in Table 6. In certain embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer.
In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the reference data set of step a’, contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the reference data set of step a’, are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression, with correlation coefficients above a threshold (e.g., in non-limiting aspects, > 0.7 to > 0.9), are removed from the reference data set. Collinear gene expression can be measured by any suitable technique, e.g., the Pearson correlation coefficient. While determining feature importance values for the plurality of genes has been described, this is merely a non-limiting illustrative example of a technique that can be practiced.
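By way of non-limiting illustration, the removal of genes having collinear expression can be sketched as follows. The sketch assumes a greedy rule (a gene is kept only if the absolute value of its Pearson correlation with every previously kept gene does not exceed the threshold); the greedy order, the use of the absolute value, the data layout, and the gene names are assumptions made for illustration and are not taken from the disclosure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant gene is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_collinear_genes(expression, threshold=0.9):
    """Greedily keep genes whose |r| with all kept genes is <= threshold.

    expression: dict mapping a (hypothetical) gene name to a list of
    expression values, one value per reference subject.
    """
    kept = []
    for gene, values in expression.items():
        if all(abs(pearson(values, expression[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical expression values for five reference subjects.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly proportional to GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept_genes = drop_collinear_genes(expression, threshold=0.9)
print(kept_genes)
```

In this sketch GENE_B is dropped because its expression is almost perfectly correlated with GENE_A, while GENE_C is retained. A lower threshold (e.g. 0.7) removes more genes.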
In various embodiments, one or more feature selection techniques are used to determine the gene set that can classify a solid tumor as benign or malignant. Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof. In some of these cases, feature importance values need not be calculated for each of the genes of the plurality of genes. The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof. In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is a CSF sample, or any derivative thereof. The reference subjects can be human.
[0369] The machine learning model, e.g. of step b’, can be trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm. In some embodiments, the machine learning model, e.g. 
of step b’, is trained using linear regression, logistic regression, Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the machine learning model is trained using logistic regression. In some embodiments, the machine learning model is trained using Ridge regression. In some embodiments, the machine learning model is trained using Lasso regression. In some embodiments, the machine learning model is trained using GLM. In some embodiments, the machine learning model is trained using kNN. In some embodiments, the machine learning model is trained using SVM. In some embodiments, the machine learning model is trained using GBM. In some embodiments, the machine learning model is trained using RF. In some embodiments, the machine learning model is trained using NB. In some embodiments, the machine learning model is trained using the EN regression. In some embodiments, the machine learning model is trained using neural network. In some embodiments, the machine learning model is trained using deep learning algorithm. In some embodiments, the machine learning model is trained using LDA. In some embodiments, the machine learning model is trained using DTREE. In some embodiments, the machine learning model is trained using ADB. 
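By way of non-limiting illustration, training with one of the listed algorithms, logistic regression, can be sketched as follows. The sketch uses plain batch gradient descent; the learning rate, epoch count, decision cutoff, feature values, and function names are hypothetical choices made for illustration and are not specified in the disclosure.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss.

    rows: list of feature vectors (e.g. gene expression measurements).
    labels: 1 for a malignant reference tumor, 0 for a benign one.
    """
    n_features = len(rows[0])
    n = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y  # derivative of the log-loss w.r.t. the score
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_malignant(weights, bias, x, cutoff=0.5):
    """Return True when the inferred probability of malignancy >= cutoff."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= cutoff


# Hypothetical, already-standardized measurements of two predictors.
rows = [[-1.2, 0.1], [-0.8, -0.2], [-1.0, 0.3],
        [0.9, -0.1], [1.1, 0.2], [1.3, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(rows, labels)
print(predict_malignant(weights, bias, [1.2, 0.1]))
print(predict_malignant(weights, bias, [-1.1, 0.0]))
```

Any of the other listed algorithms could be substituted for the training function without changing the surrounding steps.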
[0370] The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
The gene set can classify a solid tumor as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0371] In another aspect, the present disclosure provides a method for developing a trained machine learning model capable of classifying a solid tumor of a patient, as benign or malignant. The method can include any one of, any combination of, or all of steps a”, b”, c”, d” and e”. Step a”, can include obtaining and/or providing a first reference data set. In some embodiments, the first reference data set can contain i) gene expression measurements of a plurality of genes of reference biological samples from reference subjects each having at least one reference solid tumor, ii) data regarding whether the reference tumors are benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subjects. The first reference data set can contain a plurality of first individual reference data sets. A respective first individual reference data set of the plurality of first individual reference data sets can contain i) gene expression measurements of the plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) clinical characteristics data of the one or more clinical characteristics of the reference subject. The plurality of first individual reference data sets can be obtained from a plurality of reference subjects. 
In some embodiments, different first individual reference data sets are obtained from different reference subjects. In some embodiments, each of the first individual reference data sets contains i) gene expression measurements of the plurality of genes of a reference biological sample from one reference subject, ii) data regarding whether the reference solid tumor of the one reference subject is benign or malignant and iii) optionally clinical characteristics data of the one or more clinical characteristics of the one reference subject, wherein different first individual reference data sets are obtained from different reference subjects. A first portion of the plurality of reference subjects can have a benign solid tumor, and a second portion of the plurality of reference subjects can have a malignant solid tumor. In step b”, a first machine learning model can be trained using the first reference data set to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics. The first machine learning model can be trained to infer whether the solid tumor is benign or malignant, based at least in part on the measurement data of the plurality of genes, and optionally the clinical characteristics data of the one or more clinical characteristics. In some embodiments, the first machine learning model can be trained using a training data set containing a first portion of the first reference data set, and a validation data set containing a second portion of the first reference data set. In step c”, feature importance values of one or more predictors of the first machine learning model can be determined.
In step d”, A predictors of the first machine learning model can be selected based at least in part on the feature importance values, where A can be an integer from 3 to 2000, such as 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, 250, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, or 2000, or any integer value therein. In certain embodiments, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model can be selected. In some embodiments, the A predictors can have the top A feature importance values; for example, in a non-limiting aspect, A can be 10, and the 10 predictors having the 10 highest feature importance values can be selected. In some embodiments, the feature importance of the A predictors can have an accuracy greater than 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80% or 90%. In some embodiments, the feature importance of the A predictors can have a threshold importance greater than 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80 or 90.
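By way of non-limiting illustration, selecting the A predictors from feature importance values can be sketched as follows, both by rank (the top A values) and by a threshold importance. The predictor names, the importance values, and the alphabetical tie-breaking rule are hypothetical and serve only to make the sketch concrete.

```python
def select_top_predictors(importances, a):
    """Return the names of the `a` predictors with the highest importance.

    importances: dict mapping predictor name -> feature importance value.
    Ties are broken alphabetically (an assumption made for determinism).
    """
    if a < 1 or a > len(importances):
        raise ValueError("a must be between 1 and the number of predictors")
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:a]]


def select_above_threshold(importances, threshold):
    """Return predictors whose importance exceeds `threshold`, best first."""
    ranked = sorted(importances.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, value in ranked if value > threshold]


# Hypothetical importance values for genes and clinical characteristics.
importances = {
    "GENE_1": 82.0,
    "GENE_2": 12.5,
    "GENE_3": 67.3,
    "nodule_size": 91.0,
    "age": 45.0,
    "GENE_4": 30.1,
}
top_three = select_top_predictors(importances, a=3)
above_forty = select_above_threshold(importances, threshold=40)
print(top_three)
print(above_forty)
```

The two rules can give different predictor counts: the rank rule fixes A in advance, whereas the threshold rule lets the data determine A.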
In some embodiments, the A predictors form the top 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 210, 220, 230, 240, or 250, or any value or range there between, predictors of the first machine learning model. The A predictors can include one or more genes, and/or optionally one or more clinical characteristics. While determining feature importance values for the one or more predictors has been described, this is merely a non-limiting illustrative example of a technique that can be practiced. In various embodiments, one or more feature selection techniques are used to determine the A predictors. Feature selection techniques can include least absolute shrinkage and selection operator (Lasso) regression, support vector machine (SVM), regularized trees, decision trees, memetic algorithm, random multinomial logit (RMNL), auto-encoding networks, submodular feature selection, recursive feature elimination, or any combination thereof. In some of these cases, in step c”, feature importance values need not be calculated for each of the predictors of the first machine learning model. Step e”, can include training a second machine learning model based at least in part on a second reference data set to obtain the trained machine learning model. The second reference data set can contain i) measurement data of the A predictors of the reference subjects, and ii) data regarding whether the solid tumors of the reference subjects are benign or malignant. The second reference data set can contain a plurality of second individual reference data sets. A respective second individual reference data set of the plurality of second individual reference data sets can include i) measurement data of the A predictors of a reference subject, and ii) data regarding whether the solid tumor of the reference subject is benign or malignant. The plurality of second individual reference data sets can be obtained from the plurality of reference subjects. In some embodiments, different second individual reference data sets are obtained from different reference subjects. In some embodiments, each of the second individual reference data sets contains i) measurement data of the A predictors of one reference subject, and ii) data regarding whether the solid tumor of the one reference subject is benign or malignant, wherein different second individual reference data sets are obtained from different reference subjects.
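By way of non-limiting illustration, the flow of steps a” through e” (train a first model on all predictors, rank predictors by importance, keep the top A, and train a second model on the reduced second reference data set) can be sketched as follows. The sketch substitutes a nearest-centroid classifier for both models and uses the absolute difference between class centroids as the feature importance; these substitutions, the predictor names, and the values are assumptions made for illustration and are not the algorithms of the disclosure.

```python
def centroids(rows, labels):
    """Per-class mean vector; label 0 = benign, label 1 = malignant."""
    result = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        result[cls] = [sum(col) / len(members) for col in zip(*members)]
    return result


def predict(cents, row):
    """Assign the class whose centroid is nearest (squared distance)."""
    def dist2(center):
        return sum((v - m) ** 2 for v, m in zip(row, center))
    return 1 if dist2(cents[1]) < dist2(cents[0]) else 0


def develop_model(rows, labels, names, a):
    first_model = centroids(rows, labels)                      # step b"
    importance = [abs(m - b) for b, m in
                  zip(first_model[0], first_model[1])]         # step c"
    top = sorted(range(len(names)),
                 key=lambda j: -importance[j])[:a]             # step d"
    reduced = [[r[j] for j in top] for r in rows]  # second reference data set
    second_model = centroids(reduced, labels)                  # step e"
    return [names[j] for j in top], top, second_model


# Step a": hypothetical first reference data set (five predictors).
names = ["GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5"]
rows = [
    [1.0, 5.0, 0.2, 3.0, 2.0], [1.2, 5.1, 0.1, 3.1, 2.4],
    [0.8, 4.9, 0.3, 2.9, 2.2],                      # benign subjects
    [3.0, 5.0, 2.2, 3.2, 0.4], [3.2, 5.2, 2.0, 3.0, 0.6],
    [2.8, 4.8, 2.1, 3.1, 0.2],                      # malignant subjects
]
labels = [0, 0, 0, 1, 1, 1]

selected, columns, model = develop_model(rows, labels, names, a=3)
new_sample = [2.9, 5.0, 1.9, 3.0, 0.5]
inference = predict(model, [new_sample[j] for j in columns])
print(selected, inference)
```

Only the selected columns are measured for a new sample, which is the practical benefit of the two-stage design: the final model depends on A predictors rather than on the full plurality of genes.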
Measurement data of the A predictors can include, gene expression measurements in the reference sample of the one or more genes features of the A predictors, and/or optionally clinical characteristics data of one or more clinical characteristics features of the A predictors. The trained machine learning model can infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors. In some embodiments, the one or more genes features of the A predictors can form the gene set capable of classifying a solid tumor, as benign or malignant. In certain embodiments, oversampling or undersampling correction can be made during training of the first and/or second machine learning model. [0372] The solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor. The reference tumor, and solid tumor, can be of same type of tumor, such as both can be lung tumor, pancreatic tumor, both can be ovarian tumor, both can be kidney tumor, or both can be brain tumor. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in Table 9. 
In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 300, 400, 500, 600, 700, 800, 900, 1000, 1100 or 1178, or any value or range there between, genes selected from the group of genes listed in Table 9. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 3. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the plurality of genes of the first reference data set include at least 2 genes selected from a group of genes listed in any one or more of Tables 1, 2, 3, 4, 5 and 9, and the one or more clinical characteristics of the first reference data set include 1, 2, 3, 4, 5, 6, 7, or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6.
In certain embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer. In some particular embodiments, the solid tumor is a pancreatic tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to pancreatic cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In some particular embodiments, the solid tumor is an ovarian tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to ovarian cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer.
In certain embodiments, the solid tumor is a kidney tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer. In some particular embodiments, the solid tumor is a kidney tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to kidney cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer. In certain embodiments, the solid tumor is a brain tumor, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some particular embodiments, the solid tumor is a brain tumor, and the plurality of genes of the first reference data set contains at least 2 genes selected from a group of genes related to brain cancer, and the one or more clinical characteristics of the first reference data set are selected from a group of clinical characteristics related to brain cancer. In some embodiments, genes having collinear expression, with correlation coefficients above a threshold (e.g., in non-limiting aspects, > 0.7 to > 0.9), are removed from the reference data set. Collinear gene expression can be measured by any suitable technique, e.g., the Pearson correlation coefficient. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1, as predictors.
In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4, as predictors. 
In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors include i) at least 2 genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof, as predictors. In some embodiments, the solid tumor is a lung tumor, and the A predictors can include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33 or 34 predictors selected from the group listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors comprise the 34 predictors listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the A predictors consist of the 34 predictors listed in Table 7.
[0373] The reference biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the reference biological sample is a blood sample or any derivative thereof. In some embodiments, the reference biological sample is PBMCs or any derivative thereof. In some embodiments, the reference biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the reference biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the reference biological sample is a saliva sample, or any derivative thereof. In some embodiments, the reference biological sample is a urine sample, or any derivative thereof.
In some embodiments, the reference biological sample is a stool sample, or any derivative thereof. In some embodiments, the reference biological sample is CSF sample, or any derivative thereof. [0374] The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. 
obtained in step e”, can infer whether a solid tumor is benign or malignant with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The trained machine learning model, e.g. obtained in step e”, can infer whether a solid tumor is benign or malignant with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
[0375] Gene expression data can be obtained by a data analysis tool selected from the group: a BIG-C™ big data analysis tool, an I-Scope™ big data analysis tool, a T-Scope™ big data analysis tool, a Cell Scan big data analysis tool, an MS (Molecular Signature) Scoring™ analysis tool, and a Gene Set Variation Analysis (GSVA) tool (e.g., P-Scope).
[0376] In some embodiments, the trained machine learning model is trained using a supervised machine learning algorithm or an unsupervised machine learning algorithm.
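By way of non-limiting illustration, the performance measures recited above for the trained machine learning model (accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and ROC AUC) can be computed as sketched below. Malignant is treated as the positive class; the labels, scores, and the 0.5 cutoff are hypothetical values, and the AUC is computed by the pairwise-ranking (Mann-Whitney) formulation, which is one standard way to obtain it.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = malignant and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def ratio(num, den):
    """Return num / den, or None when the denominator is zero."""
    return num / den if den else None


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }


def roc_auc(y_true, scores):
    """Probability that a malignant case outscores a benign case
    (ties count one half); equal to the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical reference labels and model scores for eight subjects.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Sensitivity, specificity, and the predictive values depend on the chosen cutoff, whereas the AUC summarizes ranking quality across all cutoffs.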
In some embodiments, first and/or second machine-learning model is independently trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB), neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), a decision tree learning (DTREE), adaptive boosting (ADB), or any combination thereof. In some embodiments, the first and/or second machine-learning model is independently trained using LOG. In some embodiments, the first and/or second machine-learning model is independently trained using Ridge regression. In some embodiments, the first and/or second machine-learning model is independently trained using Lasso regression. In some embodiments, the first and/or second machine-learning model is independently trained using GLM. In some embodiments, the first and/or second machine-learning model is independently trained using kNN. In some embodiments, the first and/or second machine-learning model is independently trained using SVM. In some embodiments, the first and/or second machine- learning model is independently trained using GBM. In some embodiments, the first and/or second machine-learning model is independently trained using RF. In some embodiments, the first and/or second machine-learning model is independently trained using NB. In some embodiments, the first and/or second machine-learning model is independently trained using the EN regression. In some embodiments, the first and/or second machine-learning model is independently trained using neural network. In some embodiments, the first and/or second machine-learning model is independently trained using deep learning algorithm. In some embodiments, the first and/or second machine-learning model is
independently trained using LDA. In some embodiments, the first and/or second machine-learning model is independently trained using DTREE. In some embodiments, the first and/or second machine-learning model is independently trained using ADB.
[0377] In an aspect, the present disclosure provides a method for treating cancer in a patient. In some embodiments, the patient has a solid tumor. The method can include any one of, any combination of, or all of steps a”’, b”’, c”’ and d”’. Step a”’, can include obtaining a data set containing i) gene expression measurements of a biological sample from the patient, of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The gene expression measurements can be obtained by assaying the biological sample. Step b”’, can include providing the data set as input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer. In some embodiments, the inference indicates whether the data set is indicative of the solid tumor of the patient being malignant or benign. Step c”’, can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer. In some embodiments, the inference received as an output indicates whether the solid tumor of the patient is malignant or benign. Step d”’, can include administering a treatment based on the determination that the patient has cancer. In some embodiments, the treatment is administered based on the patient’s solid tumor being classified as malignant.
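By way of non-limiting illustration, steps a”’ through d”’ can be sketched in Python as follows. The logistic scoring function merely stands in for any of the trained models described herein. The coefficient values, the intercept, the 0.5 decision threshold, the clinical feature names `nodule_size_mm` and `age_years`, and all function names are hypothetical assumptions made for this sketch and are not taken from the disclosure.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical coefficients, for illustration only (not values from the disclosure).
GENE_WEIGHTS = {"BCAT1": 0.8, "TDRD9": -0.5}
CLINICAL_WEIGHTS = {"nodule_size_mm": 0.05, "age_years": 0.01}
INTERCEPT = -1.5


@dataclass
class Inference:
    probability_malignant: float  # value between 0 and 1
    label: str                    # "malignant" or "benign"


def build_data_set(gene_expression: Dict[str, float],
                   clinical: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Step a''': combine at least two gene measurements with optional clinical data."""
    if len(gene_expression) < 2:
        raise ValueError("at least two gene expression measurements are required")
    data_set = dict(gene_expression)
    data_set.update(clinical or {})
    return data_set


def infer(data_set: Dict[str, float], threshold: float = 0.5) -> Inference:
    """Steps b''' and c''': provide the data set to the model and receive the inference."""
    score = INTERCEPT
    for name, weight in {**GENE_WEIGHTS, **CLINICAL_WEIGHTS}.items():
        score += weight * data_set.get(name, 0.0)  # features not supplied contribute 0
    probability = 1.0 / (1.0 + math.exp(-score))   # logistic link
    label = "malignant" if probability >= threshold else "benign"
    return Inference(probability, label)


def treatment_indicated(inference: Inference) -> bool:
    """Step d''': treatment is considered only when the tumor is classified as malignant."""
    return inference.label == "malignant"
```

In this sketch, a data set built from two gene measurements and two clinical values is passed to `infer`, and `treatment_indicated` returns `True` only for a malignant classification. In practice the stand-in scoring function would be replaced by a model trained as described herein.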
[0378] The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor, respectively. In certain embodiments, the cancer is lung cancer, and the solid tumor is a lung tumor. In certain embodiments, the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor. In certain embodiments, the cancer is ovarian cancer, and the solid tumor is an ovarian tumor. In certain embodiments, the cancer is kidney cancer, and the solid tumor is a kidney tumor. In certain embodiments, the cancer is brain cancer, and the solid tumor is a brain tumor. In some embodiments, the cancer is lung cancer, the gene set of the reference data set is the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and the optional one or more clinical characteristics of the reference data set are selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of the biological sample from the patient, of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range therebetween, genes selected from the group of genes listed in Table 1.
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range therebetween, genes selected from the group of genes listed in Table 2. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, or 62, or any value or range therebetween, genes selected from the group of genes listed in Table 3. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range therebetween, genes selected from the group of genes listed in Table 4.
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range therebetween, genes selected from the group of genes listed in Table 5. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step a”’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of step a”’, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4.
In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a”’, include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step a”’, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the data set of step a”’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step a”’, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”’ comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step a”’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step a”’ consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe.
[0379] In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
In certain embodiments, the cancer is pancreatic cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. 
In certain embodiments, the cancer is kidney cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the cancer is brain cancer, and the data set of step a”’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein.
[0380] The inference from the machine learning model can include a confidence value between 0 and 1, such as 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range therebetween, that the tumor is malignant, where higher confidence values may be correlated with a higher likelihood that the tumor is malignant. A malignant tumor may be characterized by the ability to metastasize or grow invasively, which may be in contrast to benign nodules.
[0381] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof.
In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
[0382] In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. In certain embodiments, the method includes optionally performing a biopsy of the solid tumor of the patient based at least in part on the classification of the solid tumor of the patient as the malignant solid tumor. The decision to perform a biopsy may depend on the confidence value of the inference. In certain embodiments, a biopsy of the solid tumor of the patient is not performed. The machine-learning model, e.g. of step b”’, can generate the inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate the patient has cancer, and the patient having a benign solid tumor may indicate the patient does not have cancer. The machine-learning model of step b”’, can be trained according to a method described herein, e.g. according to the methods of training the machine-learning model of step b.
[0383] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80% to about 100%.
In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80% to about 85%, about 80% to about 90%, about 80% to about 92%, about 80% to about 94%, about 80% to about 95%, about 80% to about 96%, about 80% to about 97%, about 80% to about 98%, about 80% to about 99%, about 80% to about 99.5%, about 80% to about 100%, about 85% to about 90%, about 85% to about 92%, about 85% to about 94%, about 85% to about 95%, about 85% to about 96%, about 85% to about 97%, about 85% to about 98%, about 85% to about 99%, about 85% to about 99.5%, about 85% to about 100%, about 90% to about 92%, about 90% to about 94%, about 90% to about 95%, about 90% to about 96%, about 90% to about 97%, about 90% to about 98%, about 90% to about 99%, about 90% to about 99.5%, about 90% to about 100%, about 92% to about 94%, about 92% to about 95%, about 92% to about 96%, about 92% to about 97%, about 92% to about 98%, about 92% to about 99%, about 92% to about 99.5%, about 92% to about 100%, about 94% to about 95%, about 94% to about 96%, about 94% to about 97%, about 94% to about 98%, about 94% to about 99%, about 94% to about 99.5%, about 94% to about 100%, about 95% to about 96%, about 95% to about 97%, about 95% to about 98%, about 95% to about 99%, about 95% to about 99.5%, about 95% to about 100%, about 96% to about 97%, about 96% to about 98%, about 96% to about 99%, about 96% to about 99.5%, about 96% to about 100%, about 97% to about 98%, about 97% to about 99%, about 97% to about 99.5%, about 97% to about 100%, about 98% to about 99%, about 98% to about 99.5%, about 98% to about 100%, about 99% to about 99.5%, about 99% to about 100%, or about 99.5% to about 100%.
In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at least about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 99.5%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with an accuracy of at most about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%.
[0384] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80% to about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80% to about 85%, about 80% to about 90%, about 80% to about 92%, about 80% to about 94%, about 80% to about 95%, about 80% to about 96%, about 80% to about 97%, about 80% to about 98%, about 80% to about 99%, about 80% to about 99.5%, about 80% to about 100%, about 85% to about 90%, about 85% to about 92%, about 85% to about 94%, about 85% to about 95%, about 85% to about 96%, about 85% to about 97%, about 85% to about 98%, about 85% to about 99%, about 85% to about 99.5%, about 85% to about 100%, about 90% to about 92%, about 90% to about 94%, about 90% to about 95%, about 90% to about 96%, about 90% to about 97%, about 90% to about 98%, about 90% to about 99%, about 90% to about 99.5%, about 90% to about 100%, about 92% to about 94%, about 92% to about 95%, about 92% to about 96%, about 92% to about 97%, about 92% to about 98%, about 92% to about 99%, about 92% to about 99.5%, about 92% to about 100%, about 94% to about 95%, about 94% to about 96%, about 94% to about 97%, about 94% to about 98%, about 94% to about 99%, about 94% to about 99.5%, about 94% to about 100%, about 95% to about 96%, about 95% to about 97%, about 95% to about 98%, about 95% to about 99%, about 95% to about 99.5%, about 95% to about 100%, about 96% to about 97%, about 96% to about 98%, about 96% to about 99%, about 96% to about 99.5%, about 96% to about 100%, about 97% to about 98%, about 97% to about 99%, about 97% to about 99.5%, about 97% to about 100%, about 98% to about 99%, about 98% to about 99.5%, about 98% to about 100%, about 99% to about 99.5%, about 99% to about 100%, or about 99.5% to about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at least about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 99.5%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a sensitivity of at most about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%.
[0385] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80% to about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80% to about 85%, about 80% to about 90%, about 80% to about 92%, about 80% to about 94%, about 80% to about 95%, about 80% to about 96%, about 80% to about 97%, about 80% to about 98%, about 80% to about 99%, about 80% to about 99.5%, about 80% to about 100%, about 85% to about 90%, about 85% to about 92%, about 85% to about 94%, about 85% to about 95%, about 85% to about 96%, about 85% to about 97%, about 85% to about 98%, about 85% to about 99%, about 85% to about 99.5%, about 85% to about 100%, about 90% to about 92%, about 90% to about 94%, about 90% to about 95%, about 90% to about 96%, about 90% to about 97%, about 90% to about 98%, about 90% to about 99%, about 90% to about 99.5%, about 90% to about 100%, about 92% to about 94%, about 92% to about 95%, about 92% to about 96%, about 92% to about 97%, about 92% to about 98%, about 92% to about 99%, about 92% to about 99.5%, about 92% to about 100%, about 94% to about 95%, about 94% to about 96%, about 94% to about 97%, about 94% to about 98%, about 94% to about 99%, about 94% to about 99.5%, about 94% to about 100%, about 95% to about 96%, about 95% to about 97%, about 95% to about 98%, about 95% to about 99%, about 95% to about 99.5%, about 95% to about 100%, about 96% to about 97%, about 96% to about 98%, about 96% to about 99%, about 96% to about 99.5%, about 96% to about 100%, about 97% to about 98%, about 97% to about 99%, about 97% to about 99.5%, about 97% to about 100%, about 98% to about 99%, about 98% to about 99.5%, about 98% to about 100%, about 99% to about 99.5%, about 99% to about 100%, or about 99.5% to about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at least about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 99.5%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a specificity of at most about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%.
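By way of non-limiting illustration, the accuracy, sensitivity, and specificity recited above, together with the positive and negative predictive values recited in the paragraphs that follow, can each be computed from the counts of true and false classifications on a labeled evaluation set. The sketch below uses the standard confusion-matrix definitions. The label encoding (1 for malignant, 0 for benign) and the function name are assumptions made for this example and are not taken from the disclosure.

```python
from typing import Dict, Sequence

MALIGNANT, BENIGN = 1, 0  # hypothetical label encoding


def classification_metrics(y_true: Sequence[int],
                           y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute standard performance measures for a benign/malignant classifier."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == MALIGNANT and p == MALIGNANT)  # true positives
    tn = sum(1 for t, p in pairs if t == BENIGN and p == BENIGN)        # true negatives
    fp = sum(1 for t, p in pairs if t == BENIGN and p == MALIGNANT)     # false positives
    fn = sum(1 for t, p in pairs if t == MALIGNANT and p == BENIGN)     # false negatives

    def ratio(numerator: int, denominator: int) -> float:
        # A measure is undefined (NaN) when its denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # malignant tumors correctly identified
        "specificity": ratio(tn, tn + fp),  # benign tumors correctly identified
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }
```

The predictive values depend on the proportion of malignant tumors in the evaluation set, whereas sensitivity and specificity do not, so the five measures are best reported together for a given cohort.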
[0386] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80% to about 100%.
In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80% to about 85%, about 80% to about 90%, about 80% to about 92%, about 80% to about 94%, about 80% to about 95%, about 80% to about 96%, about 80% to about 97%, about 80% to about 98%, about 80% to about 99%, about 80% to about 99.5%, about 80% to about 100%, about 85% to about 90%, about 85% to about 92%, about 85% to about 94%, about 85% to about 95%, about 85% to about 96%, about 85% to about 97%, about 85% to about 98%, about 85% to about 99%, about 85% to about 99.5%, about 85% to about 100%, about 90% to about 92%, about 90% to about 94%, about 90% to about 95%, about 90% to about 96%, about 90% to about 97%, about 90% to about 98%, about 90% to about 99%, about 90% to about 99.5%, about 90% to about 100%, about 92% to about 94%, about 92% to about 95%, about 92% to about 96%, about 92% to about 97%, about 92% to about 98%, about 92% to about 99%, about 92% to about 99.5%, about 92% to about 100%, about 94% to about 95%, about 94% to about 96%, about 94% to about 97%, about 94% to about 98%, about 94% to about 99%, about 94% to about 99.5%, about 94% to about 100%, about 95% to about 96%, about 95% to about 97%, about 95% to about 98%, about 95% to about 99%, about 95% to about 99.5%, about 95% to about 100%, about 96% to about 97%, about 96% to about 98%, about 96% to about 99%, about 96% to about 99.5%, about 96% to about 100%, about 97% to about 98%, about 97% to about 99%, about 97% to about 99.5%, about 97% to about 100%, about 98% to about 99%, about 98% to about 99.5%, about 98% to about 100%, about 99% to about 99.5%, about 99% to about 100%, or about 99.5% to about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, about 99.5%, or about 100%. In some embodiments, the machine learning model of step b”’, infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at least about 80%, about 85%, about 90%, about 92%, about 94%, about 95%, about 96%, about 97%, about 98%, about 99%, or about 99.5%.
In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a positive predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. [0387] The machine learning model of step b”’, can infer whether the data set is indicative of the patient having or not having cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of about 80 % to about 100 %. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a
negative predictive value of about 80 % to about 85 %, about 80 % to about 90 %, about 80 % to about 92 %, about 80 % to about 94 %, about 80 % to about 95 %, about 80 % to about 96 %, about 80 % to about 97 %, about 80 % to about 98 %, about 80 % to about 99 %, about 80 % to about 99.5 %, about 80 % to about 100 %, about 85 % to about 90 %, about 85 % to about 92 %, about 85 % to about 94 %, about 85 % to about 95 %, about 85 % to about 96 %, about 85 % to about 97 %, about 85 % to about 98 %, about 85 % to about 99 %, about 85 % to about 99.5 %, about 85 % to about 100 %, about 90 % to about 92 %, about 90 % to about 94 %, about 90 % to about 95 %, about 90 % to about 96 %, about 90 % to about 97 %, about 90 % to about 98 %, about 90 % to about 99 %, about 90 % to about 99.5 %, about 90 % to about 100 %, about 92 % to about 94 %, about 92 % to about 95 %, about 92 % to about 96 %, about 92 % to about 97 %, about 92 % to about 98 %, about 92 % to about 99 %, about 92 % to about 99.5 %, about 92 % to about 100 %, about 94 % to about 95 %, about 94 % to about 96 %, about 94 % to about 97 %, about 94 % to about 98 %, about 94 % to about 99 %, about 94 % to about 99.5 %, about 94 % to about 100 %, about 95 % to about 96 %, about 95 % to about 97 %, about 95 % to about 98 %, about 95 % to about 99 %, about 95 % to about 99.5 %, about 95 % to about 100 %, about 96 % to about 97 %, about 96 % to about 98 %, about 96 % to about 99 %, about 96 % to about 99.5 %, about 96 % to about 100 %, about 97 % to about 98 %, about 97 % to about 99 %, about 97 % to about 99.5 %, about 97 % to about 100 %, about 98 % to about 99 %, about 98 % to about 99.5 %, about 98 % to about 100 %, about 99 % to about 99.5 %, about 99 % to about 100 %, or about 99.5 % to about 100 %.
In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %. In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at least about 80 %, about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, or about 99.5 %. In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a negative predictive value of at most about 85 %, about 90 %, about 92 %, about 94 %, about 95 %, about 96 %, about 97 %, about 98 %, about 99 %, about 99.5 %, or about 100 %.
[0388] The machine learning model of step b”’ can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99.
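For readers who want to see how an AUC value of the kind recited above might be obtained, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc`, the label encoding, and the example scores are hypothetical; the disclosure does not prescribe a particular way of computing the AUC.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Assumed encoding (illustrative): 1 = cancer / malignant, 0 = no cancer / benign.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical model scores, not data from the disclosure.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small validation cohort; a sort-based implementation would be preferable for large ones.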
In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8 to about 1. In some embodiments, the machine learning model of step b”’ infers whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8 to about 0.85, about 0.8 to about 0.9, about 0.8 to about 0.92, about 0.8 to about 0.94, about 0.8 to about 0.95, about 0.8 to about 0.96, about 0.8 to about 0.97, about 0.8 to about 0.98, about 0.8 to about 0.99, about 0.8 to about 0.995, about 0.8 to about 1, about 0.85 to about 0.9, about 0.85 to about 0.92, about 0.85 to about 0.94, about 0.85 to about 0.95, about 0.85 to about 0.96, about 0.85 to about 0.97, about 0.85 to about 0.98, about 0.85 to about 0.99, about 0.85 to about 0.995, about 0.85 to about 1, about 0.9 to about 0.92, about 0.9 to about 0.94, about 0.9 to about 0.95, about 0.9 to about 0.96, about 0.9 to about 0.97, about 0.9 to about 0.98, about 0.9 to about 0.99, about 0.9 to about 0.995, about 0.9 to about 1, about 0.92 to about 0.94, about 0.92 to about 0.95, about 0.92 to about 0.96, about 0.92 to about 0.97, about 0.92 to about 0.98, about 0.92 to about 0.99, about 0.92 to about 0.995, about 0.92 to about 1, about 0.94 to about 0.95, about 0.94 to about 0.96, about 0.94 to about 0.97, about 0.94 to about 0.98, about 0.94 to about 0.99, about 0.94 to about 0.995, about 0.94 to about 1, about 0.95 to about 0.96, about 0.95 to about
0.97, about 0.95 to about 0.98, about 0.95 to about 0.99, about 0.95 to about 0.995, about 0.95 to about 1, about 0.96 to about 0.97, about 0.96 to about 0.98, about 0.96 to about 0.99, about 0.96 to about 0.995, about 0.96 to about 1, about 0.97 to about 0.98, about 0.97 to about 0.99, about 0.97 to about 0.995, about 0.97 to about 1, about 0.98 to about 0.99, about 0.98 to about 0.995, about 0.98 to about 1, about 0.99 to about 0.995, about 0.99 to about 1, or about 0.995 to about 1. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of at least about 0.8, about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, or about 0.995. In some embodiments, the machine learning model of step b”’, infer whether the data set is indicative of the patient having the malignant solid tumor or benign solid tumor with a ROC curve with an AUC of at most about 0.85, about 0.9, about 0.92, about 0.94, about 0.95, about 0.96, about 0.97, about 0.98, about 0.99, about 0.995, or about 1. [0389] In some embodiments, the treatment is configured to treat a cancer of the patient. In some embodiments, the treatment is configured to reduce a severity of a cancer of the patient. In some embodiments, the treatment is configured to reduce a risk of having a cancer of the patient. The treatment can include one or more treatments of cancer. The cancer can be lung, pancreatic, ovarian, kidney, or
brain cancer. In certain embodiments, the cancer is lung cancer. In certain embodiments, the cancer is pancreatic cancer. In certain embodiments, the cancer is ovarian cancer. In certain embodiments, the cancer is kidney cancer. In certain embodiments, the cancer is brain cancer. In some embodiments, the data set is indicative of the patient having lung cancer, and step d”’ can include administering to the patient a treatment for lung cancer. In some embodiments, the data set is indicative of the patient having pancreatic cancer, and step d”’ can include administering to the patient a treatment for pancreatic cancer. In some embodiments, the data set is indicative of the patient having ovarian cancer, and step d”’ can include administering to the patient a treatment for ovarian cancer. In some embodiments, the data set is indicative of the patient having kidney cancer, and step d”’ can include administering to the patient a treatment for kidney cancer. In some embodiments, the data set is indicative of the patient having brain cancer, and step d”’ can include administering to the patient a treatment for brain cancer. In some embodiments, the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof.
[0390] In an aspect, the present disclosure provides a method for assessing a solid tumor of a patient, for biopsy. The method can include, any one of, any combination of, or all of steps w, x, y and z. Step w, can include obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The data set can be obtained from assaying the biological sample.
Step x can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor. Step y can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor. Step z can include performing biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor or the benign solid tumor. In certain embodiments, step z can include performing biopsy of the solid tumor based on the solid tumor being classified as the malignant solid tumor. The decision to perform biopsy may depend on a confidence value of the inference. In certain embodiments, biopsy of the lung nodule of the patient is not performed. The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
[0391] The solid tumor can be a lung tumor, pancreatic tumor, kidney tumor, ovarian tumor, or a brain tumor. In certain embodiments, the solid tumor is a lung tumor. In certain embodiments, the solid tumor is a pancreatic tumor. In certain embodiments, the solid tumor is an ovarian tumor. In certain embodiments, the solid tumor is a kidney tumor. In certain embodiments, the solid tumor is a brain tumor.
[0392] In some embodiments, the solid tumor is a lung tumor, the data set of step w, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally
clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
[0393] In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range therebetween, genes selected from the group of genes listed in Table 1. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range therebetween, genes selected from the group of genes listed in Table 2. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range therebetween, genes selected from the group of genes listed in Table 3.
In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142, or any value or range there between, genes selected from the group of genes listed in Table 5. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of step w, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. 
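The "at least two genes selected from" requirement can be checked mechanically before a data set is passed to a model. The sketch below hard-codes the 31 gene symbols named in the preceding sentence and keeps only the measurements that belong to that panel. The function name `select_panel_measurements`, the dictionary layout of the expression data, and the example values are assumptions for illustration; the disclosure does not specify a data format.

```python
# The 31 gene symbols named in the text above.
PANEL_GENES = (
    "BCAT1", "CRCP", "COA4", "OVCA2", "POM121", "HLA-DPA1", "VPS37C",
    "MGST2", "RNF220", "HDAC3", "NFE2L1", "WDR20", "CNPY4", "HOXB2",
    "C6orf120", "TMEM8A", "ASAP1-IT2", "C15orf54", "CD101", "FNBP1",
    "TECR", "PROK2", "SLC35B3", "TDRD9", "CLHC1", "LPL", "IFITM3",
    "OGFOD3", "EIF2B3", "TMEM65", "MKRN3",
)


def select_panel_measurements(expression, panel=PANEL_GENES, minimum=2):
    """Keep only measurements for panel genes; require at least `minimum`.

    `expression` is assumed to map gene symbol -> expression value.
    """
    selected = {gene: expression[gene] for gene in panel if gene in expression}
    if len(selected) < minimum:
        raise ValueError(
            f"only {len(selected)} panel gene(s) measured; need at least {minimum}"
        )
    return selected


# Hypothetical expression values; GAPDH is not in the panel and is dropped.
sample = {"BCAT1": 5.2, "LPL": 1.1, "GAPDH": 9.8}
selected = select_panel_measurements(sample)
print(selected)
```

Raising an error when fewer than two panel genes are present mirrors the lower bound stated in the text; an embodiment requiring all 31 genes would simply pass `minimum=31`.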
In some embodiments, the solid tumor is a lung tumor, and the at least two lung disease-associated genes of the data set of step w, includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8. In some embodiments, the solid tumor is a lung
tumor, the at least two lung disease-associated genes of step w, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. In some embodiments, the solid tumor is a lung tumor, and the one or more clinical characteristics of the data set of step w, includes 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and one or more clinical characteristics of the data set of step w, includes size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the solid tumor is a lung tumor, and the data set of step w, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group of clinical characteristics listed in Table 6. In some embodiments, the solid tumor is a lung tumor, and the data set of step w, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof.
In some embodiments, the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the solid tumor is a lung tumor, the at least two lung disease-associated genes of step w, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. [0394] In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is a pancreatic tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. 
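Steps y and z of the biopsy-assessment method described earlier in this aspect (receiving the inference, then deciding on biopsy, where the decision may depend on a confidence value of the inference) can be sketched as follows. The `Inference` container, the function `recommend_biopsy`, and the 0.8 confidence cutoff are all hypothetical choices made for this illustration; the disclosure does not fix a threshold or a data structure.

```python
from dataclasses import dataclass


@dataclass
class Inference:
    """Model output for one solid tumor (illustrative structure)."""
    label: str         # "malignant" or "benign"
    confidence: float  # assumed to lie in [0, 1]


def recommend_biopsy(inference, min_confidence=0.8):
    """Recommend biopsy only for a sufficiently confident malignant call.

    The 0.8 default is an arbitrary placeholder, not a value from the disclosure.
    """
    if not 0.0 <= inference.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return inference.label == "malignant" and inference.confidence >= min_confidence


# Hypothetical inferences.
for inf in (Inference("malignant", 0.93), Inference("malignant", 0.55), Inference("benign", 0.99)):
    print(inf.label, inf.confidence, "-> biopsy" if recommend_biopsy(inf) else "-> no biopsy")
```

This sketch covers only the embodiment in which biopsy follows a malignant classification; how a low-confidence malignant call is handled (for example, follow-up imaging) is left open here, as it is in the text.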
In certain embodiments, the solid tumor is an ovarian tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is an ovarian tumor, and the data set of
step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the solid tumor is a kidney tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the solid tumor is a kidney tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a kidney tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer.
In certain embodiments, the solid tumor is a brain tumor, and the data set of step w, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. pancreatic, ovarian, kidney, brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. [0395] The biological sample can be blood sample, isolated peripheral blood mononuclear cells (PBMCs), solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is CSF sample, or any derivative thereof. [0396] The machine learning model of step x, can be a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant. The machine-learning model, e.g. of step x, can be trained according to the methods described herein, e.g. as of the machine learning model of step b. [0397] Certain aspects are directed to a method for determining cancer in a patient. 
The method can include, any one of, any combination of, or all of steps w’, x’, y’ and z’. Step w’ can include obtaining a
data set containing i) gene expression measurements of a biological sample obtained or derived from the patient, of at least two genes selected from the gene set capable of classifying a solid tumor of the patient as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient. The gene expression measurements can be obtained by assaying the biological sample. Step x’ can include providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of the patient having or not having cancer. Step y’ can include receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the patient having or not having cancer. Step z’ can include electronically outputting a report indicating the patient has, or does not have cancer. The gene expression measurement of the biological sample can be performed using any suitable technique, such as any suitable RNA quantification technique, including but not limited to RNA-seq, Ampli-seq, or the like.
[0398] The cancer can be lung, pancreatic, ovarian, kidney, or brain cancer, and the solid tumor can be a lung tumor, pancreatic tumor, ovarian tumor, kidney tumor, or a brain tumor, respectively. In certain embodiments, the cancer is lung cancer, and the solid tumor is a lung tumor. In certain embodiments, the cancer is pancreatic cancer, and the solid tumor is a pancreatic tumor. In certain embodiments, the cancer is ovarian cancer, and the solid tumor is an ovarian tumor. In certain embodiments, the cancer is kidney cancer, and the solid tumor is a kidney tumor. In certain embodiments, the cancer is brain cancer, and the solid tumor is a brain tumor.
In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least two lung disease-associated genes selected from the group of genes listed in any one or more of Tables 1, 2, 3, 4, 5, 7, and 8 and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, or 182, or any value or range there between, genes selected from the group of genes listed in Table 1. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, or 175, or any value or range there between, genes selected from the group of genes listed in Table 2. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60 or 62, or any value or range there between, genes selected from the group of genes listed in Table 3. 
In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, or 295, or any value or range there between, genes selected from the group of genes listed in Table 4. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, or 142 genes selected from the group of genes listed in Table 5. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 genes selected from the group of genes listed in Table 7. In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step w’, are selected from BCAT1, CRCP, COA4, OVCA2, POM121, HLA-DPA1, VPS37C, MGST2, RNF220, HDAC3, NFE2L1, WDR20, CNPY4, HOXB2, C6orf120, TMEM8A, ASAP1-IT2, C15orf54, CD101, FNBP1, TECR, PROK2, SLC35B3, TDRD9, CLHC1, LPL, IFITM3, OGFOD3, EIF2B3, TMEM65, and MKRN3. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of the data set of step w’, include at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or 21 genes selected from the group of genes listed in Table 8.
In some embodiments, the cancer is lung cancer, the at least two lung disease-associated genes of step w’, are selected from BCAT1, USP32P2, CD177, QPCT, SCAF4, SNRPD3, BCL9L, THBS1, SLC22A18AS, ARCN1, DHX16, SATB1, ST6GAL1, CXCL1, TDRD9, ZNF831, MTCH1, FAM86HP, DHX8, RNF114, and DCTN4. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step w’, include 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics selected from the group listed in Table 6. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step w’, include size of the nodule. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step w’, include age of the patient. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step w’, include presence of the nodule in the lung upper lobe. In some embodiments, the cancer is lung cancer, and the one or more clinical characteristics of the data set of step w’, include size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group of genes listed in Table 7, and ii) clinical characteristics data of 1, 2, 3, 4, 5, 6, 7 or 8 clinical characteristics of the patient selected from the group of clinical characteristics listed in Table 6.
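One way to picture a data set that combines gene expression measurements with the three clinical characteristics named above (size of the nodule, age of the patient, and presence of the nodule in the lung upper lobe) is as a single numeric feature vector. The sketch below is illustrative only: the function name `build_feature_vector`, the units (millimetres, years), and the 1.0/0.0 encoding of upper-lobe location are assumptions, not details given in the disclosure.

```python
def build_feature_vector(expression, gene_order, nodule_size_mm, age_years, in_upper_lobe):
    """Concatenate gene expression values with three clinical characteristics.

    `expression` maps gene symbol -> value; `gene_order` fixes the column order
    so that every patient's vector is laid out identically.
    """
    missing = [gene for gene in gene_order if gene not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {', '.join(missing)}")
    gene_features = [float(expression[gene]) for gene in gene_order]
    clinical_features = [
        float(nodule_size_mm),            # size of the nodule (assumed mm)
        float(age_years),                 # age of the patient (assumed years)
        1.0 if in_upper_lobe else 0.0,    # nodule located in the lung upper lobe
    ]
    return gene_features + clinical_features


# Hypothetical patient record using two of the gene symbols named in the text.
vector = build_feature_vector(
    expression={"BCAT1": 5.2, "LPL": 1.1},
    gene_order=["BCAT1", "LPL"],
    nodule_size_mm=14,
    age_years=67,
    in_upper_lobe=True,
)
print(vector)
```

Fixing `gene_order` explicitly matters because a trained model expects its inputs in the same column order that was used during training.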
In some embodiments, the cancer is lung cancer, and the data set of step w’, contains i) gene expression measurements of the biological sample of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, or 31 lung disease-associated genes selected from the group
of genes listed in Table 7, and ii) clinical characteristics data of the clinical characteristics of the patient selected from size of the nodule, age of the patient, presence of the nodule in the lung upper lobe, or any combination thereof. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, comprise the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, comprise the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In some embodiments, the cancer is lung cancer, and the at least two lung disease-associated genes of step w’, consist of the 31 genes listed in Table 7, and the one or more clinical characteristics of step w’, consist of the size of the nodule, age of the patient, and the presence of the nodule in the lung upper lobe. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer. In certain embodiments, the cancer is pancreatic cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a pancreatic tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer.
In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant. In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is ovarian cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying an ovarian tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to ovarian cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant. In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is kidney cancer, and the data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a kidney tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to kidney cancer. In certain embodiments, the cancer is brain cancer, and the data set of step w’, contains gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant. 
In certain embodiments, the cancer is brain cancer, and the data set of step w’, contains clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. In certain embodiments, the cancer is brain cancer, and the
data set of step w’, contains i) gene expression measurements of at least two genes selected from the gene set capable of classifying a brain tumor as benign or malignant, and ii) clinical characteristics data of one or more clinical characteristics selected from a group of clinical characteristics related to brain cancer. The gene set capable of classifying a solid tumor, e.g. a pancreatic, ovarian, kidney, or brain tumor, as benign or malignant, can be obtained or determined according to the methods (e.g. method of steps a’, b’, c’ and/or d’) described herein. [0399] The biological sample can be a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a solid tumor biopsy sample, nasal fluid, saliva, urine, stool, cerebrospinal fluid (CSF), or any derivative thereof. In some embodiments, the biological sample is a blood sample or any derivative thereof. In some embodiments, the biological sample is PBMCs or any derivative thereof. In some embodiments, the biological sample is a lung biopsy sample, or any derivative thereof. In some embodiments, the biological sample is a nasal fluid sample, or any derivative thereof. In some embodiments, the biological sample is a saliva sample, or any derivative thereof. In some embodiments, the biological sample is a urine sample, or any derivative thereof. In some embodiments, the biological sample is a stool sample, or any derivative thereof. In some embodiments, the biological sample is a CSF sample, or any derivative thereof.
[0400] The method can determine whether the patient has or does not have cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The method can determine whether the patient has or does not have cancer with a positive predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
The method can determine whether the patient has or does not have cancer with a negative predictive value of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least
about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. The machine learning model, e.g. of step x’, can infer whether the data set is indicative of the patient having or not having cancer with a receiver operating characteristic (ROC) curve with an AUC of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. [0401] The inference from the machine learning model can include a confidence value between 0 and 1, such as, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 or 1, or any value or range there between, that the patient has cancer. Higher confidence values may be correlated with a higher likelihood that the patient has cancer. [0402] The machine-learning model, e.g. of step x’, can generate an inference of whether the data set is indicative of the patient having a malignant solid tumor or a benign solid tumor, wherein the patient having a malignant solid tumor may indicate that the patient has cancer, and the patient having a benign solid tumor may indicate that the patient does not have cancer. The machine-learning model can be trained according to a method described herein, e.g. according to the methods of training of the machine learning model of step b.
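The relationship in paragraphs [0401] and [0402] between a confidence value and the malignant or benign inference can be illustrated with a minimal sketch. The function name and the default cut-off of 0.5 are assumptions made for illustration; the disclosure does not fix a particular threshold.

```python
def interpret_confidence(confidence: float, threshold: float = 0.5) -> str:
    """Map a model confidence in [0, 1] to a solid tumor classification.

    The threshold is a hypothetical, adjustable operating point; it is not
    a value taken from the disclosure.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    # Higher confidence corresponds to a higher likelihood of malignancy.
    if confidence >= threshold:
        return "malignant solid tumor"
    return "benign solid tumor"


# Example: the same confidence can fall on either side of a stricter cut-off.
label_default = interpret_confidence(0.6)
label_strict = interpret_confidence(0.6, threshold=0.7)
```

Raising the threshold trades sensitivity for specificity, which is why a practical system would likely expose it as a tunable parameter.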
[0403] In another aspect, the present disclosure provides a computer system for assessing a solid tumor of a subject, containing: a database or other suitable data storage system that is configured to store a dataset containing i) gene expression measurements of a biological sample obtained or derived from the subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; and one or more computer processors operatively coupled to the database, wherein the one or more computer processors are individually or collectively programmed to: (i) analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (ii) electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. Computer-implemented methods as described herein may be executed on computer systems such as those described above. For example, a computer system may comprise one or more processors and one or more memory units that collectively store computer-readable executable instructions that, as a result of execution, cause the one or more processors to collectively perform the programmed steps described above. A computer system as described herein may comprise an assay device communicatively coupled to a personal computer. The data set can be a data set (e.g. of step a) described herein. The biological sample can be a biological sample described herein. [0404] In some embodiments, the computer system further comprises an electronic display operatively coupled to the one or more computer processors, wherein the electronic display comprises a graphical user interface that is configured to display the report.
[0405] In another aspect, the present disclosure provides one or more non-transitory computer readable media collectively containing machine-executable code that, upon execution by one or more computer processors, causes the one or more computer processors to perform a method for assessing a solid tumor of a subject, the method containing: (a) obtaining a data set containing i) gene expression measurements of a biological sample obtained or derived from a subject, of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject; (b) analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor; and (c) electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. The data set can be a data set (e.g. of step a) described herein. The biological sample can be a biological sample described herein. [0406] FIG.10 illustrates an overview of an example method 1000 for assessing a solid tumor of a subject. The method 1000 may comprise assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from a gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the subject, as in operation 1002. The method 1000 may comprise analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, as in operation 1004.
The method 1000 may comprise electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor, as in operation 1006. The data set can be a data set (e.g. of step b) described herein. [0407] Methods of the present disclosure may comprise applying a trained machine learning algorithm to gene expression data (e.g., acquired by RNA-Seq, Ampli-seq, or the like) and optionally clinical characteristics data of a subject, to assess a solid tumor of the subject. The trained machine learning algorithm may comprise a machine learning based classifier, configured to process the gene expression data and optionally clinical characteristics data to assess the solid tumor (e.g., determine whether a solid tumor is malignant or benign). The machine learning classifier may be trained using clinical datasets, e.g. reference data sets from one or more cohorts of subjects, e.g., using gene expression data and optionally clinical health data, e.g. clinical characteristics data of the subjects as inputs and known clinical health outcomes (e.g., a solid tumor that is malignant or benign) of the subjects as outputs to the machine learning classifier. [0408] The machine learning classifier may comprise one or more machine learning algorithms. Examples of machine learning algorithms may include a linear regression, a logistic regression (LOG), a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB) or any
combination thereof, or another supervised learning algorithm or unsupervised learning algorithm for classification and regression. The machine learning classifier may be trained using one or more reference datasets corresponding to subject data (e.g., gene expression data and optionally clinical health data). [0409] Reference datasets used for training machine learning classifiers may be generated from, for example, one or more cohorts of patients having common clinical characteristics (features) and clinical outcomes (labels). Reference datasets may comprise a set of features and labels corresponding to the features. Features may correspond to algorithm inputs comprising subject data (e.g., gene expression data and optionally clinical health data, e.g. clinical characteristics data). Features may comprise clinical characteristics such as, for example, certain ranges, categories, or levels of gene expression data and optionally clinical health data. Features may comprise subject information such as patient age, patient medical history, other medical conditions, current or past medications, size of the nodule, presence of the nodule in the lung upper lobe and/or time since the last observation. For example, a set of features collected from a given patient at a given time point may collectively serve as a signature, which may be indicative of clinical health outcomes (e.g., a solid tumor that is malignant or benign) of the subject at the given time point.
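The pairing of features and labels in a reference dataset, as described in paragraph [0409], might be represented as in the following sketch. The class name, field names, the NODULE_SIZE_MM key, and all numeric values are hypothetical and serve only to illustrate the structure; BCAT1 and CRCP are used as example genes because they appear in the gene lists above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ReferenceSample:
    """One subject in a reference dataset (all field names are hypothetical)."""
    gene_expression: Dict[str, float]  # gene symbol -> expression measurement
    clinical: Dict[str, float]         # clinical characteristic -> numeric value
    label: str                         # known outcome: "malignant" or "benign"


def to_features_and_labels(
    samples: List[ReferenceSample], genes: List[str], clinical_keys: List[str]
) -> Tuple[List[List[float]], List[int]]:
    """Flatten samples into a feature matrix X and a label vector y."""
    X = [
        [s.gene_expression[g] for g in genes] + [s.clinical[k] for k in clinical_keys]
        for s in samples
    ]
    # Encode the clinical outcome as 1 (malignant) or 0 (benign).
    y = [1 if s.label == "malignant" else 0 for s in samples]
    return X, y


# Illustrative values only; they are not measurements from the disclosure.
cohort = [
    ReferenceSample({"BCAT1": 5.2, "CRCP": 3.1}, {"AGE": 67, "NODULE_SIZE_MM": 18}, "malignant"),
    ReferenceSample({"BCAT1": 2.4, "CRCP": 3.8}, {"AGE": 54, "NODULE_SIZE_MM": 6}, "benign"),
]
X, y = to_features_and_labels(cohort, ["BCAT1", "CRCP"], ["AGE", "NODULE_SIZE_MM"])
```

Each row of X then serves as the signature of one subject, and the matching entry of y is the outcome the classifier is trained to reproduce.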
[0410] For example, ranges of subject data (e.g., gene expression data and optionally clinical health data) may be expressed as a plurality of disjoint continuous ranges of continuous measurement values, and categories of subject data (e.g., gene expression data and/or clinical health data) may be expressed as a plurality of disjoint sets of measurement values (e.g., {“high”, “low”}, {“high”, “normal”}, {“low”, “normal”}, {“high”, “borderline high”, “normal”, “low”}, {“Yes”, “No”}, {“Present”, “Absent”}, etc.). Clinical characteristics may also include clinical labels indicating the subject’s health history, such as a diagnosis of a disease or disorder, a previous administering of a clinical treatment (e.g., a drug, a surgical treatment, chemotherapy, radiotherapy, immunotherapy, etc.), behavioral factors, or other health status (e.g., hypertension or high blood pressure, hyperglycemia or high blood glucose, hypercholesterolemia or high blood cholesterol, history of allergic reaction or other adverse reaction, etc.). Clinical characteristics data for the clinical characteristic, AGE, of the patient can be age of the patient. Clinical characteristics data for the clinical characteristic, SEX, of the patient can be sex of the patient. Clinical characteristics data for the clinical characteristic, presence of the nodule in the lung upper lobe (NCNUPYN), of the patient can be yes or no. Clinical characteristics data for the clinical characteristic, smoking status (MHTBSTAT), of the patient can be past or current. Clinical characteristics data for the clinical characteristic, chronic obstructive pulmonary disease (MHCPDYN), of the patient can be yes or no. Clinical characteristics data for the clinical characteristic, lung nodule spiculated (NCNMYN), of the patient can be yes or no. Clinical characteristics data for the clinical characteristic, emphysema (MHEMPYN), of the patient can be yes or no.
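The categorical and yes/no clinical characteristics described above need a numeric form before a classifier can consume them. The sketch below shows one possible encoding, along with a helper that bins a continuous value into disjoint categories. The numeric codes, the female/male values assumed for SEX, and the cut-off parameters are assumptions for illustration; only the field codes (AGE, SEX, NCNUPYN, MHTBSTAT, MHCPDYN, NCNMYN, MHEMPYN) come from the text.

```python
from typing import Dict, List

# Hypothetical numeric codes for the categorical values named in the text.
YES_NO = {"yes": 1.0, "no": 0.0}
SMOKING_STATUS = {"past": 0.0, "current": 1.0}
SEX = {"female": 0.0, "male": 1.0}  # assumed values; the text does not list them


def categorize(value: float, low_cutoff: float, high_cutoff: float) -> str:
    """Bin a continuous measurement into disjoint ranges: low, normal, high."""
    if value < low_cutoff:
        return "low"
    if value > high_cutoff:
        return "high"
    return "normal"


def encode_clinical(record: Dict[str, object]) -> List[float]:
    """Turn one patient's clinical characteristics into a numeric vector."""
    return [
        float(record["AGE"]),
        SEX[str(record["SEX"]).lower()],
        YES_NO[str(record["NCNUPYN"]).lower()],           # nodule in upper lobe
        SMOKING_STATUS[str(record["MHTBSTAT"]).lower()],  # smoking status
        YES_NO[str(record["MHCPDYN"]).lower()],           # COPD
        YES_NO[str(record["NCNMYN"]).lower()],            # nodule spiculated
        YES_NO[str(record["MHEMPYN"]).lower()],           # emphysema
    ]


# Illustrative patient record; the values are invented for the example.
patient = {"AGE": 63, "SEX": "Female", "NCNUPYN": "Yes", "MHTBSTAT": "current",
           "MHCPDYN": "No", "NCNMYN": "Yes", "MHEMPYN": "No"}
vector = encode_clinical(patient)
```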
Labels may comprise clinical outcomes such as, for example, a solid tumor that is malignant or benign. [0411] The machine learning classifier algorithm may process the input features to generate output values comprising one or more classifications, one or more predictions, or a combination thereof. For
example, such classifications or predictions may include a binary classification of a solid tumor, a classification between a group of categorical labels (e.g., ‘malignant solid tumor’ and ‘benign solid tumor’), a likelihood (e.g., relative likelihood or probability) of having a malignant solid tumor or benign solid tumor, and a confidence interval for any numeric predictions. Various machine learning techniques may be cascaded such that the output of a machine learning technique may also be used as input features to subsequent layers or subsections of the machine learning classifier. [0412] In order to train the machine learning classifier model (e.g., by determining weights and correlations of the model) to generate real-time classifications or predictions, the model can be trained using reference datasets. Such datasets may be sufficiently large to generate statistically significant classifications or predictions. In some cases, datasets are annotated or labeled. [0413] Datasets may be split into subsets (e.g., discrete or overlapping), such as a training dataset, a development dataset, and a test dataset. For example, a dataset may be split into a training dataset comprising 80% of the dataset, a development dataset comprising 10% of the dataset, and a test dataset comprising 10% of the dataset. The training dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset. The development dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset. The test dataset may comprise about 10%, about 20%, about 30%, about 40%, about 50%, about 60%, about 70%, about 80%, or about 90% of the dataset. Training sets (e.g., training datasets) may be selected by random sampling of a set of data corresponding to one or more patient cohorts to ensure independence of sampling.
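The 80%/10%/10% split by random sampling described in paragraph [0413] can be sketched as follows. The function name, the fixed seed, and the use of integer stand-ins for patient records are assumptions for illustration; the subsets produced here are discrete (non-overlapping).

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    records: Sequence[T], train_frac: float = 0.8, dev_frac: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Randomly split records into training, development, and test subsets.

    Whatever remains after the training and development portions becomes
    the test subset.
    """
    if train_frac <= 0 or dev_frac < 0 or train_frac + dev_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test subset")
    shuffled = list(records)
    # A seeded generator keeps the random sampling reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    held_out = shuffled[n_train + n_dev:]
    return train, dev, held_out


# 100 placeholder record identifiers split 80 / 10 / 10.
train, dev, held_out = split_dataset(range(100))
```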
Alternatively, training sets (e.g., training datasets) may be selected by proportionate sampling of a set of data corresponding to one or more patient cohorts to ensure independence of sampling. [0414] Reference datasets may be split into subsets (e.g., discrete or overlapping), such as a training dataset, and a validation dataset. For example, a reference dataset may be split into a training dataset containing 80% of the dataset, and a validation dataset containing 20% of the dataset. The training dataset may contain 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 95%, or any values or ranges there between, of the reference dataset. The validation dataset may contain 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or 95%, or any values or ranges there between, of the reference dataset. Cross validation with 2, 2.5, 5 or 10 folds, or any values or ranges there between, can be used. [0415] To validate the performance of the machine learning classifier model, different performance metrics may be generated. For example, an area under the receiver operating characteristic curve (AUROC) may be used to determine the diagnostic capability of the machine learning classifier. For example, the machine learning classifier may use classification thresholds which are adjustable, such that specificity and sensitivity are tunable, and the receiver operating characteristic (ROC) curve can be used to identify the different operating points corresponding to different values of specificity and sensitivity.
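The idea in paragraph [0415] that each adjustable threshold yields one operating point on the ROC curve can be illustrated with a small sketch that sweeps the thresholds and integrates the curve with the trapezoidal rule. The function names and the four example scores are assumptions for illustration; a production system would more likely rely on an established statistics library.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) for each threshold.

    labels use 1 for malignant and 0 for benign; a tumor is called malignant
    when its score is at or above the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing called malignant
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        # tp / positives is sensitivity; fp / negatives is 1 - specificity.
        points.append((fp / negatives, tp / positives))
    return points


def auroc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative scores and known outcomes (invented for the example).
example_points = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
example_auc = auroc(example_points)  # 0.75 for this toy input
```

Each tuple in example_points is one candidate operating point; choosing among them is how sensitivity and specificity are traded against each other.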
[0416] In some cases, such as when datasets are not sufficiently large, cross-validation may be performed to assess the robustness of a machine learning classifier model across different training and testing datasets. [0417] To calculate performance metrics such as sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), AUPRC, AUROC, or similar, the following definitions may be used. A “false positive” may refer to an outcome in which a solid tumor of a subject is incorrectly classified as a malignant solid tumor. A “true positive” may refer to an outcome in which a solid tumor of a subject is correctly classified as a malignant solid tumor. A “false negative” may refer to an outcome in which a solid tumor of a subject is incorrectly classified as a benign solid tumor. A “true negative” may refer to an outcome in which a solid tumor of a subject is correctly classified as a benign solid tumor. [0418] The machine learning classifier may be trained until certain predetermined conditions for accuracy or performance are satisfied, such as having minimum desired values corresponding to diagnostic accuracy measures. For example, the diagnostic accuracy measure may correspond to prediction of a likelihood of a solid tumor being malignant or benign. Examples of diagnostic accuracy measures may include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, area under the precision-recall curve (AUPRC), and area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) corresponding to the diagnostic accuracy of determining whether a solid tumor is malignant or benign.
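Using the definitions of true and false positives and negatives in paragraph [0417], with malignant as the positive class, the standard diagnostic accuracy measures follow directly from the four counts. The sketch below is a minimal illustration; the function names and the five example predictions are assumptions, and a ratio with a zero denominator is reported as None rather than guessed.

```python
from typing import Dict, Optional, Sequence


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def diagnostic_metrics(
    predicted: Sequence[str], actual: Sequence[str], positive: str = "malignant"
) -> Dict[str, Optional[float]]:
    """Compute sensitivity, specificity, PPV, NPV, and accuracy."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)  # true positives
    fp = sum(1 for p, a in pairs if p == positive and a != positive)  # false positives
    fn = sum(1 for p, a in pairs if p != positive and a == positive)  # false negatives
    tn = sum(1 for p, a in pairs if p != positive and a != positive)  # true negatives
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, len(pairs)),
    }


# Illustrative classifications versus known outcomes (invented for the example).
predicted = ["malignant", "malignant", "benign", "benign", "malignant"]
actual = ["malignant", "benign", "benign", "malignant", "malignant"]
metrics = diagnostic_metrics(predicted, actual)
```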
[0419] For example, such a predetermined condition may be that the sensitivity of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0420] As another example, such a predetermined condition may be that the specificity of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0421] As another example, such a predetermined condition may be that the positive predictive value (PPV) of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0422] As another example, such a predetermined condition may be that the negative predictive value (NPV) of determining whether a solid tumor is malignant or benign comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least
about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0423] As another example, such a predetermined condition may be that the area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) of determining whether a solid tumor is malignant or benign comprises a value of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99. [0424] As another example, such a predetermined condition may be that the area under the precision-recall curve (AUPRC) of determining whether a solid tumor is malignant or benign comprises a value of at least about 0.10, at least about 0.15, at least about 0.20, at least about 0.25, at least about 0.30, at least about 0.35, at least about 0.40, at least about 0.45, at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99. [0425] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a sensitivity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
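The predetermined conditions of paragraphs [0418] to [0424], under which training continues until each diagnostic accuracy measure reaches a minimum desired value, can be expressed as a simple check. The function names and the particular minimum values below are assumptions chosen for illustration; the disclosure lists many admissible minimums rather than a single set.

```python
from typing import Dict, List


def unmet_conditions(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return the names of measures that are missing or below their minimum."""
    return [
        name
        for name, floor in minimums.items()
        if metrics.get(name, float("-inf")) < floor
    ]


def training_may_stop(metrics: Dict[str, float], minimums: Dict[str, float]) -> bool:
    """True once every predetermined condition is satisfied."""
    return not unmet_conditions(metrics, minimums)


# Hypothetical minimum desired values and two evaluation rounds.
minimums = {"sensitivity": 0.90, "specificity": 0.80, "auroc": 0.85}
round_one = {"sensitivity": 0.93, "specificity": 0.78, "auroc": 0.88}
round_two = {"sensitivity": 0.93, "specificity": 0.81, "auroc": 0.88}

still_failing = unmet_conditions(round_one, minimums)
```

Reporting which conditions remain unmet, rather than a bare pass or fail, makes it easier to see which measure is holding training back.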
[0426] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a specificity of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0427] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a positive predictive value (PPV) of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%. [0428] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with a negative predictive value (NPV) of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
[0429] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with an area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve (AUROC) of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99. [0430] In some embodiments, the trained classifier may be trained or configured to determine whether a solid tumor is malignant or benign with an area under the precision-recall curve (AUPRC) of at least about 0.10, at least about 0.15, at least about 0.20, at least about 0.25, at least about 0.30, at least about 0.35, at least about 0.40, at least about 0.45, at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99. [0431] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG.11 shows a computer system 1101 that is programmed or otherwise configured to implement methods provided herein.
[0432] The computer system 1101 can regulate various aspects of the present disclosure, such as, for example, assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. The computer system 1101 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device. [0433] The computer system 1101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 1105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 1101 also includes memory or memory location 1110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 1115 (e.g., hard disk), communication interface 1120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 1125, such as cache, other memory, data storage and/or electronic display adapters. The memory 1110, storage unit 1115, interface 1120 and peripheral devices 1125 are in communication with the CPU 1105 through a communication bus (solid lines), such as a motherboard. The storage unit 1115 can be a data storage unit (or data repository) for storing data. The computer system 1101 can be operatively coupled to a computer network (“network”) 1130 with the aid of the
communication interface 1120. The network 1130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. [0434] The network 1130 in some cases is a telecommunication and/or data network. The network 1130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. For example, one or more computer servers may enable cloud computing over the network 1130 (“the cloud”) to perform various aspects of analysis, calculation, and generation of the present disclosure, such as, for example, assaying a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyzing the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically outputting a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. Such cloud computing may be provided by cloud computing platforms such as, for example, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM cloud. The network 1130, in some cases with the aid of the computer system 1101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 1101 to behave as a client or a server. [0435] The CPU 1105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 1110. 
The instructions can be directed to the CPU 1105, which can subsequently program or otherwise configure the CPU 1105 to implement methods of the present disclosure. Examples of operations performed by the CPU 1105 can include fetch, decode, execute, and writeback. [0436] The CPU 1105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 1101 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC). [0437] The storage unit 1115 can store files, such as drivers, libraries and saved programs. The storage unit 1115 can store user data, e.g., user preferences and user programs. The computer system 1101 in some cases can include one or more additional data storage units that are external to the computer system 1101, such as located on a remote server that is in communication with the computer system 1101 through an intranet or the Internet. [0438] The computer system 1101 can communicate with one or more remote computer systems through the network 1130. For instance, the computer system 1101 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 1101 via the network 1130.
[0439] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 1101, such as, for example, on the memory 1110 or electronic storage unit 1115. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 1105. In some cases, the code can be retrieved from the storage unit 1115 and stored on the memory 1110 for ready access by the processor 1105. In some situations, the electronic storage unit 1115 can be precluded, and machine-executable instructions are stored on memory 1110. [0440] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion. [0441] Aspects of the systems and methods provided herein, such as the computer system 1101, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. 
Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution. [0442] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated
during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution. [0443] The computer system 1101 can include or be in communication with an electronic display 1135 that comprises a user interface (UI) 1140. Examples of user interfaces (UIs) include, without limitation, a graphical user interface (GUI) and web-based user interface. For example, the computer system can include a graphical user interface (GUI) configured to display, for example, subject data, identification of a solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and/or predictions or assessments generated from subject data. [0444] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 1105. 
The algorithm can, for example, assay a biological sample obtained or derived from the subject to produce a data set containing i) gene expression measurements of the biological sample of at least two genes selected from the gene set capable of classifying a solid tumor as benign or malignant, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, analyze the data set to classify the solid tumor of the subject as a malignant solid tumor or a benign solid tumor, and electronically output a report indicative of the classification of the solid tumor of the subject as the malignant solid tumor or the benign solid tumor. Illustrative Embodiments [0445] The present disclosure provides the following illustrative embodiments. [0446] Embodiment 1. A method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the method comprising: a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof;
b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics; c) determining feature importance values of the plurality of genes; and d) determining the gene set based at least in part on the feature importance values. [0447] Embodiment 2. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. [0448] Embodiment 3. The method of embodiment 1, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer. [0449] Embodiment 4. The method of embodiment 1, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer. [0450] Embodiment 5. The method of embodiment 1, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer. [0451] Embodiment 6. The method of embodiment 1, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer. [0452] Embodiment 7. The method of embodiment 1 or 3, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer. [0453] Embodiment 8. The method of embodiment 1 or 4, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. [0454] Embodiment 9. 
The method of embodiment 1 or 5, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. [0455] Embodiment 10. The method of embodiment 1 or 6, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0456] Embodiment 11. A method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid
tumor, ii) data regarding whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics; (c) determining feature importance values of the one or more predictors of the first machine learning model; (d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and (e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors. [0457] Embodiment 12. The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. [0458] Embodiment 13. 
The method of embodiment 11, wherein the solid tumor is a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer. [0459] Embodiment 14. The method of embodiment 11, wherein the solid tumor is an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer. [0460] Embodiment 15. The method of embodiment 11, wherein the solid tumor is a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer. [0461] Embodiment 16. The method of embodiment 11, wherein the solid tumor is a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer. [0462] Embodiment 17. The method of embodiment 11 or 13, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer.
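By way of non-limiting illustration, steps (a) through (e) of embodiment 11 may be sketched as follows. The sketch substitutes a deliberately simple stand-in for the machine learning models (a per-gene linear scorer whose weight is the standardized gap between the malignant and benign class means, with the absolute weight used as the feature importance value); the disclosure does not prescribe this model. The gene names, the toy reference data, and all function names are hypothetical, and A is set to 5, the lower end of the recited range.

```python
from statistics import mean, pstdev


def train_model(X, y, features):
    """Fit one weight per feature: the standardized gap between class means."""
    model = {}
    for f in features:
        mal = [row[f] for row, label in zip(X, y) if label == 1]
        ben = [row[f] for row, label in zip(X, y) if label == 0]
        spread = pstdev([row[f] for row in X]) or 1.0  # guard constant features
        model[f] = {
            "weight": (mean(mal) - mean(ben)) / spread,
            "midpoint": (mean(mal) + mean(ben)) / 2.0,
        }
    return model


def predict(model, row):
    """Return 1 (malignant) when the weighted evidence is positive, else 0."""
    score = sum(p["weight"] * (row[f] - p["midpoint"]) for f, p in model.items())
    return 1 if score > 0 else 0


def feature_importance(model):
    """Step (c): importance of each predictor in the first model."""
    return {f: abs(p["weight"]) for f, p in model.items()}


def select_top(importance, a):
    """Step (d): keep the A predictors with the largest importance values."""
    return sorted(importance, key=importance.get, reverse=True)[:a]


def toy_row(label, j):
    """Hypothetical sample: GENE_1..GENE_5 informative, GENE_6..GENE_8 noise."""
    row = {f"GENE_{i}": (10 + i + j) if label == 1 else (1 + j) for i in range(1, 6)}
    noise = [5, 6, 7] if label == 1 else [5, 7, 6]
    row.update({f"GENE_{i}": noise[j] for i in range(6, 9)})
    return row


# Step (a): first reference data set (three malignant, three benign samples).
y_ref = [1, 1, 1, 0, 0, 0]
X_ref = [toy_row(label, j % 3) for j, label in enumerate(y_ref)]
all_genes = sorted(X_ref[0])

first_model = train_model(X_ref, y_ref, all_genes)        # step (b)
importance = feature_importance(first_model)              # step (c)
selected = select_top(importance, a=5)                    # step (d)
X_second = [{g: row[g] for g in selected} for row in X_ref]
final_model = train_model(X_second, y_ref, selected)      # step (e)
predictions = [predict(final_model, row) for row in X_second]
print(selected, predictions)
```

In practice the two models could be any of the learners listed in embodiment 28, and performance would be estimated on held-out samples rather than on the reference data used for training.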
[0463] Embodiment 18. The method of embodiment 11 or 14, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. [0464] Embodiment 19. The method of embodiment 11 or 15, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. [0465] Embodiment 20. The method of embodiment 11 or 16, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0466] Embodiment 21. The method of any one of embodiments 11 to 20, wherein the A predictors have top 5 to 200 feature importance values. [0467] Embodiment 22. The method of any one of embodiments 11 to 21, wherein the trained machine learning model has an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0468] Embodiment 23. The method of any one of embodiments 11 to 22, wherein the trained machine learning model has a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0469] Embodiment 24. 
The method of any one of embodiments 11 to 23, wherein the trained machine learning model has a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0470] Embodiment 25. The method of any one of embodiments 11 to 24, wherein the trained machine learning model has a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0471] Embodiment 26. The method of any one of embodiments 11 to 25, wherein the trained machine learning model has a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0472] Embodiment 27. The method of any one of embodiments 11 to 26, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least
about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. [0473] Embodiment 28. The method of any one of embodiments 11 to 27, wherein the first machine learning model and the second machine learning model are each independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof. [0474] Embodiment 29. A method for assessing a solid tumor of a patient, the method comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of embodiment 1, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant or the benign solid tumor. [0475] Embodiment 30. The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, ovarian tumor, kidney tumor, or brain tumor. [0476] Embodiment 31. 
The method of embodiment 29, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 3, or 7, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant. [0477] Embodiment 32. The method of embodiment 29, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 4, or 8, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant. [0478] Embodiment 33. The method of embodiment 29, wherein the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 5, or 9, wherein the gene set is capable of classifying the brain tumor as benign or malignant.
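By way of non-limiting illustration, the accuracy, sensitivity, specificity, positive predictive value, and negative predictive value recited in embodiments 22 to 26 may be derived from a confusion matrix as sketched below. The sketch assumes labels coded 1 for malignant and 0 for benign; the function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute the five recited metrics from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # A metric is undefined (None) when its denominator is zero.
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # malignant tumors correctly called
        "specificity": ratio(tn, tn + fp),   # benign tumors correctly called
        "ppv": ratio(tp, tp + fp),           # malignant calls that are correct
        "npv": ratio(tn, tn + fn),           # benign calls that are correct
    }


# Hypothetical evaluation set: four malignant and six benign tumors.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(true_labels, pred_labels)
print(metrics)
```

Sensitivity and specificity are properties of the classifier alone, whereas the positive and negative predictive values also depend on the prevalence of malignancy in the evaluated population.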
[0479] Embodiment 34. The method of embodiment 29, wherein the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set of any one of embodiments 1, 6, or 10, wherein the gene set is capable of classifying the kidney tumor as benign or malignant. [0480] Embodiment 35. The method of embodiment 29 or 31, wherein the solid tumor is a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer. [0481] Embodiment 36. The method of embodiment 29 or 32, wherein the solid tumor is an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. [0482] Embodiment 37. The method of embodiment 29 or 33, wherein the solid tumor is a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. [0483] Embodiment 38. The method of embodiment 29 or 34, wherein the solid tumor is a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. [0484] Embodiment 39. The method of any one of embodiments 29 to 38, wherein the machine-learning model is trained according to the method of any one of embodiments 11 to 28. [0485] Embodiment 40. The method of any one of embodiments 29 to 39, wherein the patient has cancer. [0486] Embodiment 41. The method of any one of embodiments 29 to 39, wherein the patient does not have cancer. [0487] Embodiment 42. The method of any one of embodiments 29 to 39, wherein the patient is at an elevated risk of having cancer. [0488] Embodiment 43. The method of any one of embodiments 29 to 39 and 42, wherein the patient is asymptomatic for cancer. [0489] Embodiment 44. The method of any one of embodiments 40 to 43, wherein the cancer is pancreatic cancer, ovarian cancer or brain cancer. 
[0490] Embodiment 45. The method of any one of embodiments 29 to 44, further comprising administering a treatment based on the patient’s solid tumor being classified as malignant. [0491] Embodiment 46. The method of embodiment 45, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. [0492] Embodiment 47. The method of any one of embodiments 29 to 46, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant.
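By way of non-limiting illustration, an inference that includes a confidence value between 0 and 1, as in embodiment 47, together with the electronically output report of embodiment 29, may be sketched as follows. The sketch assumes a logistic model over gene expression values and a decision threshold of 0.5; the weights, the gene names, the patient identifier, and the report fields are all hypothetical and are not taken from the disclosure.

```python
import json
import math


def malignancy_confidence(expression, weights, bias=0.0):
    """Map a weighted sum of expression values to a confidence in (0, 1)."""
    z = bias + sum(w * expression[gene] for gene, w in weights.items())
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def build_report(patient_id, confidence, threshold=0.5):
    """Serialize the classification and its confidence as a JSON report."""
    label = "malignant" if confidence >= threshold else "benign"
    return json.dumps(
        {
            "patient_id": patient_id,
            "classification": label,
            "confidence_malignant": round(confidence, 4),
            "threshold": threshold,
        },
        indent=2,
    )


# Hypothetical two-gene model and patient measurement.
model_weights = {"GENE_A": 1.0, "GENE_B": -1.0}
patient_expression = {"GENE_A": 3.0, "GENE_B": 1.0}
confidence = malignancy_confidence(patient_expression, model_weights)
report = build_report("PATIENT-001", confidence)
print(report)
```

The threshold could be moved away from 0.5 to trade sensitivity against specificity, for example where a missed malignancy is considered more costly than an unnecessary follow-up.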
[0493] Embodiment 48. The method of any one of embodiments 29 to 47, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0494] Embodiment 49. The method of any one of embodiments 29 to 48, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0495] Embodiment 50. The method of any one of embodiments 29 to 49, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0496] Embodiment 51. The method of any one of embodiments 29 to 50, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0497] Embodiment 52. 
The method of any one of embodiments 29 to 51, comprising classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0498] Embodiment 53. The method of any one of embodiments 29 to 52, wherein the trained machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. [0499] Embodiment 54. A method for treating cancer in a patient having a solid tumor, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more
clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on the patient’s tumor being classified as a malignant tumor. [0500] Embodiment 55. The method of embodiment 54, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer. [0501] Embodiment 56. A system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid 
tumor. [0502] Embodiment 57. A non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a database, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of any one of embodiments 1 to 10, and ii) optionally clinical characteristics data of one or
more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. [0503] Embodiment 58. A method for obtaining a gene set capable of classifying whether a patient has cancer, the method comprising: (a) providing a dataset as an input to a machine learning classifier, wherein said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of the gene modules form the features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer. [0504] Embodiment 59. The method of embodiment 58, wherein the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm. [0505] Embodiment 60. 
The method of embodiment 58 or 59, wherein the feature selection comprises starting from a featureless model, and sequentially adding the next best feature using leave-one-group-in importance (LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved. [0506] Embodiment 61. The method of embodiment 60, wherein the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001. [0507] Embodiment 62. The method of any one of embodiments 58 to 61, wherein the dataset is a batch corrected dataset. [0508] Embodiment 63. The method of any one of embodiments 58 to 62, wherein the plurality of gene modules are obtained by a method comprising:
providing an initial data set comprising gene expression measurements from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules. [0509] Embodiment 64. The method of embodiment 63, wherein the M genes are clustered based on protein-protein interactions of the proteins encoded by the M genes. [0510] Embodiment 65. The method of embodiment 63 or 64, wherein the M genes are the M most variably expressed genes of the initial data set. [0511] Embodiment 66. The method of any one of embodiments 63 to 65, wherein M is 500 to 10000. [0512] Embodiment 67. The method of any one of embodiments 58 to 66, further comprising analyzing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient. [0513] Embodiment 68. 
The method of embodiment 67, wherein the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, or all genes selected from the genes within the gene set obtained in step (c). [0514] Embodiment 69. The method of embodiment 67, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). [0515] Embodiment 70. The method of any one of embodiments 67 to 69, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0516] Embodiment 71. The method of any one of embodiments 67 to 70, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0517] Embodiment 72. The method of any one of embodiments 67 to 71, wherein the method classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0518] Embodiment 73. The method of any one of embodiments 67 to 72, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0519] Embodiment 74. The method of any one of embodiments 67 to 73, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0520] Embodiment 75. The method of any one of embodiments 67 to 74, wherein analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set. [0521] Embodiment 76. The method of embodiment 75, further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference. [0522] Embodiment 77. 
The method of embodiment 75 or 76, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. [0523] Embodiment 78. The method of any one of embodiments 75 to 77, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer.
[0524] Embodiment 79. The method of any one of embodiments 67 to 78, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. [0525] Embodiment 80. The method of any one of embodiments 67 to 79, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. [0526] Embodiment 81. The method of embodiment 80, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. [0527] Embodiment 82. The method of any one of embodiments 58 to 81, wherein the cancer is a solid cancer. [0528] Embodiment 83. The method of embodiment 82, wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. [0529] Embodiment 84. The method of any one of embodiments 58 to 81, wherein the cancer is a blood cancer. [0530] Embodiment 85. 
The method of embodiment 84, wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. [0531] Embodiment 86. A method for classifying whether a patient has cancer, the method comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) of any one of embodiments 58-66 as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine-learning model the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference, wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient.
[0532] Embodiment 87. The method of embodiment 86, wherein the patient dataset comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200 or all genes selected from the genes within the gene set obtained in step (c) of any one of embodiments 58 to 66. [0533] Embodiment 88. The method of embodiment 86, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). [0534] Embodiment 89. The method of any one of embodiments 86 to 88, wherein the patient dataset is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof. [0535] Embodiment 90. The method of any one of embodiments 86 to 88, wherein the patient dataset is derived from the gene expression measurements using GSVA. 
[0536] Embodiment 91. The method of any one of embodiments 86 to 90, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. [0537] Embodiment 92. The method of any one of embodiments 86 to 91, wherein the method classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0538] Embodiment 93. The method of any one of embodiments 86 to 92, wherein the method classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.
[0539] Embodiment 94. The method of any one of embodiments 86 to 93, wherein the method classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0540] Embodiment 95. The method of any one of embodiments 86 to 94, wherein the method classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0541] Embodiment 96. The method of any one of embodiments 86 to 95, wherein the method classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. [0542] Embodiment 97. The method of any one of embodiments 86 to 96, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. [0543] Embodiment 98. 
The method of any one of embodiments 86 to 97, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. [0544] Embodiment 99. The method of any one of embodiments 86 to 98, wherein the cancer is a solid cancer. [0545] Embodiment 100. The method of embodiment 99, wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. [0546] Embodiment 101. The method of any one of embodiments 86 to 98, wherein the cancer is a blood cancer.
[0547] Embodiment 102. The method of embodiment 101, wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS-related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. [0548] Embodiment 103. The method of any one of embodiments 86 to 102, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. [0549] Embodiment 104. The method of embodiment 103, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
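The feature selection of embodiments 60 and 61 (start from a featureless model, then add one gene module at a time until the MMCE improvement no longer exceeds a threshold) can be sketched as a greedy loop. The following Python sketch is illustrative only and is not the applicant's implementation: the function name, the `mmce` callback, the module names, and the toy error function are all hypothetical, and a real `mmce` would train and cross-validate a classifier on the supplied genes.

```python
from typing import Callable, Dict, List, Tuple


def sequential_group_selection(
    groups: Dict[str, List[str]],
    mmce: Callable[[List[str]], float],
    threshold: float = 0.001,
) -> Tuple[List[str], float]:
    """Greedy forward selection over gene modules (hypothetical helper).

    groups    : module name -> genes in that module
    mmce      : callback returning the mean misclassification error of a
                model built on the given genes ([] = featureless model)
    threshold : minimum MMCE improvement required to add another module
    """
    selected: List[str] = []
    best_error = mmce([])  # featureless baseline
    remaining = set(groups)
    while remaining:
        # Leave-one-group-in: score each candidate module added to the current set.
        scored = []
        for name in sorted(remaining):
            genes = [g for m in selected + [name] for g in groups[m]]
            scored.append((mmce(genes), name))
        error, name = min(scored)
        if best_error - error <= threshold:
            break  # no improvement over the threshold, so stop
        selected.append(name)
        remaining.remove(name)
        best_error = error
    return selected, best_error


# Toy data: module and gene names are placeholders, not genes from the disclosure.
modules = {"M1": ["g1", "g2"], "M2": ["g3"], "M3": ["g4", "g5"]}


def toy_mmce(genes: List[str]) -> float:
    """Stand-in error function: only g1 and g3 reduce the error."""
    error = 0.5
    if "g1" in genes:
        error -= 0.25
    if "g3" in genes:
        error -= 0.10
    return error


chosen, final_error = sequential_group_selection(modules, toy_mmce, threshold=0.001)
print(chosen, round(final_error, 3))
```

With the toy error function, module M3 is never added because it yields no improvement over the threshold.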
EXAMPLES
Example 1: Machine Learning Classification of RNA-Seq Data
[0550] Differential gene expression analysis was performed to identify genes that were most differentially expressed (e.g., biomarkers) in whole blood samples between subjects having benign lung nodules and subjects having malignant lung nodules. A biomarker dataset comprising samples from 152 subjects was analyzed. Among those, 80 of the samples in the biomarker dataset had a diagnosis of a benign lung nodule, and 72 samples had a diagnosis of a malignant lung nodule. Gene expression measurements of whole blood samples from the subjects were analyzed using an RNA-Seq technique.
[0551] A training dataset comprising lung nodule samples from 604 subjects was used to train a machine learning algorithm. Gene expression measurements of whole blood samples from the subjects were analyzed. Subsequently, a validation dataset comprising samples of lung nodules from 487 subjects was used to validate the machine learning algorithm. The samples were analyzed using RNA-Seq techniques. In the following examples, eight machine learning classifiers including gradient boosting machines (GBM), logistic regression model (LOG), support vector machines (SVM), random forest (RF), generalized linear model (GLM), k-nearest neighbors (kNN), naïve Bayes (NB) and elastic net (EN) were trained to distinguish malignant lung nodules versus benign lung nodules based on an analysis of the RNA-Seq data.
[0552] Eight different machine learning classifiers were trained to determine a high-performing set of genes to distinguish malignant lung nodules versus benign lung nodules using the biomarker dataset. The biomarker dataset was obtained by whole transcriptome RNA sequencing. The biomarker dataset comprised 80 lung nodule samples that had a diagnosis of a benign lung nodule and 72 samples that had a diagnosis of a malignant lung nodule.
[0553] A total of 1,430 genes were initially identified as differentially expressed between malignant lung nodule samples and benign lung nodule samples. A Log2 ratio of gene expression of the differentially expressed genes was used to determine the optimal set of genes. The Log2 ratio was defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample. After removing a subset of the 1,430 genes that exhibited collinear expression (correlation coefficient r > 0.8), a total of 1,178 gene features (Table 9) were identified.
Table 9: set of 1,178 gene features (rows legible in this copy)
AARS2 CCAR1 EIF4G1 HRSP12 MEST PJA1 SERP1 TMEM189
ACTN4 CD58 EXOC1 INHBB MKKS POLR1B SLAMF7 TNNT1
AKR1C1 CFAP58-AS1 FAT4 KCNA2 MYLK PRDX3 SLC46A1 TSPAN33
APOBEC3F CNNM4 FLT3 L3MBTL1 NFKBIB PTOV1-AS2 SNORA38 UGCG
ATAD3B CSF1R GEMIN5 LINC00944 NT5M RAI1 SRP68 VPS26A
BEX1 DDA1 GPR160 LOC101927153 P3H4 RFWD3 SYTL2 YIPF1
C1GALT1 DLG4 HEBP2 LUC7L PDE1B RUNX1-IT1 TGFB1 ZNF500
CADM1 EHMT1 HNRNPUL1 MCM8 PIK3C2B SEC1P TMEM104 ZNF865
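The two preprocessing steps of paragraph [0553], a log2 expression ratio and removal of genes whose expression is collinear (r > 0.8), can be illustrated with a short standard-library Python sketch. This is a sketch under stated assumptions rather than the procedure actually used: the greedy keep-first order of the filter, the helper names, and the gene identifiers `GENE_A` to `GENE_C` are hypothetical, and the disclosure does not say which gene of a correlated pair was retained.

```python
import math
from typing import Dict, List, Sequence


def log2_ratio(test_level: float, reference_level: float) -> float:
    """log2(T/R) for expression level T (test) and R (reference), both > 0."""
    return math.log2(test_level / reference_level)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient r of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_collinear(profiles: Dict[str, List[float]], cutoff: float = 0.8) -> List[str]:
    """Keep a gene only if its r with every already-kept gene is <= cutoff.

    Assumption: genes are visited in dictionary order and the first member
    of a correlated pair is the one retained.
    """
    kept: List[str] = []
    for gene, values in profiles.items():
        if all(pearson(values, profiles[other]) <= cutoff for other in kept):
            kept.append(gene)
    return kept


# Hypothetical expression profiles across four samples.
profiles = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.5],  # nearly proportional to GENE_A
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
kept_genes = drop_collinear(profiles, cutoff=0.8)
print(log2_ratio(8.0, 2.0), kept_genes)
```

In this toy input GENE_B is dropped because it tracks GENE_A almost exactly, while GENE_C is retained.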
[0554] The eight machine learning classifiers were then validated using the 1,178 gene features via a cross validation method. In the cross validation method, the biomarkers dataset was divided into two groups comprising a training set and a validation set. FIGs.1A-1B show results of a cross validation experiment when 80% of the dataset was considered for training the classifiers while 20% of the dataset was used for validation.
[0555] FIG.1A is a receiver operating characteristic (ROC) plot showing performance of eight machine learning classifiers using a set of 1,178 gene features generated from ribonucleic acid (RNA) sequencing (RNA-Seq) data to distinguish malignant lung nodules versus benign lung nodules. The genes in the set of 1,178 genes were differentially expressed in blood samples of patients with malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
[0556] FIG.1B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using a set of 1,178 gene features to distinguish malignant lung nodules versus benign lung nodules. The corresponding data from the ROC plot of FIG.1A are tabulated in FIG.1B. The GBM, SVM, and EN classifiers were the most effective classifiers.
[0557] A similar validation was performed using 75% of the dataset for training the classifiers and 25% of the dataset for validation. FIGs.2A-2B show results of a cross validation experiment when 75% of the dataset was considered for training the classifiers while 25% of the dataset was used for validation.
[0558] FIG.2A is a ROC plot for an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules based on an analysis of RNA-Seq data. The six machine learning classifiers include LOG, GLM, kNN, RF, SVM, and GBM. FIG.2B shows results of exemplary trained machine learning classifier algorithms in an optimization of differentially expressed genes to distinguish malignant lung nodules versus benign lung nodules. The corresponding data from the ROC plot of FIG.2A are tabulated in FIG.2B. The GBM, SVM, and kNN classifiers were the most effective classifiers.
[0559] In order to obtain a smaller number of features to classify lung nodules, the top 50 predictive genes from each of the seven classifiers that accurately predicted lung nodules (FIGs.1A-1B) were combined. Furthermore, overlapping genes were removed, thereby yielding a gene set of 182 gene features (as shown in Table 1).
Table 1: set of 182 gene features (row legible in this copy)
ASAP1-IT2 BEX1 DPP9 HP MTFMT POM121 SLC35B3 TUBA4B
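Combining the top 50 predictive genes from several classifiers and removing overlaps, as described in paragraph [0559], amounts to a de-duplicated union of per-classifier rankings. With seven classifiers the union can hold at most 7 x 50 = 350 genes, and overlap between classifiers reduces that count. The sketch below is illustrative only; the importance scores, the gene names, and the helper names are hypothetical, and the disclosure does not state how per-classifier importance was computed.

```python
from typing import Dict, List


def top_k(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k genes with the highest importance (ties broken by name)."""
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]


def combine_top_predictors(
    per_classifier: Dict[str, Dict[str, float]], k: int = 50
) -> List[str]:
    """Union of each classifier's top-k genes, with overlapping genes removed."""
    combined: List[str] = []
    seen = set()
    for classifier in sorted(per_classifier):
        for gene in top_k(per_classifier[classifier], k):
            if gene not in seen:
                seen.add(gene)
                combined.append(gene)
    return combined


# Hypothetical importance scores for two classifiers; k=2 keeps the toy small.
scores = {
    "GBM": {"g1": 0.9, "g2": 0.5, "g3": 0.1},
    "SVM": {"g2": 0.8, "g4": 0.7, "g1": 0.2},
}
gene_set = combine_top_predictors(scores, k=2)
print(gene_set)
```

Here g2 appears in both top-2 lists but is counted once, so four selections collapse to three genes.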
[0561] Performance of the classifiers using only the 182 gene features, as compared to the 1,178 gene features, in predicting lung nodules was examined. Performance results of the seven classifiers using a 10-fold cross validation experiment with 182 gene features are shown in FIGs.3A-3B.
[0562] FIG.3A is a ROC plot showing performance of seven machine learning classifiers using a set of 182 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. The corresponding data from the ROC plot of FIG.3A are tabulated in FIG.3B. FIG.3B shows results of exemplary trained machine learning classifier algorithms to analyze RNA-Seq data using the set of 182 gene features to distinguish malignant lung nodules versus benign lung nodules.
[0563] Each cross validation dataset comprised 80% training data and 20% validation data. The results demonstrated that the 182 gene features effectively distinguished malignant lung nodules versus benign lung nodules. In general, use of the 182 genes was more effective than the entire set of 1,178 genes. Furthermore, the GBM and LOG machine learning classifiers achieved better predictive values when 182 gene features were used, as compared to the entire set of 1,178 gene features. The specificity of the SVM model decreased by about 0.05, yet the overall performance of the SVM model improved when the set of 182 gene features was used, as compared to the entire set of 1,178 gene features.
[0564] Separately, the entire set of 1,178 genes was examined independently in male subjects and female subjects. The GBM machine learning classifier achieved the best predictive performance for male subjects, and the NB machine learning classifier achieved the best predictive performance for female subjects, compared to other classifiers. A gene importance was calculated for each gene feature based on the rank of the gene feature in the GBM classifier for males and the rank of the same gene feature in the NB classifier for females. Genes with a gene importance of >50 were selected for inclusion in a smaller subset, thereby producing a set of 175 gene features from the set of 1,178 gene features initially used to perform the predictions.
[0565] A similar 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used to examine the effectiveness of the set of 175 gene features using the eight classifiers.
FIG.4A shows the ROC plot of the performance of the classifiers using 175 genes over the entire dataset (males and females). The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. FIG.4B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.4A.
[0566] The corresponding data from the ROC plot of FIG.4A are tabulated in FIG.4B. The kNN and EN classifiers achieved better predictive values using the set of 175 gene features as compared to using the set of 182 gene features.
[0567] FIG.5A shows the ROC plot of the eight classifiers’ performance using the 175 gene features with a 10-fold validation technique with 80% training and 20% validation split. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. The corresponding data from the ROC plot of FIG.5A are tabulated in FIG.5B. The GBM and SVM classifiers achieved the highest predictive values using the 175 gene features.
Gene feature table (row legible in this copy)
ACTN4 CCDC94 EIF2AK4 HABP4 MED28 PDIA4 SEPT11 TMEM218
[0569] The set of 175 gene features and the set of 182 gene features had a total of 62 shared gene features, which overlapped between the two sets. The 62 gene features were examined for their effectiveness in predicting lung nodules using the biomarkers dataset. 10-fold cross validation with a training to validation split of 75% to 25% was used. FIG.6A is a ROC plot showing performance of machine learning classifiers using a set of the 62 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. FIG.6B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.6A. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. The set of 62 gene features achieved high predictive value across all eight classifiers.
Gene feature table (rows legible in this copy)
ABCF1 BCAT1 DSC2 HOXB2 MOGS PSMD5 SLC35B3 VPS25
ACLY BEX1 EEF1DP3 LAS1L MTFMT RABL6 SPECC1L WDR20
[0571] Separately, the set of 182 gene features and the set of 175 gene features were combined and overlapping genes were removed to produce a set of 295 gene features. This set of 295 gene features was tested using the biomarkers dataset to examine its effectiveness in classifying lung nodules. Classifiers were tested using the 295 gene features using a 10-fold cross validation technique with a 75% to 25% split to generate training and validation datasets. FIG.7A is a ROC plot showing performance of machine learning classifiers using a set of 295 gene features generated from RNA-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN.
[0572] FIG.7B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.7A. All classifiers except GLM achieved high predictive values in classifying lung nodules using the biomarkers dataset.
Gene feature table (rows legible in this copy)
ABCF1 C1GALT1 DTWD1 HCG27 MKKS PHRF1 SEPT11 TPP1
ARHGEF10 COA4 FLT3 LINC00925 NT5M RASA3 TCF20 WDR20
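The sizes of the gene sets reported above are mutually consistent under inclusion-exclusion: a set of 182 and a set of 175 that share 62 members have a union of 182 + 175 - 62 = 295 members. The following Python check uses synthetic placeholder identifiers (not genes from the disclosure) purely to illustrate that arithmetic.

```python
# Synthetic placeholder identifiers; only the set sizes mirror the text.
set_182 = {f"gene_{i}" for i in range(182)}        # gene_0 .. gene_181
set_175 = {f"gene_{i}" for i in range(120, 295)}   # gene_120 .. gene_294

shared = set_182 & set_175      # features present in both sets
combined = set_182 | set_175    # both sets merged, overlaps counted once

print(len(set_182), len(set_175), len(shared), len(combined))
```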
[0574] Results demonstrated that machine learning classifiers performed well in distinguishing malignant lung nodules from benign lung nodules. Feature selection was performed to reduce the set of features from 1,178 genes to one of (i) a set of 295 genes, (ii) a set of 182 genes, (iii) a set of 175 genes, or (iv) a set of 62 genes, which achieved positive results in distinguishing malignant lung nodules from benign lung nodules. In the following examples, larger datasets were investigated to compensate for heterogeneity in clinical data.
[0575] The top 50 predictors from seven classifiers were selected and, after removing overlapping genes, a set of 142 gene features (Table 5) was obtained. The seven classifiers were the eight classifiers excluding the GLM. Gene expression data for the set of 142 gene features were obtained using RNA-Seq. All eight classifiers were trained and validated using the set of 142 gene features over the biomarkers dataset using a 10-fold cross validation technique with an 80% to 20% training and validation data split.
Table 5: set of 142 gene features (rows legible in this copy)
ABCF1 CEP250 GUSB MIR22HG PLCB1 SAV1 TSPAN33
ABHD3 CHMP4A HDAC3 MIR3939 PLCH1 SCAMP3 UCP2
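The text repeatedly pairs "10-fold" validation with an 80% to 20% (or 75% to 25%) split. A standard 10-fold partition would hold out 10% per fold, so one possible reading, offered here only as an assumption, is ten repeated random splits at the stated ratio. The sketch below generates such splits, stratified by diagnosis, for a dataset shaped like the 152-subject biomarker dataset (80 benign, 72 malignant). The function name and seeds are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices into train/validation, preserving class balance."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train: List[int] = []
    valid: List[int] = []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        valid.extend(members[cut:])
    return sorted(train), sorted(valid)


# Class sizes taken from the biomarker dataset described in the text.
labels = ["benign"] * 80 + ["malignant"] * 72

# Assumed interpretation: ten independent 80/20 splits, one per repetition.
splits = [stratified_split(labels, 0.8, seed=s) for s in range(10)]
for train_idx, valid_idx in splits[:2]:
    print(len(train_idx), len(valid_idx))
```

Each repetition trains on 122 samples (64 benign, 58 malignant) and validates on the remaining 30; performance would then be averaged over the ten repetitions.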
Example 2: Machine Learning Classification of Ampli-Seq Data
[0576] A larger dataset from 604 subjects was assembled to examine the effectiveness of the set of 175 gene features in distinguishing malignant versus benign lung nodules. Gene expression measurements of whole blood samples from the subjects were analyzed using an Ampli-Seq technique. The training dataset was obtained using Ampli-Seq targeting the 175 genes determined previously. The training dataset comprised 301 lung nodule samples that were known to be benign and 303 samples that were diagnosed as malignant. Normalized Ampli-Seq read counts (RPM) of the 175 genes were provided as input data to the classifiers.
[0577] Results of the eight classifiers in a 10-fold validation using a data split of 80% training data to 20% validation data are shown in FIGs.8A-8B. FIG.8A is a ROC plot showing performance of machine learning classifiers using a set of 175 gene features generated from Ampli-Seq data to distinguish malignant lung nodules versus benign lung nodules. The eight machine learning classifiers include LOG, GLM, kNN, RF, SVM, GBM, NB, and EN. FIG.8B shows tabulated results of exemplary trained machine learning classifier algorithms corresponding to FIG.8A. A similar 10-fold validation was performed using a training to validation data split of 75% to 25%.
Example 3: Machine Learning Classification and Validation using Ampli-Seq data
[0578] The performance of the machine learning classifiers of Example 2 was validated using a dataset of lung nodule samples from 487 subjects. The validation dataset was obtained using Ampli-Seq targeting the set of 175 genes. The validation dataset comprised 142 lung nodule samples that were diagnosed as being malignant.
[0579] Normalized Ampli-Seq read counts (RPM) of the set of 175 genes were provided as input data to the classifiers. The best performing classifier using the set of 175 gene features (LOG) and the best performing classifier using the set of 85 gene features (GBM) were compared on the validation dataset. Data from the validation dataset were not used to train the classifiers.
[0580] FIG.9A shows the cumulative fraction of lung nodules predicted by a logistic regression classifier using a set of 175 gene features. FIG.9B shows the cumulative fraction of lung nodules predicted by a gradient boosting classifier using the set of 175 gene features.
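The classifier inputs in Examples 2 and 3 are normalized read counts (RPM, commonly read as reads per million). A minimal sketch of that normalization is given below. It assumes the denominator is the total number of reads assigned to the targeted panel for the sample, which the text does not specify; the function name and the counts are hypothetical.

```python
from typing import Dict


def reads_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw per-gene read counts so that they sum to one million.

    Assumption: the library size is the sum of reads over the panel genes.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Hypothetical raw counts for one sample.
raw = {"gene_a": 250, "gene_b": 750}
rpm = reads_per_million(raw)
print(rpm)
```

Normalizing each sample in this way makes expression values comparable across samples sequenced to different depths before they are passed to a classifier.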
[0581] The cumulative fraction of malignant lung nodules predicted by the LOG model using the set of 175 features (FIG.9A) showed overfitting when compared to the GBM using the set of 85 features (FIG.9B). The LOG classifier identified 266 patients with malignant lung nodules from the total of 487 patients (FIG.9A). Meanwhile, using the subset of 85 genes, the GBM classifier identified 127 out of the 142 patients with malignant lung nodules.
Example 4: Machine Learning Classification using clinical characteristics data.
[0582] A biomarker dataset obtained from 152 subjects was analyzed. Among those, 80 subjects had a diagnosis of a benign lung nodule, and 72 subjects had a diagnosis of a malignant lung nodule. A set of 8 clinical characteristics features (Table 6) was examined for its effectiveness in predicting lung nodules using the biomarker dataset. FIG.12 shows the correlation plot of the 8 clinical characteristics features (Table 6).
Clinical Characteristics
SEX (sex of the subject) Table 6: Clinical Ch
[0583] Nine machine learning classifiers, namely Logistic regression model (LOG), Random forest (RF), Support vector machines (SVM), Decision tree learning (DTREE), Adaptive boosting (ADB), Naïve Bayes (NB), Linear discriminant analysis (LDA), k-nearest neighbors (kNN), and Gradient boosting machines (GBM), were trained to distinguish malignant lung nodules versus benign lung nodules based on clinical characteristics data of the 8 clinical characteristics features (Table 6).
[0584] FIG.13A shows ROC plots of the performance of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features (Table 6) to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.803, 0.782, 0.393, 0.618, 0.792, 0.806, 0.804, 0.750 and 0.764, respectively. FIG.13B shows Precision/Recall curves of the 9 machine learning classifiers using clinical characteristics data of the 8 clinical characteristics features to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.703, 0.688, 0.351, 0.656, 0.720, 0.710, 0.699, 0.766 and 0.646, respectively. FIG.13C presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.13A. FIG.13D presents feature importance of the 8 clinical characteristics features for the 9 machine learning classifiers. FIG.13E shows feature importance of the 8 clinical characteristics features for all the 9 classifiers. As can be seen from FIGs.13D and E, the three topmost contributors, predictors, or features were NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE, with the fourth being NCNMYN (Nodule Spiculated).
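For readers less familiar with the ROC AUC values reported above, the following is a minimal sketch of how such a value can be computed from classifier scores. It uses the pairwise (Mann-Whitney) formulation and is an illustration only, not the tooling used to produce the figures; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for malignant, 0 for benign.
    scores: classifier output, higher meaning "more likely malignant".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # malignant sample ranked above benign sample
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for three malignant and three benign nodules.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Under this formulation an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, so a value below 0.5 indicates a ranking worse than chance.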
[0585] Next, the effectiveness of the top 4 features as determined above, i.e. NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (Nodule Spiculated), was examined using the nine classifiers.
[0586] FIG.14A shows ROC plots of the performance of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, NCNSZE, NCNUPYN, AGE, and NCNMYN, to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.858, 0.730, 0.840, 0.586, 0.736, 0.811, 0.862, 0.725 and 0.735, respectively. FIG.14B shows Precision/Recall curves of the 9 machine learning classifiers using clinical characteristics data of the 4 clinical characteristics features, NCNSZE, NCNUPYN, AGE, and NCNMYN, to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.746, 0.703, 0.791, 0.626, 0.598, 0.695, 0.750, 0.653 and 0.689, respectively. FIG.14C presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.14A. FIG.14D presents feature importance of the 4 clinical characteristics features for the 9 machine learning classifiers. FIG.14E shows feature importance of the 4 clinical characteristics features for all the 9 classifiers. As can be seen from FIGs. 13A and 14A, several of the classifiers, including LOG, SVM, and LDA, performed better when using the top 4 predictors (NCNSZE, NCNUPYN, AGE, and NCNMYN) than when using all 8 predictors (Table 6).
[0587] A larger dataset from 604 subjects was assembled to examine the effectiveness of the clinical features in distinguishing malignant versus benign lung nodules. Among those, 301 of the samples in the biomarker dataset had a diagnosis of a benign lung nodule, and 303 samples had a diagnosis of a malignant lung nodule.
A set of 9 clinical characteristics features (the clinical characteristics in Table 6, and cancer history (Y/N)) was examined for its effectiveness in predicting lung nodules using the larger dataset.
[0588] FIG.15A shows ROC plots of the performance of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the larger dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.773, 0.745, 0.730, 0.661, 0.771, 0.786, 0.768, 0.654 and 0.757, respectively. FIG.15B shows Precision/Recall curves of the 9 machine learning classifiers using clinical characteristics data of the 9 clinical characteristics features to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.747, 0.690, 0.673, 0.740, 0.759, 0.746, 0.743, 0.633 and 0.707, respectively. FIG.15C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.15A. FIG.15D shows feature importance of the 9 clinical characteristics features for the 9 machine learning classifiers. FIG.15E shows feature importance of the 9 clinical characteristics features for all the 9 models. As can be seen from FIGs.15D and E, the three topmost contributors, predictors, or features were NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE.
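The validation scheme is described as 10-fold cross validation using an 80%/20% split. One possible reading, sketched below purely as an assumption, is ten repeated random splits in which 80% of each class goes to training and 20% to validation. The function name, the stratification, and the seed are illustrative choices that the text does not specify.

```python
import random


def stratified_splits(labels, n_repeats=10, train_fraction=0.8, seed=0):
    """Yield (train_idx, valid_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for _ in range(n_repeats):
        train, valid = [], []
        for idx in by_class.values():
            shuffled = idx[:]
            rng.shuffle(shuffled)
            cut = round(len(shuffled) * train_fraction)
            train.extend(shuffled[:cut])   # first 80% of this class
            valid.extend(shuffled[cut:])   # remaining 20% of this class
        yield sorted(train), sorted(valid)


# Class sizes of the larger dataset: 301 benign (0) and 303 malignant (1).
labels = [0] * 301 + [1] * 303
splits = list(stratified_splits(labels))
```

Each classifier would then be fitted on the training indices and scored on the validation indices of every split, with the metrics averaged over the ten repeats.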
Example 5: Machine Learning Classification using gene expression data and clinical characteristics data.
[0589] Based on the results obtained in the above examples, a combination of a set of 142 gene features (Table 5) and a set of 3 clinical characteristics features was examined for its effectiveness in predicting lung nodules. The 142 gene features were selected based on results of Example 1. The 3 clinical characteristics features, NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE, were selected based on the results of Example 4. Gene expression measurements were from whole blood samples of the subjects. A combined biomarker dataset comprising samples from the 152 subjects was analyzed. Among those, 80 subjects had a diagnosis of a benign lung nodule, and 72 subjects had a diagnosis of a malignant lung nodule.
[0590] FIG.16A shows ROC plots of the performance of the 9 machine learning classifiers using gene expression data of the 142 gene features and clinical characteristics data of the 3 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the combined dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.919, 0.819, 0.829, 0.660, 0.690, 0.783, 0.905, 0.826 and 0.795, respectively. FIG.16B shows Precision/Recall curves of the 9 machine learning classifiers using gene expression data of the 142 gene features and clinical characteristics data of the 3 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.854, 0.780, 0.756, 0.632, 0.619, 0.663, 0.754, 0.764 and 0.687, respectively.
FIG.16C shows the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A. FIG.16D presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.16A, with oversampling correction applied (e.g. 80 samples with benign lung nodules, and 80 samples with malignant lung nodules). As can be seen from FIGs.16C and D, relatively high predictive value can be achieved using the set of 142 gene features (Table 5) and the set of 3 clinical characteristics NCNSZE, NCNUPYN, and AGE as features. The top two contributors, predictors, or features were nodule size and the BCAT1 gene. Table 7 shows the top 34 predictors obtained from the machine learning classifier using the combined dataset of Example 5. Table 7 contains 31 lung-disease associated genes and 3 clinical characteristics (e.g. NCNSZE, NCNUPYN, and AGE).
Predictors
VPS37C AGE Table 7: Top 34 predictors from
[0591] Next, the top 34 predictors were examined for their effectiveness in predicting lung nodules. A biomarker data set for the top 34 predictors was obtained from the 152 subjects. As described above, among those, 80 subjects had a diagnosis of a benign lung nodule, and 72 subjects had a diagnosis of a malignant lung nodule. The top 34 predictors contain 31 genes and the clinical characteristics NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), and AGE.
[0592] FIG.17A shows ROC plots of the performance of the 9 machine learning classifiers using measurement data (e.g. gene expression data or clinical characteristics data as appropriate) of the 34
predictors to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the biomarkers dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.992, 0.867, 0.950, 0.675, 0.800, 0.854, 0.963, 0.835 and 0.842, respectively. FIG.17B shows Precision/Recall curves of the 9 machine learning classifiers using measurement data of the 34 predictors to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.988, 0.807, 0.931, 0.687, 0.747, 0.815, 0.943, 0.814 and 0.811, respectively. FIG.17C presents the tabulated results of the machine learning classifiers LOG and RF corresponding to FIG.17A. FIG.17D presents the tabulated results of the 9 machine learning classifiers corresponding to FIG.17A, with oversampling correction applied (e.g. 80 samples with benign lung nodules, and 80 samples with malignant lung nodules). FIG.17E shows feature importance of the 34 features for all the 9 classifiers. As can be seen from FIGs. 17C and D, relatively high predictive value can be achieved using the 34 predictors containing the set of genes and clinical characteristics of Table 7.
Example 6: Machine Learning Classification using gene expression data and clinical characteristics data.
[0593] A combination of a set of 175 gene features (Table 2) and a set of 4 clinical characteristics features was examined for its effectiveness in predicting lung nodules. The 175 gene features were selected based on results of Examples 1, 2 and 3. The 4 clinical characteristics features, NCNSZE (nodule size), NCNUPYN (nodule in the upper lobe), AGE, and NCNMYN (Nodule Spiculated), were selected based on the results of Example 4.
Gene expression measurements were from whole blood samples of the subjects. A combined biomarker dataset containing measurement data of the 179 features (e.g. 175 gene features and 4 clinical characteristics features) from the 152 subjects was analyzed. As described above, among those, 80 subjects had a diagnosis of a benign lung nodule, and 72 subjects had a diagnosis of a malignant lung nodule.
[0594] FIG.18A shows ROC plots of the performance of the 9 machine learning classifiers using gene expression data of the 175 gene features and clinical characteristics data of the 4 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. 10-fold cross validation using an 80% training and 20% validation split of the combined biomarkers dataset was used. The AUCs of the ROC plots for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.674, 0.698, 0.669, 0.702, 0.723, 0.657, 0.630, 0.560 and 0.784, respectively. FIG.18B shows Precision/Recall curves of the 9 machine learning classifiers using gene expression data of the 175 gene features and clinical characteristics data of the 4 clinical characteristics to distinguish malignant lung nodules versus benign lung nodules. The AUCs of the Precision/Recall curves for the 9 machine learning classifiers LOG, RF, SVM, DTREE, ADB, NB, LDA, kNN, and GBM are 0.635, 0.724, 0.664, 0.727, 0.663, 0.630, 0.544, 0.550 and 0.729, respectively. FIG.18C shows the tabulated results of the 9 machine
learning classifiers corresponding to FIG.18A. Table 8 shows the top 22 predictors obtained from the machine learning classifier using the combined dataset of Example 6. Predictors NCNSZE Table 8: Top 22 predictors from Exa
Example 7: Machine Learning Classification of Pancreatic tumor
[0595] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign pancreatic tumors and malignant pancreatic tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects, can be analyzed. The gene expression measurements of the whole blood samples can be analyzed using the RNA-Seq technique. Among the plurality of subjects, some of the subjects can have a diagnosis of a benign pancreatic tumor, and some others of the subjects can have a diagnosis of a malignant pancreatic tumor. One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression
model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression can be trained to distinguish malignant pancreatic tumors versus benign pancreatic tumors based on analysis of the RNA-Seq data and clinical characteristics data.
[0596] A first group of genes were initially identified to be differentially expressed between samples from subjects having malignant pancreatic tumors and samples from subjects having benign pancreatic tumors. A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a first gene set containing a group of genes related to pancreatic cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample. The first gene set can be obtained from the first group of genes after removing a subset of genes that exhibit collinear expression (for example, a correlation coefficient r > 0.8). A first combined biomarker data set containing genes of the first gene set, and clinical characteristics selected from a group of clinical characteristics related to pancreatic cancer, as features can be examined for their effectiveness in classifying pancreatic tumors. Performance of the machine learning classifiers using gene expression data of the genes of the first gene set, and clinical characteristics data of the clinical characteristics related to pancreatic cancer, to distinguish malignant pancreatic tumors versus benign pancreatic tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used. The feature importance for the classifiers can be determined.
Based on feature importance values, a first optimal predictor set containing a first optimal gene set and a first optimal clinical characteristics set can be selected. The first optimal predictor set can be obtained by combining the top 50 predictors of the machine learning classifier being used, and removing the duplicate features.
[0597] A second combined biomarker data set (from the plurality of subjects), containing genes of the first optimal gene set, and clinical characteristics of the first optimal clinical characteristics set, as features can be examined for their effectiveness in classifying pancreatic tumors. Performance of the machine learning classifiers using gene expression data of the genes of the first optimal gene set, and clinical characteristics data of the clinical characteristics of the first optimal clinical characteristics set, to distinguish malignant pancreatic tumors versus benign pancreatic tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used. The machine learning models can distinguish malignant pancreatic tumors versus benign pancreatic tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the first optimal gene set, and clinical characteristics data of the clinical characteristics of the first optimal clinical characteristics set. The first optimal gene set can be capable of classifying a pancreatic tumor as benign or malignant.
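The predictor-selection step in this example (taking the top 50 predictors per classifier and removing duplicates) can be sketched as follows. This is one possible reading, assuming the top-ranked features of each classifier are pooled into a single set. The function names, feature names, and importance values are hypothetical, and `k=2` is used only to keep the illustration small.

```python
def top_predictors(importances, k=50):
    """Return the k feature names with the largest importance values."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def combined_predictor_set(importances_by_model, k=50):
    """Pool each model's top-k features, keeping the first occurrence only."""
    seen, combined = set(), []
    for importances in importances_by_model.values():
        for feature in top_predictors(importances, k):
            if feature not in seen:      # drop features already selected
                seen.add(feature)
                combined.append(feature)
    return combined


# Hypothetical importance values for two classifiers.
importances_by_model = {
    "LOG": {"GENE_A": 0.9, "AGE": 0.5, "GENE_B": 0.1},
    "RF": {"AGE": 0.7, "GENE_C": 0.6, "GENE_A": 0.2},
}
predictors = combined_predictor_set(importances_by_model, k=2)
```

The resulting list can then be divided into its gene members and its clinical-characteristic members to give the optimal gene set and the optimal clinical characteristics set.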
Example 8: Machine Learning Classification of Ovarian tumor
[0598] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign ovarian tumors and malignant ovarian tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects, can be analyzed. The gene expression measurements of the whole blood samples can be analyzed using the RNA-Seq technique. Among the plurality of subjects, some of the subjects can have a diagnosis of a benign ovarian tumor, and some others of the subjects can have a diagnosis of a malignant ovarian tumor. One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression can be trained to distinguish malignant ovarian tumors versus benign ovarian tumors based on analysis of the RNA-Seq data and clinical characteristics data.
[0599] A second group of genes were initially identified to be differentially expressed between samples from subjects having malignant ovarian tumors and samples from subjects having benign ovarian tumors. A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a second gene set containing a group of genes related to ovarian cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample.
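A minimal sketch of the ratio of T to R expressed on a log2 scale, with T and R as defined above. The pseudocount is an assumption added here to avoid division by zero for unexpressed genes; it is not stated in the text, and the expression values are hypothetical.

```python
import math


def log2_ratio(test_level, reference_level, pseudocount=1.0):
    """Return log2((T + c) / (R + c)) for expression levels T and R."""
    return math.log2((test_level + pseudocount) / (reference_level + pseudocount))


up = log2_ratio(7.0, 1.0)     # (7 + 1) / (1 + 1) = 4, i.e. a log2 ratio of 2
down = log2_ratio(1.0, 7.0)   # (1 + 1) / (7 + 1) = 0.25, i.e. a log2 ratio of -2
same = log2_ratio(5.0, 5.0)   # equal expression gives a log2 ratio of 0
```

On this scale, over-expression and under-expression of the same magnitude are symmetric around zero, which is convenient when ranking differentially expressed genes.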
The second gene set can be obtained from the second group of genes after removing a subset of genes that exhibit collinear expression (for example, a correlation coefficient r > 0.8). A first combined biomarker data set containing genes of the second gene set, and clinical characteristics selected from a group of clinical characteristics related to ovarian cancer, as features can be examined for their effectiveness in classifying ovarian tumors. Performance of the machine learning classifiers using gene expression data of the genes of the second gene set, and clinical characteristics data of the clinical characteristics related to ovarian cancer, to distinguish malignant ovarian tumors versus benign ovarian tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used. The feature importance for the classifiers can be determined. Based on feature importance values, a second optimal predictor set containing a second optimal gene set and a second optimal clinical characteristics set can be selected. The second optimal predictor set can be obtained by combining the top 50 predictors of the machine learning classifier being used, and removing the duplicate features.
[0600] A second combined biomarker data set (from the plurality of subjects), containing genes of the second optimal gene set, and clinical characteristics of the second optimal clinical characteristics set, as features can be examined for their effectiveness in classifying ovarian tumors. Performance of the machine learning classifiers using gene expression data of the genes of the second optimal gene set, and clinical characteristics data of the clinical characteristics of the second optimal clinical characteristics set,
to distinguish malignant ovarian tumors versus benign ovarian tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used. The machine learning models can distinguish malignant ovarian tumors versus benign ovarian tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the second optimal gene set, and clinical characteristics data of the clinical characteristics of the second optimal clinical characteristics set. The second optimal gene set can be capable of classifying an ovarian tumor as benign or malignant.
Example 9: Machine Learning Classification of Brain tumor
[0601] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign brain tumors and malignant brain tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects, can be analyzed. The gene expression measurements of the whole blood samples can be analyzed using the RNA-Seq technique. Among the plurality of subjects, some of the subjects can have a diagnosis of a benign brain tumor, and some others of the subjects can have a diagnosis of a malignant brain tumor.
One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression can be trained to distinguish malignant brain tumors versus benign brain tumors based on analysis of the RNA-Seq data and clinical characteristics data.
[0602] A third group of genes were initially identified to be differentially expressed between samples from subjects having malignant brain tumors and samples from subjects having benign brain tumors. A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a third gene set containing a group of genes related to brain cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample. The third gene set can be obtained from the third group of genes after removing a subset of genes that exhibit collinear expression (for example, a correlation coefficient r > 0.8). A first combined biomarker data set containing genes of the third gene set, and clinical characteristics selected from a group of clinical characteristics related to brain cancer, as features can be examined for their effectiveness in classifying brain tumors. Performance of the machine learning classifiers using gene expression data of the genes of the third gene set, and clinical characteristics data of the clinical characteristics related to brain cancer, to distinguish malignant brain tumors versus benign brain tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used.
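The collinear-expression filter mentioned in this example (removing genes whose correlation exceeds 0.8) can be sketched as a greedy pass over the genes. The text does not say which member of a correlated pair is removed; keeping the gene encountered first is an assumption, as are the function names and the expression values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0               # constant profile: treat as uncorrelated
    return cov / (sx * sy)


def drop_collinear(expression, threshold=0.8):
    """Keep a gene only if r <= threshold against every gene already kept."""
    kept = []
    for gene, profile in expression.items():
        if all(pearson(profile, expression[k]) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Hypothetical expression profiles across four samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.1],   # nearly proportional to GENE_A
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
kept = drop_collinear(expression)
```

Here GENE_B is removed because its profile is almost perfectly correlated with GENE_A, while GENE_C is retained.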
The feature importance for the classifiers can be determined. Based on feature importance values, a third optimal predictor set containing a third optimal gene set and a third optimal clinical characteristics set can be selected. The third optimal predictor set can be obtained by
combining the top 50 predictors of the machine learning classifier being used, and removing the duplicate features.
[0603] A second combined biomarker data set (from the plurality of subjects), containing genes of the third optimal gene set, and clinical characteristics of the third optimal clinical characteristics set, as features can be examined for their effectiveness in classifying brain tumors. Performance of the machine learning classifiers using gene expression data of the genes of the third optimal gene set, and clinical characteristics data of the clinical characteristics of the third optimal clinical characteristics set, to distinguish malignant brain tumors versus benign brain tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used. The machine learning models can distinguish malignant brain tumors versus benign brain tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the third optimal gene set, and clinical characteristics data of the clinical characteristics of the third optimal clinical characteristics set. The third optimal gene set can be capable of classifying a brain tumor as benign or malignant.
Example 10: Machine Learning Classification of Kidney tumor
[0604] Differential gene expression analysis can be performed to identify genes that are most differentially expressed between benign kidney tumors and malignant kidney tumors. Biomarker datasets containing gene expression measurements of whole blood samples from a plurality of subjects, and clinical characteristics data of the plurality of subjects, can be analyzed. The gene expression measurements of the whole blood samples can be analyzed using the RNA-Seq technique.
Among the plurality of subjects, some of the subjects can have a diagnosis of a benign kidney tumor, and some others of the subjects can have a diagnosis of a malignant kidney tumor. One or more machine learning classifiers selected from Gradient boosting machines (GBM), Logistic regression model (LOG), Support vector machines (SVM), Random forest (RF), Generalized linear model (GLM), k-nearest neighbors (kNN), Naïve Bayes (NB), Elastic Networks (EN), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Ridge regression, and Lasso regression can be trained to distinguish malignant kidney tumors versus benign kidney tumors based on analysis of the RNA-Seq data and clinical characteristics data.
[0605] A fourth group of genes were initially identified to be differentially expressed between samples from subjects having malignant kidney tumors and samples from subjects having benign kidney tumors. A Log2 ratio of gene expression of the differentially expressed genes can be used to determine a fourth gene set containing a group of genes related to kidney cancer. The Log2 ratio is defined as log2(T/R), where T is the gene expression level in the testing sample, and R is the gene expression level in the reference sample. The fourth gene set can be obtained from the fourth group of genes after removing a subset of genes that exhibit collinear expression (for example, a correlation coefficient r > 0.8). A first combined biomarker data set containing genes of the fourth gene set, and clinical characteristics selected from a group of clinical characteristics related to kidney cancer, as features can be examined for their
effectiveness in classifying kidney tumors. Performance of the machine learning classifiers using gene expression data of the genes of the fourth gene set, and clinical characteristics data of the clinical characteristics related to kidney cancer, to distinguish malignant kidney tumors versus benign kidney tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the first combined dataset can be used. The feature importance for the classifiers can be determined. Based on feature importance values, a fourth optimal predictor set containing a fourth optimal gene set and a fourth optimal clinical characteristics set can be selected. The fourth optimal predictor set can be obtained by combining the top 50 predictors of the machine learning classifier being used, and removing the duplicate features.
[0606] A second combined biomarker data set (from the plurality of subjects), containing genes of the fourth optimal gene set, and clinical characteristics of the fourth optimal clinical characteristics set, as features can be examined for their effectiveness in classifying kidney tumors. Performance of the machine learning classifiers using gene expression data of the genes of the fourth optimal gene set, and clinical characteristics data of the clinical characteristics of the fourth optimal clinical characteristics set, to distinguish malignant kidney tumors versus benign kidney tumors can be determined. 10-fold cross validation using an 80% training and 20% validation split of the second combined dataset can be used.
The machine learning models can distinguish malignant kidney tumors versus benign kidney tumors with relatively high accuracy, sensitivity, specificity, positive predictive value, and negative predictive value, using gene expression data of the genes of the fourth optimal gene set, and clinical characteristics data of the clinical characteristics of the fourth optimal clinical characteristics set. The fourth optimal gene set can be capable of classifying a kidney tumor as benign or malignant.
Example 11: Obtaining biomarker gene sets for classifying whether a patient has cancer
[0607] An initial dataset containing gene expression measurements of genes of an initial gene set from a plurality of reference samples is obtained. The plurality of reference samples contains a first plurality of reference samples obtained or derived from subjects having cancer, and a second plurality of reference samples obtained or derived from subjects not having cancer. The 2000 most variably expressed genes of the initial dataset are selected. The 2000 genes are clustered using PPI-based MCODE. The PPI-based MCODE gene clusters are used as feature inputs for the SGFI algorithm. Multiple subsample iterations are run, and the cluster sets that best distinguish subjects having cancer from subjects not having cancer are selected. For each cluster set (e.g., feature set), a radial SVM model is created and hyperparameters are tuned. 10-fold CV is performed and the feature set having the highest F1-score is selected. Differentially expressed genes from the selected feature set are selected to obtain the gene set capable of classifying whether a patient has cancer.
[0608] In one experiment, the first plurality of reference samples are obtained or derived from subjects having kidney cancer, and the second plurality of reference samples are obtained or derived from subjects not having kidney cancer, and the gene set obtained is capable of classifying whether a patient has kidney cancer.
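Two steps of Example 11 lend themselves to a short sketch: selecting the most variably expressed genes and scoring a candidate feature set by its F1-score. The text does not state which variability measure is used; variance across samples is assumed here. The function names, gene names, and values are hypothetical.

```python
from statistics import pvariance


def most_variable_genes(expression, n_top=2000):
    """Rank genes by variance across samples and keep the n_top highest."""
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    return ranked[:n_top]


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), with 1 denoting the cancer class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical expression profiles across four samples.
expression = {
    "GENE_A": [5.0, 5.0, 5.0, 5.0],   # variance 0
    "GENE_B": [1.0, 9.0, 1.0, 9.0],   # variance 16
    "GENE_C": [4.0, 6.0, 4.0, 6.0],   # variance 1
}
top2 = most_variable_genes(expression, n_top=2)
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In the workflow described above, the F1-score would be computed on the held-out folds of the 10-fold CV for each candidate feature set, and the set with the highest value retained.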
[0609] In another experiment, the first plurality of reference samples are obtained or derived from subjects having brain cancer, and second plurality of reference samples obtained or derived from subjects not having brain cancer, and the gene set obtained is capable of classifying whether a patient has brain cancer.
[0610] In another experiment, the first plurality of reference samples are obtained or derived from subjects having ovarian cancer, and second plurality of reference samples obtained or derived from subjects not having ovarian cancer, and the gene set obtained is capable of classifying whether a patient has ovarian cancer.
[0611] In another experiment, the first plurality of reference samples are obtained or derived from subjects having pancreatic cancer, and second plurality of reference samples obtained or derived from subjects not having pancreatic cancer, and the gene set obtained is capable of classifying whether a patient has pancreatic cancer.
[0612] In another experiment, the first plurality of reference samples are obtained or derived from subjects having lung cancer, and second plurality of reference samples obtained or derived from subjects not having lung cancer, and the gene set obtained is capable of classifying whether a patient has lung cancer.
Example 12: Obtaining biomarker gene sets for classifying whether a patient has lung cancer
[0613] Method 1: Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) is carried out on transcriptomic profiles of lung cancer subjects. MEGENA-generated co-expression modules that are significantly correlated to the clinical feature “Diagnosis” are used as features for the sequential grouped feature importance (SGFI) algorithm. The SGFI algorithm identifies the best combination of features that can distinguish the malignant from benign lung cancer samples.
The model starts with a null model and then adds the next best feature sequentially in a leave-one-group fashion until no improvement in the model metrics is observed. MEGENA modules that are significantly correlated to the diagnosis clinical variable are identified. The SGFI algorithm identifies the best combination of feature groups among the identified MEGENA modules that can best classify the malignant from benign lung cancer nodules. The best feature groups identified by SGFI are plugged in as final features and machine learning classifiers are built to distinguish the malignant from benign cancer samples. MEGENA and SGFI are implemented in R.
[0614] Method 2: Differential Gene Expression (DEG) analysis is performed between the malignant and benign lung cancer nodules using the limma package in R. The significant DE genes (FDR pval < 0.05) are used as features for the SGFI algorithm to identify the best combination of features that can classify the malignant from benign cancer samples with high accuracy.
[0615] Method 3: Differential Gene Expression (DEG) analysis is performed between the malignant and benign lung cancer nodules using the limma package in R. MCODE is performed on the significant DE genes (FDR pval < 0.05) to identify the protein-protein interaction networks. The PPI based
MCODE clusters are used as features for SGFI algorithm to identify the best combination of feature groups that classify the malignant from benign lung cancer samples with high accuracy.
[0616] While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
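The sequential selection of feature groups described for the SGFI algorithm in Example 12 can be sketched as a greedy forward search: start from a featureless model, add the group that lowers the error the most, and stop when the improvement no longer exceeds a threshold. The Python sketch below is an illustration under stated assumptions and is not the R implementation referred to in the specification; the error function `toy_mmce`, the module names, and the gain values are hypothetical stand-ins for a cross-validated mean misclassification error.

```python
def sequential_group_selection(groups, mmce, threshold=0.001):
    """Greedy forward selection over feature groups.

    groups: names of candidate feature groups (e.g. gene modules).
    mmce: callable that takes a tuple of selected group names and returns
          a misclassification error (lower is better).
    threshold: minimum error reduction required to keep adding groups.
    """
    selected = []
    best_error = mmce(tuple(selected))  # featureless baseline
    remaining = list(groups)
    while remaining:
        # Error obtained when each remaining group is added to the current selection.
        trial = {g: mmce(tuple(selected + [g])) for g in remaining}
        candidate = min(trial, key=trial.get)
        if best_error - trial[candidate] <= threshold:
            break  # no group improves the model by more than the threshold
        selected.append(candidate)
        remaining.remove(candidate)
        best_error = trial[candidate]
    return selected, best_error


def toy_mmce(selected):
    """Hypothetical error function: each module removes a fixed amount of error."""
    gains = {"module_1": 0.20, "module_2": 0.10, "module_3": 0.0}
    return round(0.50 - sum(gains[g] for g in selected), 6)


selected_groups, final_error = sequential_group_selection(
    ["module_1", "module_2", "module_3"], toy_mmce, threshold=0.001
)
print(selected_groups, final_error)
```

In practice the `mmce` callable would retrain and cross-validate a classifier for every candidate combination, so the cost grows with the number of groups; the threshold keeps the search from adding groups that contribute nothing measurable.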
Claims
1. A method for determining a gene set capable of classifying a solid tumor as benign or malignant without biopsy, the method comprising: a) obtaining a reference data set comprising a plurality of individual reference data sets, wherein a respective individual reference data set of the plurality of individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and, wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) training a machine learning model using the reference data set, wherein the machine learning model is trained to infer whether a solid tumor is benign or malignant based at least in part on gene expression measurements of the plurality of genes, and optionally clinical characteristics data of the one or more clinical characteristics; c) determining feature importance values of the plurality of genes; and d) determining a gene set based at least in part on the feature importance values. 2. 
The method of claim 1, wherein the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor; (ii) a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer; (iii) an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer; (iv) a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer; (v) a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer; (vi) a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer; (vii) an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer; (viii) a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer; or (ix) a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. 3. A method for developing a trained machine learning model capable of classifying a solid tumor of a patient as benign or malignant, the method comprising: (a) obtaining a first reference data set comprising a plurality of first individual reference data sets, wherein a respective first individual reference data set of the plurality of first individual reference data sets comprises i) gene expression measurements of a plurality of genes of a reference biological sample from a reference subject having a reference solid tumor, ii) data regarding
whether the reference solid tumor is benign or malignant, and iii) optionally clinical characteristics data of one or more clinical characteristics of the reference subject, and wherein the reference biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) training a first machine learning model using the first reference data set, wherein the first machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on one or more predictors selected from the plurality of genes, and optionally the one or more clinical characteristics; (c) determining feature importance values of the one or more predictors of the first machine learning model; (d) selecting A predictors of the first machine learning model based at least in part on the feature importance values, wherein A is an integer from 5 to 2000; and (e) training a second machine learning model based at least in part on a second reference data set comprising a plurality of second individual reference data sets, wherein a respective second individual reference data set of the plurality of second individual reference data sets comprises i) measurement data of the A predictors of the reference subject, and ii) data regarding whether the reference solid tumor of the reference subject is benign or malignant, to obtain the trained machine learning model, wherein the trained machine learning model is trained to infer whether a solid tumor is benign or malignant, based at least in part on measurement data of the A predictors. 4. 
The method of claim 3, wherein the solid tumor is: (i) a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor; (ii) a pancreatic tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to pancreatic cancer; (iii) an ovarian tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to ovarian cancer; (iv) a brain tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to brain cancer; (v) a kidney tumor, and the plurality of genes comprises at least 2 genes selected from a group of genes related to kidney cancer; (vi) a pancreatic tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer; (vii) an ovarian tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer; (viii) a brain tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer; or (ix) a kidney tumor, and the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. 5. The method of claim 3 or 4, wherein the A predictors have top 5 to 200 feature importance values. 6. The method of any one of claims 3 to 5, wherein the trained machine learning model has: (i) an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about
92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (v) a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (vi) a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. 7. 
The method of any one of claims 3 to 6, wherein the first machine learning model and second machine learning model is independently trained using a linear regression, a logistic regression, a Ridge regression, a Lasso regression, an elastic net (EN) regression, a support vector machine (SVM), a gradient boosted machine (GBM), a k nearest neighbors (kNN), a generalized linear model (GLM), a naïve Bayes (NB) classifier, a neural network, a Random Forest (RF), a deep learning algorithm, a linear discriminant analysis (LDA), a decision tree learning (DTREE), an adaptive boosting (ADB), or any combination thereof. 8. A method for assessing a solid tumor of a patient, the method comprising: a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of claim 1, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; b) providing the data set as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; c) receiving, as an output of the machine-learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and d) electronically outputting a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor.
9. The method of claim 8, wherein the solid tumor is a pancreatic tumor, an ovarian tumor, a kidney tumor, or a brain tumor. 10. The method of claim 8, wherein the solid tumor is a pancreatic tumor, and the at least 2 genes are selected from the gene set of claim 1 or 2, wherein the gene set is capable of classifying the pancreatic tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to pancreatic cancer. 11. The method of claim 8, wherein the solid tumor is an ovarian tumor, and the at least 2 genes are selected from the gene set of claim 1 or 2, wherein the gene set is capable of classifying the ovarian tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to ovarian cancer. 12. The method of claim 8, wherein the solid tumor is a brain tumor, and the at least 2 genes are selected from the gene set of claim 1 or 2, wherein the gene set is capable of classifying the brain tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to brain cancer. 13. The method of claim 8, wherein the solid tumor is a kidney tumor, and the at least 2 genes are selected from the gene set of claim 1 or 2, wherein the gene set is capable of classifying the kidney tumor as benign or malignant; and optionally wherein the one or more clinical characteristics are selected from a group of clinical characteristics related to kidney cancer. 14. The method of any one of claims 8 to 13, wherein the machine-learning model is trained according to the method of any one of claims 11 to 28. 15. 
The method of any one of claims 8 to 14, wherein: (i) the patient has cancer; (ii) the patient does not have cancer; (iii) the patient is at an elevated risk of having cancer; or (iv) the patient is asymptomatic for cancer; optionally wherein the cancer is a pancreatic cancer, an ovarian cancer, or a brain cancer. 16. The method of any one of claims 8 to 15, further comprising administering a treatment based on a solid tumor of a patient being classified as malignant. 17. The method of claim 16, wherein the treatment is surgery, chemotherapy, targeted therapy, immunotherapy, radiotherapy, or any combination thereof. 18. The method of any one of claims 8 to 17, wherein the inference includes a confidence value between 0 and 1 that the solid tumor is malignant. 19. The method of any one of claims 8 to 18, comprising: (i) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about
99%, or more than about 99%; (ii) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (v) classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 20. 
The method of any one of claims 8 to 19, wherein a machine learning model is trained and has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99. 21. A method for treating cancer in a patient having a solid tumor, the method comprising: (a) obtaining a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of claim 1 or 2, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; (b) providing the data set as an input to a trained machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; (c) receiving, as an output of the machine learning model, the inference indicating whether the data set is indicative of the malignant solid tumor or the benign solid tumor; and (d) administering a treatment based on a solid tumor of a patient being classified as a malignant tumor. 22. The method of claim 21, wherein the cancer is a pancreatic cancer, ovarian cancer, kidney cancer, or brain cancer.
23. A system for assessing a solid tumor of a patient, the system comprising: one or more processors; and one or more memories storing executable instructions that, as a result of execution by the one or more processors, cause the system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of claims 1 or 2, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a dataset as an input to a machine-learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the inference indicating whether a composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. 24. 
A non-transitory computer-readable medium storing executable instructions for assessing a solid tumor of a patient that, as a result of execution by one or more processors of a computer system, cause the computer system to: obtain, from a data base, a data set comprising i) gene expression measurements of a biological sample from the patient, of at least 2 genes selected from the gene set of claims 1 or 2, and ii) optionally clinical characteristics data of one or more clinical characteristics of the patient, wherein the biological sample is a blood sample, isolated peripheral blood mononuclear cells (PBMCs), or any derivative thereof; provide a data set as an input to a machine learning model trained to generate an inference of whether the data set is indicative of a malignant solid tumor or a benign solid tumor; receive, as an output of the machine learning model, the inference indicating whether a composite data set is indicative of the malignant solid tumor or the benign solid tumor; and generate a report classifying the solid tumor of the patient as the malignant solid tumor or the benign solid tumor. 25. A method for obtaining a gene set capable of classifying whether a patient has cancer, the method comprising: (a) providing a dataset as an input to a machine learning classifier, said dataset comprises or is derived from gene expression measurements from a plurality of reference samples, of genes listed in a plurality of gene modules, wherein the plurality of reference samples comprises a
first plurality of reference samples obtained or derived from reference subjects having cancer, and a second plurality of reference samples obtained or derived from reference subjects not having cancer; (b) performing feature selection to select a subset of gene modules from the plurality of gene modules, wherein the plurality of gene modules form features of the machine learning classifier; and (c) selecting differentially expressed genes from the genes listed in the subset of gene modules to obtain the gene set capable of classifying whether the patient has cancer. 26. The method of claim 25, wherein the machine learning classifier is a sequential grouped feature importance (SGFI) algorithm. 27. The method of claim 25 or 26, wherein the feature selection comprises starting from a featureless model, and sequentially adding the next best feature using leave-one-group-in importance (LOGI) until no further improvement in mean misclassification error (MMCE) over an improvement threshold is achieved. 28. The method of claim 27, wherein the improvement threshold is 0.00001, 0.00005, 0.0001, 0.0005, or 0.001. 29. The method of any one of claims 25 to 28, wherein the dataset is a batch corrected dataset. 30. The method of any one of claims 25 to 29, wherein the plurality of gene modules are obtained by a method comprising: providing an initial data set comprising gene expression measurement from the plurality of reference samples, of genes of an initial gene set; selecting M genes from the initial gene set, wherein said M genes are M variably expressed genes of the initial data set, and wherein M is an integer; and clustering the M genes to obtain the plurality of gene modules. 31. The method of claim 30, wherein the M genes are clustered based on protein-protein interaction of proteins encoded by the M genes. 32. The method of claim 30 or 31, wherein the M genes are M most variably expressed genes of the initial data set. 33. 
The method of any one of claims 30 to 32, wherein M is 500 to 10000. 34. The method of any one of claims 25 to 33, further comprising analyzing a patient data set comprising or derived from gene expression measurement of at least 2 genes selected from the genes within the gene set obtained in step (c) to classify whether a patient has cancer, wherein the gene expression measurement is obtained from a biological sample obtained or derived from the patient. 35. The method of claim 34, wherein the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 110, 120, 130, 140, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, or all genes selected from the genes within the gene set obtained in step (c). 36. The method of claim 34, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). 37. The method of any one of claims 34 to 36, wherein the method: (i) classifies whether the patient has cancer with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) classifies whether the patient has cancer with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) classifies whether the patient has cancer with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) classifies whether the patient has cancer with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at 
least about 97%, at least about 98%, at least about 99%, or more than about 99%; or (v) classifies whether the patient has cancer with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%. 38. The method of any one of claims 34 to 37, wherein analyzing the patient data set comprises providing the patient data set as an input to a machine-learning model trained to generate an inference indicative of whether the patient has cancer based on the patient data set. 39. The method of claim 38, further comprising: a) receiving as an output of the machine-learning model the inference; and b) electronically outputting a report classifying whether the patient has cancer based on the inference. 40. The method of claim 38 or 39, wherein the machine-learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree
learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. 41. The method of any one of claims 38 to 40, wherein the machine-learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. 42. The method of any one of claims 34 to 41, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. 43. The method of any one of claims 34 to 42, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. 44. The method of claim 43, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof. 45. 
The method of any one of claims 25 to 44, wherein the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 46. The method of any one of claims 25 to 44, wherein the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS- related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. 47. A method for classifying whether a patient has cancer, the method comprising: providing a patient data set comprising or derived from gene expression measurements of at least 2 genes selected from genes within the gene set obtained in step (c) of any one of claims 25- 33 as an input to a machine learning model trained to generate an inference of whether the patient data set is indicative of the patient having cancer; receiving, as an output of the machine learning model the inference; and electronically outputting a report classifying whether the patient has cancer based on the inference,
Attorney Docket No.225234-718601/PCT wherein the gene expression measurements are obtained from a biological sample obtained or derived from the patient. 48. The method of claim 47, wherein the patient data set comprises or is derived from gene expression measurements of at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200 or all genes selected from the genes within the gene set obtained in step (c) of any one of claims 25-33. 49. The method of claim 47, wherein the patient data set comprises or is derived from gene expression measurements of the genes of the gene set obtained in step (c). 50. The method of any one of claims 47 to 49, wherein the patient data set is derived from the gene expression measurements using gene set variation analysis (GSVA), gene set enrichment analysis (GSEA), enrichment algorithm, multiscale embedded gene co-expression network analysis (MEGENA), weighted gene co-expression network analysis (WGCNA), differential expression analysis, Z-score, log2 expression analysis, or any combination thereof. 51. 
The method of any one of claims 47 to 49, wherein the patient data set is derived from the gene expression measurements using GSVA. 52. The method of any one of claims 47 to 51, wherein the machine learning model is trained using linear regression, logistic regression (LOG), Ridge regression, Lasso regression, elastic net (EN) regression, support vector machine (SVM), gradient boosted machine (GBM), k nearest neighbors (kNN), generalized linear model (GLM), naïve Bayes (NB) classifier, neural network, Random Forest (RF), deep learning algorithm, linear discriminant analysis (LDA), decision tree learning (DTREE), adaptive boosting (ADB), Classification and Regression Tree (CART), hierarchical clustering, or any combination thereof. 53. The method of any one of claims 47 to 52, wherein the method classifies whether the patient has cancer: (i) with an accuracy of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (ii) with a sensitivity of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iii) with specificity of at least about 80%, at least about 85%, at least about 90%, at least about 91%,
Attorney Docket No.225234-718601/PCT at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (iv) with a positive predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%; (v) with a negative predictive value of at least about 80%, at least about 85%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more than about 99%.. 54. The method of any one of claims 47 to 53, wherein the machine learning model has a receiver operating characteristic (ROC) curve with an Area-Under-Curve (AUC) of at least about 0.90, at least about 0.91, at least about 0.92, at least about 0.93, at least about 0.94, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, at least about 0.99, or more than about 0.99, for classifying whether the patient has cancer. 55. The method of any one of claims 47 to 54, wherein the biological sample comprises a blood sample, isolated peripheral blood mononuclear cells (PBMCs), a tissue biopsy sample, or any derivative thereof. 56. 
The method of any one of claims 47 to 55, wherein the cancer is a solid cancer; optionally wherein the solid cancer is adrenal cancer, anal cancer, bile duct cancer, bladder cancer, bone cancer, brain cancer, breast cancer, carcinoid cancer, cervical cancer, colorectal cancer, esophageal cancer, eye cancer, gallbladder cancer, gastrointestinal stromal tumor, germ cell cancer, head and neck cancer, kidney cancer, liver cancer, lung cancer, nasal cavity and paranasal sinus cancer, nasopharyngeal cancer, neuroblastoma, neuroendocrine cancer, oral cancer, oropharyngeal cancer, ovarian cancer, pancreatic cancer, pediatric cancer, penile cancer, pituitary cancer, prostate cancer, skin cancer, soft tissue cancer, spinal cord cancer, stomach cancer, testicular cancer, thymus cancer, thyroid cancer, ureteral cancer, uterine cancer, vaginal cancer, metastatic renal cell carcinoma, melanoma, carcinoma, a sarcoma, or vulvar cancer. 57. The method of any one of claims 47 to 55, wherein the cancer is a blood cancer; optionally wherein the blood cancer is leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, an AIDS- related lymphoma, multiple myeloma, plasmacytoma, post-transplantation lymphoproliferative disorder, or Waldenstrom macroglobulinemia. 58. The method of any one of claims 47 to 57, further comprising selecting, recommending and/or administering a treatment to the patient, when the method classifies that the patient has cancer. 59. The method of claim 58, wherein the treatment comprises chemotherapy, radiation, surgery and/or any combination thereof.
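Illustrative note (not part of the claims): the performance measures recited in claims 53 and 54 can be computed from a set of labeled predictions roughly as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function names `confusion_metrics` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, PPV and NPV for binary labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos: List[float] = [s for t, s in zip(y_true, scores) if t == 1]
    neg: List[float] = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = cancer, 0 = no cancer; scores are model outputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in model_scores]  # assumed 0.5 threshold

print(confusion_metrics(labels, predictions))
print(roc_auc(labels, model_scores))
```

The pairwise AUC used here is equivalent to the area under the empirical ROC curve but scales quadratically with sample count, so a rank-based routine from a statistics library would be the usual choice for cohort-sized data.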
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202363539073P | 2023-09-18 | 2023-09-18 | |
| US63/539,073 | 2023-09-18 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025064547A1 (en) | 2025-03-27 |
Family
ID=93011056
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2024/047289 Pending WO2025064547A1 (en) | 2023-09-18 | 2024-09-18 | Machine learning classification of solid tumors based on gene expression |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025064547A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020102043A1 (en) * | 2018-11-15 | 2020-05-22 | Ampel Biosolutions, Llc | Machine learning disease prediction and treatment prioritization |
| US20220328134A1 (en) * | 2021-03-31 | 2022-10-13 | PrognomIQ, Inc. | Multi-omic assessment |
| WO2023150883A1 (en) * | 2022-02-11 | 2023-08-17 | The Hospital For Sick Children | System and method for classifying cancer and classifying benign and malignant neoplasm |
Non-Patent Citations (2)
| Title |
|---|
| AU QUAY ET AL: "Grouped feature importance and combined features effect plot", JOURNAL OF DATA MINING AND KNOWLEDGE DISCOVERY, NORWELL, MA, US, vol. 36, no. 4, 18 June 2022 (2022-06-18), pages 1401 - 1450, XP037910766, ISSN: 1384-5810, [retrieved on 20220618], DOI: 10.1007/S10618-022-00840-5 * |
| MCKUSICK-NATHANS: "National Center for Biotechnology Information gene database", JOHNS HOPKINS UNIVERSITY SCHOOL OF MEDICINE |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| US10325673B2 (en) | Deep transcriptomic markers of human biological aging and methods of determining a biological aging clock | |
| Shi et al. | Semi-supervised learning improves gene expression-based prediction of cancer recurrence | |
| US10665326B2 (en) | Deep proteome markers of human biological aging and methods of determining a biological aging clock | |
| JP7497084B2 (en) | Systems and methods for predicting efficacy of cancer treatments | |
| US20200286625A1 (en) | Biological data signatures of aging and methods of determining a biological aging clock | |
| US20190030078A1 (en) | Multi-stage personalized longevity therapeutics | |
| EP3970150A1 (en) | Deep proteome markers of human biological aging and methods of determining a biological aging clock | |
| CN110305965A (en) | A method of sensibility of prediction non-small cell lung cancer (NSCLC) patient to immunotherapy | |
| JP2022511243A (en) | Transcription factor profiling | |
| AU2021227229A1 (en) | Methods of analyzing cell free nucleic acids and applications thereof | |
| Rawat et al. | Cancer malignancy prediction using machine learning: a cross-dataset comparative study | |
| Andreini et al. | MicroRNA signature for interpretable breast cancer classification with subtype clue | |
| Patil et al. | Role of artificial intelligence in cancer detection using protein p53: A Review | |
| Singireddy et al. | Identifying differentially expressed transcripts associated with prostate cancer progression using RNA-Seq and machine learning techniques | |
| EP4605937A1 (en) | Method of determining loss of heterozygosity status of a tumor | |
| WO2025064547A1 (en) | Machine learning classification of solid tumors based on gene expression | |
| US20250069696A1 (en) | Method and apparatus for detecting minimal residual disease using tumor information | |
| WO2024031097A2 (en) | Systems and methods for cancer screening | |
| WO2019190732A1 (en) | Apparatus and method for identification of primary immune resistance in cancer patients | |
| EP4244374A1 (en) | Cancer diagnosis and classification by non-human metagenomic pathway analysis | |
| EP4272224A1 (en) | Machine learning classification of lung nodules based on gene expression | |
| WO2021041968A1 (en) | Systems and methods for predicting and monitoring treatment response from cell-free nucleic acids | |
| WO2024259316A2 (en) | Tumor identification and classification using fragmentomic features | |
| WO2025080809A1 (en) | Disease classification using fragment images | |
| Padron-Manrique et al. | Domain-Adversarial Neural Network and Explainable AI for Reducing Tissue-of-Origin Signal in Pan-cancer Mortality Classification |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24786276; Country of ref document: EP; Kind code of ref document: A1 |