WO2025114824A1 - Fusion genes associated with hepatocellular carcinoma and related methods

Info

Publication number: WO2025114824A1
Application number: PCT/IB2024/061672
Authority: WO
Inventors: Jianhua Luo; David A. Geller; Shuchang LIU; Yanping Yu
Original assignee: University of Pittsburgh
Current assignee: University of Pittsburgh
Priority date: 2023-11-28
Filing date: 2024-11-21
Publication date: 2025-06-05
Anticipated expiration: 2026-05-28

Abstract

Provided herein is a computer-implemented method including receiving, with at least one processor, training data, the training data comprising genomic data and diagnostic data from a plurality of patients; determining, with at least one processor and based at least on the genomic data and the diagnostic data, an association between the genomic data and the diagnostic data; and training, with at least one processor and based on the association, a machine learning model, the machine learning model configured to generate an output, the output comprising a prediction of a patient diagnosis.

Description

FUSION GENES ASSOCIATED WITH HEPATOCELLULAR CARCINOMA AND RELATED METHODS

CROSS REFERENCE TO RELATED APPLICATION

[0001] This application claims priority to United States Provisional Patent Application No. 63/603,240 filed November 28, 2023, which is hereby incorporated by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

[0002] This invention was made with government support under CA229262 awarded by the National Institutes of Health. The government has certain rights in the invention.

REFERENCE TO A “SEQUENCE LISTING”

[0003] The Sequence Listing associated with this application is filed in electronic format via Patent Center and is hereby incorporated by reference into the specification in its entirety. The name of the file containing the Sequence Listing is 2405895.xml. The size of the file is 38,710 bytes, and the file was created on November 15, 2024.

BACKGROUND OF THE INVENTION

Field of the Invention

[0004] Provided herein are systems and methods for early and accurate diagnosis of biological conditions, including hepatocellular carcinoma.

Description of Related Art

[0005] Liver cancer is one of the most fatal malignancies in humans. Worldwide, liver cancer causes >830,000 deaths per year. Hepatocellular carcinoma (HCC) is the most common type of liver cancer, accounting for 90% of all liver cancers. In the early stages, liver cancer can be cured through surgical and radiologic intervention. Unfortunately, most liver cancers are insidious and asymptomatic. As a result, over 60% of patients with HCC present with late-stage disease. The options for treating late-stage HCC are limited. Advanced-stage liver cancer is associated with a high mortality rate. The mean survival time in patients with late-stage HCC is 6 to 20 months. The 2-year survival rate among patients with late-stage HCC is <50%, while the 5-year survival rate is only 10%. Thus, the identification of early HCC is crucial in reducing the risk for liver cancer-related death.

[0006] Currently, serum α-fetoprotein (AFP) screening is the most commonly used tool to screen for HCC. However, the accuracy of AFP appears to vary by study, and the cutoff of AFP in the prediction of HCC has ranged widely. The AFP screening test for HCC tumors measuring <5 cm in diameter has a range of sensitivity of 49% to 71% and a specificity of 49% to 86%. Thus, a more sensitive tool for early HCC detection is needed.

SUMMARY OF THE INVENTION

[0007] Provided herein is a computer-implemented method including receiving, with at least one processor, training data, the training data including genomic data and diagnostic data from a plurality of patients, wherein the genomic data comprises fusion gene data including data relating to presence of one or more of CCNH-C5orf30, SLC45A2-AMACR, PTEN-NOLC1, ZMPSTE24-ZMYM4, PTEN-NOLC1, STAMBPL1-FAS, and PCMTD1-SNTG1 in a sample obtained from the plurality of patients; determining, with at least one processor and based at least on the genomic data and the diagnostic data, an association between the genomic data and the diagnostic data; and training, with at least one processor and based on the association, a machine learning model, the machine learning model configured to generate an output, the output including a prediction of a patient diagnosis.

[0008] Also provided herein is a computer-implemented method, including receiving, with at least one processor, training data, the training data including fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based on the association, a logistic regression machine learning model, the machine learning model configured to generate an output including a prediction of whether a patient has hepatocellular carcinoma.

[0009] Also provided herein is a computer-implemented method of determining that a patient has hepatocellular carcinoma, including applying a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by receiving, with at least one processor, training data, the training data including fusion gene data relating to presence of ZMPSTE24-ZMYM4 and CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output including a prediction of whether the patient has hepatocellular carcinoma, and generating, with at least one processor and based at least in part on application of the trained logistic regression machine learning model to the patient serum AFP data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

[0010] Also provided herein is a method of treating a patient having hepatocellular carcinoma, including diagnosing a patient with hepatocellular carcinoma, wherein the diagnosis is based at least in part on application of a trained logistic regression machine learning model to serum fusion gene data and serum protein data obtained from the patient, the machine learning model trained by receiving, with at least one processor, training data, the training data including fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output including a prediction of whether the patient has hepatocellular carcinoma; and treating the patient, wherein treating the patient comprises at least one of gene therapy, immunotherapy, irradiation of cancerous tissue, administration of one or more chemotherapeutic agents, and/or surgical resection of cancerous tissue.

[0011] Also provided herein is a system including at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to receive training data, the training data including genomic data and diagnostic data from a plurality of patients; determine an association between the genomic data and the diagnostic data; and train, based at least on the association, a machine learning model, the machine learning model configured to generate an output including a prediction of a patient diagnosis.

[0012] Also provided herein is a system including at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by receiving, with at least one processor, training data including fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output including a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

[0013] Also provided herein is a computer-readable, non-transitory medium having stored thereon programming instructions that when executed by a processor cause the processor to: apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by receiving, with at least one processor, training data including fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output including a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.
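As a minimal illustration of the train-then-apply flow described in the preceding paragraphs, the sketch below fits a logistic regression by plain batch gradient descent on binary fusion gene and AFP features and then scores a new patient. The toy data, the binary feature encoding, the learning rate, the epoch count, and every function and variable name are assumptions made for illustration only; none of them is taken from this application, and the sketch is not the applicants' implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    n_samples = len(rows)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # derivative of the log-loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_proba(model, x):
    """Return the modeled probability that a patient has HCC."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical toy cohort (NOT data from this application).
# Columns: [ZMPSTE24-ZMYM4 detected, CCNH-C5orf30 detected, AFP above cutoff]
TRAIN_X = [
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0],
    [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0],
]
# Hypothetical diagnoses: 1 = HCC, 0 = no HCC.
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0, 0]

MODEL = train_logistic_regression(TRAIN_X, TRAIN_Y)

if __name__ == "__main__":
    new_patient = [1, 1, 1]  # hypothetical patient positive for all features
    print(f"Predicted probability of HCC: {predict_proba(MODEL, new_patient):.3f}")
```

In practice a maintained library implementation of logistic regression would normally replace the hand-written optimizer; the explicit loop is used here only so the sketch has no dependencies.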

[0014] The following numbered clauses outline various aspects or embodiments of the present invention:

[0015] 1. A computer-implemented method comprising: receiving, with at least one processor, training data, the training data comprising genomic data and diagnostic data from a plurality of patients, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or more of CCNH-C5orf30, SLC45A2-AMACR, PTEN-NOLC1, ZMPSTE24-ZMYM4, PTEN-NOLC1, STAMBPL1-FAS, and PCMTD1-SNTG1 in a sample obtained from the plurality of patients; determining, with at least one processor and based at least on the genomic data and the diagnostic data, an association between the genomic data and the diagnostic data; and training, with at least one processor and based on the association, a machine learning model, the machine learning model configured to generate an output, the output comprising a prediction of a patient diagnosis.

[0016] 2. The computer-implemented method of clause 1, wherein the machine learning model is a support vector machine.

[0017] 3. The computer-implemented method of clause 1 or clause 2, wherein the machine learning model is a random forest.

[0018] 4. The computer-implemented method of any of clauses 1-3, wherein the machine learning model is a linear discriminant analysis.

[0019] 5. The computer-implemented method of any of clauses 1-4, wherein the machine learning model is a logistic regression.

[0020] 6. The computer-implemented method of any of clauses 1-5, wherein the genomic data comprises fusion gene data.

[0021] 7. The computer-implemented method of any of clauses 1-6, wherein the fusion gene data is selected based on a cycle threshold of less than or equal to 45.

[0022] 8. The computer-implemented method of any of clauses 1-7, wherein the cycle threshold is obtained using a real-time quantitative polymerase chain reaction (RT-PCR) assay.

[0023] 9. The computer-implemented method of any of clauses 1-8, wherein the RT-PCR assay utilizes at least one primer, wherein the at least one primer is at least one of SEQ ID NOS: 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, and 29.

[0024] 10. The computer-implemented method of any of clauses 1-9, wherein the training data further comprises serum protein data from the plurality of patients, and wherein: the determining step comprises determining, with at least one processor and based at least on the genomic data and the serum protein data, an association between the genomic data, the serum protein data, and the diagnostic data.

[0025] 11. The computer-implemented method of any of clauses 1-10, wherein the serum protein data comprises serum α-fetoprotein (AFP) data.

[0026] 12. The computer-implemented method of any of clauses 1-11, wherein the serum AFP data comprises a comparison of an amount of AFP in a sample obtained from a patient to a threshold value.

[0027] 13. The computer-implemented method of any of clauses 1-12, wherein the threshold value is 200 ng/mL.

[0028] 14. The computer-implemented method of any of clauses 1-13, wherein the threshold value is 400 ng/mL.

[0029] 15. The computer-implemented method of any of clauses 1-14, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or both of CCNH-C5orf30 and ZMPSTE24-ZMYM4 in the sample obtained from the plurality of patients.

[0030] 16. A computer-implemented method comprising: receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based on the association, a logistic regression machine learning model, the machine learning model configured to generate an output comprising a prediction of whether a patient has hepatocellular carcinoma.

[0031] 17. A computer-implemented method of determining that a patient has hepatocellular carcinoma comprising: applying a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output comprising a prediction of whether the patient has hepatocellular carcinoma, and generating, with at least one processor and based at least in part on application of the trained logistic regression machine learning model to the patient serum AFP data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

[0032] 18. The computer-implemented method of clause 17, wherein the patient is suspected of having hepatocellular carcinoma.

[0033] 19. The computer-implemented method of clause 17 or clause 18, wherein the patient has previously been diagnosed with hepatocellular carcinoma and has previously undergone treatment for hepatocellular carcinoma.

[0034] 20. The computer-implemented method of any of clauses 17-19, further comprising receiving, with at least one processor, imaging data relating to the patient’s liver.

[0035] 21. The computer-implemented method of any of clauses 17-20, further comprising generating, based on the patient serum protein data, the patient serum fusion gene data, and the imaging data, a treatment plan for the patient.

[0036] 22. A method of treating a patient having hepatocellular carcinoma, comprising: diagnosing a patient with hepatocellular carcinoma, wherein the diagnosis is based at least in part on application of a trained logistic regression machine learning model to serum fusion gene data and serum protein data obtained from the patient, the machine learning model trained by receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output comprising a prediction of whether the patient has hepatocellular carcinoma; and treating the patient, wherein treating the patient comprises at least one of gene therapy, immunotherapy, irradiation of cancerous tissue, administration of one or more chemotherapeutic agents, and/or surgical resection of cancerous tissue.

[0037] 23. A system comprising at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to: receive training data, the training data comprising genomic data and diagnostic data from a plurality of patients; determine an association between the genomic data and the diagnostic data; and train, based at least on the association, a machine learning model, the machine learning model configured to generate an output comprising a prediction of a patient diagnosis.

[0038] 24. The system of clause 23, wherein the machine learning model is a support vector machine.

[0039] 25. The system of clause 23 or clause 24, wherein the machine learning model is a random forest.

[0040] 26. The system of any of clauses 23-25, wherein the machine learning model is a linear discriminant analysis.

[0041] 27. The system of any of clauses 23-26, wherein the machine learning model is a logistic regression.

[0042] 28. The system of any of clauses 23-27, wherein the genomic data comprises fusion gene data.

[0043] 29. The system of any of clauses 23-28, wherein the fusion gene data is selected based on a cycle threshold of less than or equal to 45.

[0044] 30. The system of any of clauses 23-29, wherein the training data further comprises serum protein data from the plurality of patients, and wherein: the determining step comprises determining, with at least one processor and based at least on the genomic data and the serum protein data, an association between the genomic data, the serum protein data, and the diagnostic data.

[0045] 31. The system of any of clauses 23-30, wherein the serum protein data comprises serum AFP data.

[0046] 32. The system of any of clauses 23-31, wherein the serum AFP data comprises a comparison of an amount of AFP in a sample obtained from a patient to a threshold value.

[0047] 33. The system of any of clauses 23-32, wherein the threshold value is 200 ng/mL.

[0048] 34. The system of any of clauses 23-33, wherein the threshold value is 400 ng/mL.

[0049] 35. The system of any of clauses 23-34, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or more of MAN2A1-FER, CCNH-C5orf30, SLC45A2-AMACR, PTEN-NOLC1, ZMPSTE24-ZMYM4, PTEN-NOLC1, STAMBPL1-FAS, and PCMTD1-SNTG1 in a sample obtained from a patient.

[0050] 36. A system comprising at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to: apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output comprising a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

[0051] 37. The system of clause 36, wherein the patient serum protein data comprises a comparison of an amount of AFP in a sample obtained from the patient to a threshold value.

[0052] 38. The system of clause 36 or clause 37, wherein the threshold value is 200 ng/mL.

[0053] 39. The system of any of clauses 36-38, wherein the threshold value is 400 ng/mL.

[0054] 40. The system of any of clauses 36-39, wherein the patient genomic data comprises fusion gene data comprising data relating to presence of one or more of CCNH-C5orf30 and/or ZMPSTE24-ZMYM4 in a sample obtained from the patient.

[0055] 41. A computer-readable, non-transitory medium comprising programming instructions that when executed by a processor cause the processor to: apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output comprising a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.
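The clauses above recite a cycle threshold of less than or equal to 45 for selecting fusion gene data and AFP threshold values of 200 ng/mL or 400 ng/mL. One way such raw measurements might be turned into binary model inputs is sketched below. The function names, the use of None to mean "no amplification", and the choice of "greater than or equal to" for the AFP comparison are assumptions for illustration; the clauses do not specify the direction or inclusiveness of the AFP comparison.

```python
from typing import List, Mapping, Optional, Sequence

CT_CUTOFF = 45.0           # cycle threshold cutoff recited in the clauses
AFP_CUTOFF_NG_ML = 200.0   # one recited AFP threshold; 400.0 is the other

# Fusion genes used by the two-gene logistic regression clauses.
DEFAULT_FUSIONS = ("ZMPSTE24-ZMYM4", "CCNH-C5orf30")


def fusion_detected(ct: Optional[float], cutoff: float = CT_CUTOFF) -> int:
    """Return 1 if a Ct value was recorded and is at or below the cutoff.

    Assumption: None means the assay produced no amplification signal.
    """
    return int(ct is not None and ct <= cutoff)


def afp_elevated(afp_ng_ml: float, cutoff: float = AFP_CUTOFF_NG_ML) -> int:
    """Return 1 if serum AFP meets the cutoff (assumed comparison: >=)."""
    return int(afp_ng_ml >= cutoff)


def encode_patient(
    ct_by_fusion: Mapping[str, Optional[float]],
    afp_ng_ml: float,
    fusions: Sequence[str] = DEFAULT_FUSIONS,
    afp_cutoff: float = AFP_CUTOFF_NG_ML,
) -> List[int]:
    """Build a feature vector: one flag per fusion gene, then the AFP flag."""
    features = [fusion_detected(ct_by_fusion.get(name)) for name in fusions]
    features.append(afp_elevated(afp_ng_ml, afp_cutoff))
    return features


if __name__ == "__main__":
    # Hypothetical measurements, not data from this application.
    example = {"ZMPSTE24-ZMYM4": 38.2, "CCNH-C5orf30": None}
    print(encode_patient(example, afp_ng_ml=350.0))
    print(encode_patient(example, afp_ng_ml=350.0, afp_cutoff=400.0))
```

Keeping the cutoffs as parameters makes it straightforward to compare the 200 ng/mL and 400 ng/mL variants on the same cohort.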

BRIEF DESCRIPTION OF THE DRAWINGS

[0056] FIG. 1 is a flowchart of a non-limiting embodiment of a method as described herein.

[0057] FIG. 2 is a schematic diagram of example components of one or more devices useful in non-limiting embodiments of systems and methods according to non-limiting embodiments described herein.

[0058] FIGS. 3A-3B show distribution of fusion genes in serum samples.

[0059] FIG. 4 shows a serum fusion gene model to predict the occurrence of HCC. Receiver operating characteristic curve of the four fusion genes logistic regression model (MAN2A1-FER 40, CCNH-C5orf30 38, SLC45A2-AMACR 41, and PTEN-NOLC1 40) in the training cohort (samples collected before 2016; left), the testing cohort (samples collected after 2015; middle), and the combined cohorts (right). AUC, area under the receiver operating characteristic curve; LOOCV, leave-one-out cross-validation.
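For readers unfamiliar with the AUC metric named in this caption, the following sketch computes it directly from its pairwise definition: the fraction of (HCC, non-HCC) pairs in which the HCC sample receives the higher score, with ties counted as one half. The function name and the example scores are hypothetical and are not results from this application.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: model outputs (higher means more likely positive)
    labels: 1 for positive (e.g., HCC), 0 for negative
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
    print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The quadratic loop is adequate for cohort sizes typical of a clinical study; a rank-based formula gives the same value more efficiently on large datasets.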

[0060] FIG. 5 shows a serum fusion gene model to predict the occurrence of HCC. Kaplan-Meier analysis of the four fusion genes logistic regression model (MAN2A1-FER 40, CCNH-C5orf30 38, SLC45A2-AMACR 41, and PTEN-NOLC1 40) in the training cohort. LOOCV, leave-one-out cross-validation.

[0061] FIG. 6 shows the impact of serum α-fetoprotein (AFP) on the survival of individuals with 400 ng/mL or 200 ng/mL cutoff. Kaplan-Meier analysis of serum AFP cutoffs of 400 ng/mL in the training cohort (left), the testing cohort (middle), and the combined cohorts (right).

[0062] FIG. 7 shows the impact of serum α-fetoprotein (AFP) on the survival of individuals with 400 ng/mL or 200 ng/mL cutoff. Kaplan-Meier analysis of serum AFP cutoffs of 200 ng/mL in the training cohort (left), the testing cohort (middle), and the combined cohorts (right).

[0063] FIG. 8 shows a serum fusion genes and serum α-fetoprotein (AFP) integration model to predict the occurrence of HCC. Receiver operating characteristic curve of the two fusion genes (MAN2A1-FER 40, CCNH-C5orf30 38) + AFP model in the training cohort (samples collected before 2016; left), the testing cohort (samples collected after 2015; middle), and the combined cohorts (right). AUC, area under the receiver operating characteristic curve; LOOCV, leave-one-out cross-validation.

[0064] FIG. 9 shows a serum fusion genes and serum α-fetoprotein (AFP) integration model to predict the occurrence of HCC. Kaplan-Meier analysis of the two fusion genes (MAN2A1-FER 40, CCNH-C5orf30 38) + AFP model in the training cohort (samples collected before 2016; left), the testing cohort (samples collected after 2015; middle), and the combined cohorts (right). LOOCV, leave-one-out cross-validation.

[0065] FIGS. 10A-10I show the treatment impact on serum fusion transcript levels. ΔCT was calculated as the difference between the CT of an RT-qPCR of a fusion transcript and the maximal RT-qPCR cycle (50) of a test. Cases with both pretreatment and post-treatment samples negative for the fusion transcripts were not included in the analysis. The individual cases are indicated in each plot.
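The ΔCT calculation and the exclusion rule described for FIGS. 10A-10I can be sketched as follows. The sign convention (maximal cycle minus CT, so that an undetected transcript gives 0 and a more abundant transcript gives a larger value) is an assumption, since the text states only that ΔCT is the difference between the two quantities. Treating None as "no amplification" and all names are likewise hypothetical.

```python
MAX_CYCLE = 50.0  # maximal RT-qPCR cycle stated in the figure description


def delta_ct(ct, max_cycle=MAX_CYCLE):
    """Return max_cycle - ct (assumed sign convention).

    Assumption: None means no amplification and is treated as ct == max_cycle,
    so a negative sample yields a delta of 0.
    """
    if ct is None:
        ct = max_cycle
    return max_cycle - ct


def paired_delta_ct(cases):
    """Compute (pretreatment, post-treatment) deltas per case.

    cases: dict mapping a case identifier to a (pre_ct, post_ct) tuple.
    Cases negative at both time points are excluded, as the text describes.
    """
    result = {}
    for case_id, (pre_ct, post_ct) in cases.items():
        pre, post = delta_ct(pre_ct), delta_ct(post_ct)
        if pre == 0 and post == 0:
            continue  # negative before and after treatment: not analyzed
        result[case_id] = (pre, post)
    return result


if __name__ == "__main__":
    # Hypothetical CT values, not data from this application.
    cohort = {"case_1": (38.0, None), "case_2": (None, None), "case_3": (41.0, 44.0)}
    print(paired_delta_ct(cohort))
```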

DESCRIPTION OF THE INVENTION

[0066] Other than in the operating examples, or where otherwise indicated, all numbers expressing quantities of ingredients, reaction conditions, and so forth used in the specification and claims are to be understood as being modified in all instances by the term "about". Accordingly, unless indicated to the contrary, the numerical parameters set forth in the following specification and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by the present invention. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of the claims, each numerical parameter should at least be construed in light of the number of reported significant digits and by applying ordinary rounding techniques.

[0067] Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical values, however, inherently contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements. Furthermore, when numerical ranges of varying scope are set forth herein, it is contemplated that any combination of these values inclusive of the recited values may be used.

[0068] Also, it should be understood that any numerical range recited herein is intended to include all sub-ranges subsumed therein. For example, a range of "1 to 10" is intended to include all sub-ranges between and including the recited minimum value of 1 and the recited maximum value of 10, that is, having a minimum value equal to or greater than 1 and a maximum value of equal to or less than 10.

[0069] As used herein “a” and “an” refer to one or more.

[0070] As used herein, the term “comprising” is open-ended and may be synonymous with “including”, “containing”, or “characterized by”.

[0071] As used herein, the term “patient” or “subject” refers to members of the animal kingdom including but not limited to human beings, and “mammal” refers to all mammals, including, but not limited to human beings.

[0072] For purposes of the description hereinafter, the terms “end,” “upper,” “lower,” “right,” “left,” “vertical,” “horizontal,” “top,” “bottom,” “lateral,” “longitudinal,” and derivatives thereof shall relate to the embodiments as they are oriented in the drawing figures. However, it is to be understood that the present disclosure may assume various alternative variations and step sequences, except where expressly specified to the contrary. It is also to be understood that the specific devices and processes illustrated in the attached drawings, and described in the following specification, are simply exemplary and non-limiting embodiments or aspects of the disclosed subject matter. Hence, specific dimensions and other physical characteristics related to the embodiments or aspects disclosed herein are not to be considered as limiting.

[0073] Some non-limiting embodiments or aspects are described herein in connection with thresholds. As used herein, satisfying a threshold may refer to a value being greater than the threshold, more than the threshold, higher than the threshold, greater than or equal to the threshold, less than the threshold, fewer than the threshold, lower than the threshold, less than or equal to the threshold, equal to the threshold, etc.
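Because "satisfying a threshold" can mean any of several comparisons, an implementation generally has to make the intended comparison explicit. The sketch below shows one hypothetical way to do so with a small lookup of comparison operators; the function name and the mode strings are illustrative assumptions and do not come from this application.

```python
import operator

# Map each supported comparison mode to the corresponding stdlib operator.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def satisfies_threshold(value, threshold, mode):
    """Return True if `value` satisfies `threshold` under the given mode."""
    try:
        compare = COMPARATORS[mode]
    except KeyError:
        raise ValueError(f"unsupported comparison mode: {mode!r}") from None
    return compare(value, threshold)


if __name__ == "__main__":
    # A cycle threshold of exactly 45 satisfies "<= 45" but not "< 45".
    print(satisfies_threshold(45, 45, "<="), satisfies_threshold(45, 45, "<"))
```

Making the mode a required argument avoids silently choosing between inclusive and exclusive comparisons, which matters for boundary values such as a cycle threshold of exactly 45.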

[0074] No aspect, component, element, structure, act, step, function, instruction, and/or the like used herein should be construed as critical or essential unless explicitly described as such. Also, as used herein, the articles “a” and “an” are intended to include one or more items and may be used interchangeably with “one or more” and “at least one.” Furthermore, as used herein, the term “set” is intended to include one or more items (e.g., related items, unrelated items, a combination of related and unrelated items, and/or the like) and may be used interchangeably with “one or more” or “at least one.” Where only one item is intended, the term “one” or similar language is used. Also, as used herein, the terms “has,” “have,” “having,” or the like are intended to be open-ended terms. Further, the phrase “based on” is intended to mean “based at least partially on” unless explicitly stated otherwise. In addition, reference to an action being “based on” a condition may refer to the action being “in response to” the condition. For example, the phrases “based on” and “in response to” may, in some non-limiting embodiments or aspects, refer to a condition for automatically triggering an action (e.g., a specific operation of an electronic device, such as a computing device, a processor, and/or the like).

[0075] As used herein, the term “communication” may refer to the reception, receipt, transmission, transfer, provision, and/or the like of data (e.g., information, signals, messages, instructions, commands, and/or the like). For one unit (e.g., a device, a system, a component of a device or system, combinations thereof, and/or the like) to be in communication with another unit means that the one unit is able to directly or indirectly receive information from and/or transmit information to the other unit. This may refer to a direct or indirect connection (e.g., a direct communication connection, an indirect communication connection, and/or the like) that is wired and/or wireless in nature. Additionally, two units may be in communication with each other even though the information transmitted may be modified, processed, relayed, and/or routed between the first and second unit. For example, a first unit may be in communication with a second unit even though the first unit passively receives information and does not actively transmit information to the second unit. As another example, a first unit may be in communication with a second unit if at least one intermediary unit processes information received from the first unit and communicates the processed information to the second unit. In some non-limiting embodiments or aspects, a message may refer to a network packet (e.g., a data packet and/or the like) that includes data. It will be appreciated that numerous other arrangements are possible. Communication may include one or more wired and/or wireless networks. For example, communication may include a cellular network (e.g., a long-term evolution (LTE) network, a third-generation (3G) network, a fourth-generation (4G) network, a fifth-generation (5G) network, a code division multiple access (CDMA) network, etc.), a public land mobile network (PLMN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a telephone network (e.g., the public switched telephone network (PSTN) and/or the like), a private network, an ad hoc network, an intranet, the Internet, a fiber optic-based network, a cloud computing network, and/or the like, and/or a combination of some or all of these or other types of networks.

[0076] As used herein, the term “computing device” may refer to one or more electronic devices configured to process data. A computing device may, in some examples, include the necessary components to receive, process, and output data, such as a processor, a display, a memory, an input device, a network interface, and/or the like. A computing device may be a mobile device. As an example, a mobile device may include a cellular phone (e.g., a smartphone or standard cellular phone), a portable computer, a wearable device (e.g., watches, glasses, lenses, clothing, and/or the like), a personal digital assistant (PDA), and/or other like devices. A computing device may also be a desktop computer or other form of non-mobile computer.

[0077] As used herein, the term “server” may refer to or include one or more computing devices that are operated by or facilitate communication and processing for multiple parties in a network environment, such as the Internet, although it will be appreciated that communication may be facilitated over one or more public or private network environments and that various other arrangements are possible. Further, multiple computing devices (e.g., servers, mobile devices, etc.) directly or indirectly communicating in the network environment may constitute a “system.”

[0078] As used herein, the term “system” may refer to one or more computing devices or combinations of computing devices (e.g., processors, servers, client devices, software applications, components of such, and/or the like). Reference to “a device,” “a server,” “a processor,” and/or the like, as used herein, may refer to a previously-recited device, server, or processor that is recited as performing a previous step or function, a different device, server, or processor, and/or a combination of devices, servers, and/or processors. For example, as used in the specification and the claims, a first device, a first server, or a first processor that is recited as performing a first step or a first function may refer to the same or different device, server, or processor recited as performing a second step or a second function.

[0079] As used herein, “machine learning” may refer to a field of computer science that uses statistical techniques to provide a computer system with the ability to learn (e.g., to progressively improve performance of) a task with data without the computer system being explicitly programmed to perform the task. In some instances, a machine learning model may be developed for a set of data so that the machine learning model may perform a task (e.g., a task associated with a prediction) with regard to the set of data.

[0080] In some instances, a machine learning model, such as a predictive machine learning model, may be used to make a prediction regarding a risk or an opportunity based on a large amount of data (e.g., a large-scale dataset). A predictive machine learning model may be used to analyze a relationship between the performance of a unit based on a large-scale dataset associated with the unit and one or more known features of the unit. The objective of the predictive machine learning model may be to assess the likelihood that a similar unit will exhibit the same or similar performance as the unit. In order to generate the predictive machine learning model, the large-scale dataset may be segmented so that the predictive machine learning model may be trained on data that is appropriate.

[0081] In general, machine learning builds new and/or leverages existing algorithms to learn from data, in order to build generalizable models that give accurate predictions, or to find patterns, particularly with new and unseen similar data. Machine learning takes training data sets and, through use of, for example, statistical and mathematical modeling algorithms, is able to provide an output that either classifies data to provide a specified output or clusters data for discovering the composition and structure of the data. In the context of the present disclosure, the machine learning component may conduct classification predictive modeling or regression predictive modeling to produce output, such as survival at a given time point in the future, risk of metastasis, expected lifespan, such as a Kaplan-Meier survival curve or estimation, or a classification of a group of patients into cohorts. Machine learning may include deep learning to progressively extract higher level features from the raw input.

[0082] Machine learning components or modules comprise one or more algorithms, as are broadly known, which are selected and assembled into the larger machine learning module based upon a variety of factors, including, for example and without limitation, the type of data to be used for training and to be evaluated after training and the desired output of the machine learning component. As part of the training process, the data set is split, and part of the data set is used for training, and another part of the data set is used for testing the machine learning component. As seen in the examples below, the choice of model, along with the choice of data, will determine the strength of the machine learning component. For example, and without limitation, regression models include simple or multiple linear regression, decision tree or forest regression, artificial neural networks, ordinal regression, Poisson regression, or nearest neighbor methods. Other potentially useful models include Random Forest (RF) and support vector machine (SVM) models. Although specific models are described in the examples below as being useful in the methods and systems described herein, other models may be used to optimize the machine learning process (see, e.g., Wang, P. et al. (2017). Machine Learning for Survival Analysis: A Survey. ACM Computing Surveys. 51. 10.1145/3214306). In one non-limiting example, the machine learning module is implemented using lifelines, an implementation of survival analysis in Python. In another example, the machine learning is implemented in scikit-learn in Python. In one example, the machine learning applies Cox regression, support vector machine (SVM), or random forest (RF) to the features, e.g., characterizing sequence variation, such as single nucleotide variations, copy-number alterations, and/or structural variations in the tumor cell sequences, such as the rates of accumulation of such features overall, or at any stage in the tumor phylogenic tree. The machine learning methods may include variants with regularization or feature selection, techniques that may increase robustness by identifying the most informative subsets of the input feature set.

[0083] Provided herein are devices, systems, and methods for training and applying machine learning models to patient data to improve diagnostic accuracy. In particular, the devices, systems, and methods disclosed herein improve the speed and accuracy of diagnostic predictions of hepatocellular carcinoma, which reduces expenditures of resources, reduces the need for invasive procedures such as biopsies, and allows for earlier interventions that can increase likelihood of survival. While predictions of diagnoses are exemplified below, those of skill will appreciate that the methods, devices, and systems disclosed herein may be useful for other purposes, including to determine whether a liver biopsy is necessary in patients, to evaluate cancer of the liver in patients with normal biomarker levels (such as AFP levels), to assess the efficacy of treatment(s) and the presence of residual cancer in patients who have undergone treatment, and as a routine screening test in high-risk individuals before any expensive radiology imaging is performed.

[0084] Turning to FIG. 1, methods as disclosed herein may include steps of receiving, with a device or system as described herein (for example as shown in FIG. 2, described herein below), training data (e.g., a training dataset), training a machine learning model with such training data to provide a trained machine learning model, and applying the trained machine learning model to new data. In non-limiting embodiments, the machine learning model may provide a prediction as to a diagnosis of a condition in a patient, for example a diagnosis of hepatocellular carcinoma, based on the training dataset.

[0085] In some non-limiting embodiments, the machine learning model may include a machine learning model designed to receive, as an input, biological data and provide, as an output, a diagnosis. For example, the machine learning model may be designed to receive genomic data and serum protein data, and generate an output that includes the predicted diagnosis.

[0086] As used herein, “genomic data” means data relating to one or more aspects of a patient’s genome, which may be obtained and/or derived from a sample obtained from a patient, for example a serum sample. Genomic data may include, but is not limited to, data relating to or derived from a full genome sequencing assay, an identification and/or quantification of one or more single nucleotide polymorphisms (SNPs) in the patient’s genome, data obtainable from polymerase chain reaction (PCR) assays, for example real-time PCR (RT-PCR) and real-time quantitative PCR (RT-qPCR), and/or an identification and/or quantification of fusion genes in a sample obtained from a patient. In non-limiting embodiments, the genomic data is the identification and/or quantification of the presence of one or more fusion genes in a sample obtained from a patient. As used herein, the term “fusion gene” encompasses fusion gene transcripts (e.g., mRNA of a fusion gene) and means a hybrid gene based on the joining of two or more different genes within a patient’s genome.

[0087] In non-limiting embodiments, the genomic data is the identification and/or quantification of fusion genes in a sample obtained from a patient, and is based on an assay, such as an RT-qPCR assay. In non-limiting embodiments, the presence of a fusion gene is determined based on a cycle threshold of an RT-qPCR assay. As used herein, the term “cycle threshold” means the number of reaction cycles of a PCR assay, such as an RT-qPCR assay, until an amount of fluorescent signal generated by the reaction reaches a predetermined threshold. As those of skill in the art will appreciate, a lower cycle threshold value indicates a greater amount of genomic material, for example a greater amount of fusion gene transcripts, and thus a greater amount of fusion genes present.
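
By way of a non-limiting illustration, the cycle threshold comparison described above may be sketched as follows. The function name, the sample labels, and the cutoff of 35 cycles are assumptions introduced only for this sketch; the treatment of a sample with no amplification as "not detected" is likewise an assumption rather than a requirement of the disclosure.

```python
from typing import Dict, Optional


def fusion_detected(ct_value: Optional[float], ct_cutoff: float) -> bool:
    """Return True when the signal reached the threshold at or before the cutoff cycle.

    ct_value is None when no amplification was observed during the run
    (an assumed convention for this sketch).
    """
    if ct_value is None:
        return False
    # A lower cycle threshold value indicates more starting material.
    return ct_value <= ct_cutoff


# Hypothetical readings for one sample; names and values are illustrative only.
readings: Dict[str, Optional[float]] = {
    "fusion_A": 28.4,
    "fusion_B": None,
    "fusion_C": 38.9,
}
calls = {name: fusion_detected(ct, ct_cutoff=35.0) for name, ct in readings.items()}
```

The cutoff is passed as a parameter so that any of the cycle threshold values contemplated herein may be substituted without changing the function.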

[0088] As used herein, “serum protein data” means data relating to the presence and/or levels of one or more proteins in a sample, such as a serum sample, obtained from a patient. Numerous assays are known and are commercially available for detecting and/or quantifying an amount of a protein in a sample, for example a serum sample, obtained from a patient, and include, without limitation, Bradford tests, Lowry tests, BCA tests, immunoassays (including enzyme-linked immunosorbent assays (ELISA)), Western Blots, and the like known to those of skill in the art. In non-limiting embodiments, the serum protein data is data relating to the presence and/or levels of α-fetoprotein (AFP) in a sample obtained from a patient.

[0089] In some non-limiting embodiments, a device or system as described herein may process genomic data, serum protein data, and diagnostic data obtained from patients during a time interval, which may be referred to as historical patient data, to obtain training data (e.g., a training dataset) for the machine learning model. As used herein, “diagnostic data” means data relating to a patient diagnosis. The term may include, without limitation, a binary value (e.g., the patient has or does not have a given condition, the patient is in remission or has stable disease, the patient has stable disease or metastatic disease, and/or the like), and/or a continuous value (e.g., where a disease or condition may have a plurality of stages). In non-limiting embodiments, the historical patient data includes genomic data, serum protein data, and diagnostic data from the same patient. That is, the training data set may include a plurality of values for a plurality of patients.

[0090] For example, a device or system as described herein may process the data to change the data into a format that may be analyzed to generate the machine learning model. The data that is changed (e.g., the data that results from the change) may be referred to as training data. In some non-limiting embodiments, a device or system as described herein may process the genomic data, serum protein data, and/or diagnostic data (in non-limiting embodiments, the genomic data, the serum protein data, and the diagnostic data) to obtain the training data based on receiving the data.

[0091] In some non-limiting embodiments, a device or system as described herein may process genomic data, serum protein data, and diagnostic data by determining a prediction variable based on the data. A prediction variable may include a metric, associated with genomic data, serum protein data, and diagnostic data, which may be derived based on historical patient data. The prediction variable may be analyzed to generate a machine learning model as described herein. For example, the prediction variable may include a variable associated with presence of one or more genetic markers in a sample obtained from a plurality of patients, for example one or more fusion genes, a variable associated with an amount of a protein-based biomarker in a sample obtained from the plurality of patients, for example an amount of AFP, and a variable associated with a diagnosis associated with the plurality of patients, for example a diagnosis of hepatocellular carcinoma, and/or the like.

[0092] In some non-limiting embodiments, a device or system as described herein, which may be referred to as a diagnostic prediction system, may analyze the training data to generate the machine learning model described herein, which may be referred to as a diagnostic predictive machine learning model. For example, the diagnostic prediction system may use machine learning techniques to analyze the training data to generate the diagnostic predictive machine learning model. In some non-limiting embodiments, generating the diagnostic predictive machine learning model (e.g., based on training data obtained from historical patient data) may be referred to as training the diagnostic predictive machine learning model. The machine learning techniques may include, for example, supervised and/or unsupervised techniques, such as decision trees, random forests, logistic regressions, linear regression, linear discriminant analysis, gradient boosting, support-vector machines, extra-trees (e.g., an extension of random forests), Bayesian statistics, learning automata, Hidden Markov Modeling, linear classifiers, quadratic classifiers, association rule learning, and/or the like. In some non-limiting embodiments, the diagnostic predictive machine learning model may include a model that is specific to a particular characteristic or parameter, for example, a model that is specific to a particular patient location, particular patient medical history, and/or the like.

[0093] Additionally, or alternatively, when analyzing the training data, the diagnostic prediction system may identify one or more variables (e.g., one or more independent variables) as predictor variables (e.g., features) that may be used to make a prediction when analyzing the training data. In some non-limiting embodiments, values of the predictor variables may be inputs to the diagnostic predictive machine learning model. For example, the diagnostic prediction system may identify a subset (e.g., a proper subset) of the variables as the predictor variables that may be used to accurately predict a diagnosis. In some non-limiting embodiments, the predictor variables may include one or more of the prediction variables, as discussed above, that have a significant impact (e.g., an impact satisfying a threshold) on a predicted diagnosis as determined by the diagnostic prediction system.

[0094] In some non-limiting embodiments, the diagnostic prediction system may validate the diagnostic predictive machine learning model. For example, the diagnostic prediction system may validate the diagnostic predictive machine learning model after the diagnostic prediction system generates the diagnostic predictive machine learning model. In some non-limiting embodiments, the diagnostic prediction system may validate the diagnostic predictive machine learning model based on a portion of the training data to be used for validation. For example, the diagnostic prediction system may partition the training data into a first portion and a second portion, where the first portion may be used to generate the diagnostic predictive machine learning model, as described above. In this example, the second portion of the training data (e.g., the validation data) may be used to validate the diagnostic predictive machine learning model.
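
By way of a non-limiting illustration, the partition of the training data into a first portion and a second portion may be sketched as follows. The function name, the 25% validation fraction, and the fixed random seed are assumptions introduced only for this sketch; the disclosure does not specify how the partition is made.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(
    records: Sequence[T], validation_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the records and split them into (first_portion, second_portion).

    The first portion is intended for generating the model and the second
    portion (the validation data) for validating it.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Twenty placeholder patient identifiers stand in for historical patient data.
first_portion, second_portion = partition(list(range(20)))
```

Shuffling before splitting is a design choice that avoids a partition reflecting the order in which patient records were collected.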

[0095] In some non-limiting embodiments, the diagnostic prediction system may validate the diagnostic predictive machine learning model by providing validation data associated with a patient (e.g., genomic data, serum protein data, and/or diagnostic data) as input to the diagnostic predictive machine learning model, and determining, based on an output of the diagnostic predictive machine learning model, whether the diagnostic predictive machine learning model correctly, or incorrectly, predicted a diagnosis. In some non-limiting embodiments, the diagnostic prediction system may validate the diagnostic predictive machine learning model based on a validation threshold. For example, the diagnostic prediction system may be configured to validate the diagnostic predictive machine learning model when the diagnosis (as identified by the validation data) is correctly predicted by the diagnostic predictive machine learning model (e.g., when the diagnostic predictive machine learning model correctly predicts the diagnosis XX% of the time).
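
By way of a non-limiting illustration, validation against a validation threshold may be sketched as follows. The function names and the example values are assumptions introduced only for this sketch, and the threshold is left as a caller-supplied parameter because no particular percentage is specified above. Treating "satisfies" as "greater than or equal to" is one of the several meanings contemplated herein.

```python
from typing import Sequence


def fraction_correct(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the fraction of validation cases whose predicted diagnosis matches."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


def is_validated(
    predicted: Sequence[int], actual: Sequence[int], validation_threshold: float
) -> bool:
    """Return True when the fraction of correct predictions satisfies the threshold."""
    return fraction_correct(predicted, actual) >= validation_threshold


# Hypothetical model outputs and diagnoses from the validation data (1 = HCC, 0 = not HCC).
predicted = [1, 0, 1, 1, 0]
actual = [1, 0, 0, 1, 0]
```

With these illustrative values, four of five predictions match, so the model would be validated at a threshold of 0.75 but not at a threshold of 0.9.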

[0096] In some non-limiting embodiments, if the diagnostic prediction system does not validate the diagnostic predictive machine learning model (e.g., when a percentage of correctly predicted diagnoses does not satisfy the validation threshold), then the diagnostic prediction system may generate one or more additional diagnostic predictive machine learning models.

[0097] In some non-limiting embodiments, once the diagnostic predictive machine learning model has been validated, the diagnostic prediction system may further train the diagnostic predictive machine learning model and/or generate new diagnostic predictive machine learning models based on receiving new data during application of the trained diagnostic predictive machine learning model. The new data may include additional data associated with one or more new patients, and may be considered new training data. In some non-limiting embodiments, the new training data may include new genomic data, serum protein data, and/or diagnostic data. The diagnostic prediction system may use the diagnostic predictive machine learning model to predict a diagnosis and compare an output of the diagnostic predictive machine learning model(s) to the new training data. In such an example, the diagnostic prediction system may update one or more diagnostic predictive machine learning models based on the new training data.

[0098] In some non-limiting embodiments, the diagnostic prediction system may store the diagnostic predictive machine learning model. For example, the diagnostic prediction system may store the diagnostic predictive machine learning model in a data structure (e.g., a database, a linked list, a tree, and/or the like). The data structure may be located within the diagnostic prediction system or external to (e.g., remote from) the diagnostic prediction system.

[0099] Turning to FIG. 2, any computing device or system useful for training and application of machine learning models described herein, including the diagnostic prediction system as described herein, may have one or more elements of device 200 shown in FIG. 2, which shows a diagram of example components of a device 200 according to non-limiting embodiments. Device 200 may correspond to any element of a system. In some non-limiting embodiments, such systems or devices may include at least one device 200 and/or at least one component of device 200. The number and arrangement of components shown are provided as an example. In some non-limiting embodiments, device 200 may include additional components, fewer components, different components, or differently arranged components than those shown. Additionally, or alternatively, a set of components (e.g., one or more components) of device 200 may perform one or more functions described as being performed by another set of components of device 200.

[00100] As shown in FIG. 2, device 200 may include a bus 202, a processor 204, memory 206, a storage component 208, an input component 210, an output component 212, and a communication interface 214. Bus 202 may include a component that permits communication among the components of device 200. In some non-limiting embodiments, processor 204 may be implemented in hardware, firmware, or a combination of hardware and software. For example, processor 204 may include a processor (e.g., a central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), etc.), a microprocessor, a digital signal processor (DSP), and/or any processing component (e.g., a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), etc.) that can be programmed to perform a function. Memory 206 may include random access memory (RAM), read only memory (ROM), and/or another type of dynamic or static storage device (e.g., flash memory, magnetic memory, optical memory, etc.) that stores information and/or instructions for use by processor 204.

[00101] With continued reference to FIG. 2, storage component 208 may store information and/or software related to the operation and use of device 200. For example, storage component 208 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optic disk, a solid-state disk, etc.) and/or another type of computer-readable medium. Input component 210 may include a component that permits device 200 to receive information, such as via user input (e.g., a touch screen display, a keyboard, a keypad, a mouse, a button, a switch, a microphone, etc.). Additionally, or alternatively, input component 210 may include a sensor for sensing information (e.g., a global positioning system (GPS) component, an accelerometer, a gyroscope, an actuator, etc.). Sensors useful here may include biochemical sensors, electrochemical sensors, sensors for detecting autonomic tone, sensors for detecting sympathetic tone, and/or the like. Output component 212 may include a component that provides output information from device 200 (e.g., a display, a speaker, one or more light-emitting diodes (LEDs), etc.). Communication interface 214 may include a transceiver-like component (e.g., a transceiver, a separate receiver and transmitter, etc.) that enables device 200 to communicate with other devices, such as via a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 214 may permit device 200 to receive information from another device and/or provide information to another device. For example, communication interface 214 may include an Ethernet interface, an optical interface, a coaxial interface, an infrared interface, a radio frequency (RF) interface, a universal serial bus (USB) interface, a Wi-Fi® interface, a cellular network interface, and/or the like.

[00102] Device 200 may perform one or more processes described herein. Device 200 may perform these processes based on processor 204 executing software instructions stored by a computer-readable medium, such as memory 206 and/or storage component 208. A computer-readable medium may include any non-transitory memory device. A memory device includes memory space located inside of a single physical storage device or memory space spread across multiple physical storage devices. Software instructions may be read into memory 206 and/or storage component 208 from another computer-readable medium or from another device via communication interface 214. When executed, software instructions stored in memory 206 and/or storage component 208 may cause processor 204 to perform one or more processes described herein. Additionally, or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to perform one or more processes described herein. Thus, embodiments described herein are not limited to any specific combination of hardware circuitry and software. The term “configured to,” as used herein, may refer to an arrangement of software, device(s), and/or hardware for performing and/or enabling one or more functions (e.g., actions, processes, steps of a process, and/or the like). For example, “a processor configured to” may refer to a processor that executes software instructions (e.g., program code) that cause the processor to perform one or more functions.

[00103] In non-limiting embodiments, provided herein is a computer-implemented method that includes training a machine learning model for predicting a patient diagnosis. A computer-implemented method as described herein may be carried out with a device or system as described herein, including a device or system having components shown in FIG. 2 and described above. The machine learning model may be trained with genomic data and diagnostic data received from a plurality of patients (e.g., historical patient data), and such training may be based on the determination of an association between the genomic data and the diagnostic data (e.g., determining whether the genomic data is associated with a particular diagnosis). As used herein, the term “association” is meant to encompass an association and/or a correlation between two variables. A machine learning model trained according to such a method may generate an output. In non-limiting embodiments, the generated output is a prediction of whether a patient has a given condition or disease (e.g., a diagnosis).

[00104] In non-limiting embodiments, the machine learning model that is trained may be a support vector machine, a random forest, a linear discriminant analysis, and/or a logistic regression.

[00105] As described herein, in non-limiting embodiments the genomic data may be data relating to a presence and/or amount of fusion gene transcripts in a sample obtained from a patient, which may provide an indication of the presence of a fusion gene within the patient’s genome. In non-limiting embodiments, the fusion gene is at least one of MAN2A1-FER, SLC45A2-AMACR, ZMPSTE24-ZMYM4, PTEN-NOLC1, CCNH-C5orf30, STAMBPL1-FAS, and/or PCMTD1-SNTG1. In non-limiting embodiments, the fusion gene is both ZMPSTE24-ZMYM4 and CCNH-C5orf30. In non-limiting embodiments, the presence of a fusion gene in the sample is determined based on a PCR assay, for example an RT-qPCR assay. In non-limiting embodiments, the presence of a fusion gene in the sample is based on a PCR assay and with a cycle threshold value of less than or equal to 10, 15, 20, 25, 30, 35, 40, 45, and/or 50, all values and subranges therebetween inclusive. In non-limiting embodiments, a PCR assay for detecting the presence of a fusion gene in a sample obtained from a patient may include one or more primers, for example one or more of SEQ ID NOS: 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, and 29.

[00106] In non-limiting embodiments, the historical patient data used as a training dataset for the machine learning model includes, in addition to genomic data and diagnostic data, serum protein data obtained from the same patients. In non-limiting embodiments, the serum protein data includes data relating to the presence and/or amount of AFP in a sample obtained from the patient. In non-limiting embodiments, the serum protein data is a variable based on a comparison of the amount of AFP in the sample to a predetermined threshold. In non-limiting embodiments, the threshold is about 100 ng/mL, 150 ng/mL, 200 ng/mL, 250 ng/mL, 300 ng/mL, 350 ng/mL, 400 ng/mL, 450 ng/mL, and/or 500 ng/mL, all values and subranges therebetween inclusive. In non-limiting embodiments, the threshold value is about 200 ng/mL. In non-limiting embodiments, the threshold value is about 400 ng/mL.
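
By way of a non-limiting illustration, deriving a serum protein variable from a comparison of an AFP amount to a predetermined threshold may be sketched as follows. The function name and the sample values are assumptions introduced only for this sketch. The default of 200 ng/mL is one of the threshold values recited above, and the use of a strict "greater than" comparison is an assumption.

```python
def afp_above_threshold(afp_ng_per_ml: float, threshold_ng_per_ml: float = 200.0) -> int:
    """Encode an AFP measurement as 1 when it exceeds the threshold, else 0."""
    if afp_ng_per_ml < 0:
        raise ValueError("AFP amount cannot be negative")
    return 1 if afp_ng_per_ml > threshold_ng_per_ml else 0


# Hypothetical serum AFP amounts in ng/mL; the values are illustrative only.
sample_amounts = [12.5, 350.0, 200.0]
indicators = [afp_above_threshold(amount) for amount in sample_amounts]
# The same measurement can fall on either side depending on the threshold chosen.
indicator_at_400 = afp_above_threshold(350.0, threshold_ng_per_ml=400.0)
```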

[00107] In non-limiting embodiments, provided herein is a computer-implemented method that includes training a logistic regression machine learning model for predicting a patient diagnosis. The machine learning model may be trained with genomic data relating to the presence and/or amount of the fusion genes ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of patients, serum protein data relating to the presence and/or amount of AFP in a sample obtained from the patients, and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the patients (e.g., historical patient data), and such training may be based on the determination of an association between the genomic data, the serum protein data, and the diagnostic data (e.g., determining whether the genomic data and the serum protein data are associated with a particular diagnosis). A machine learning model trained according to such a method may generate an output. In non-limiting embodiments, the generated output is a prediction of whether a patient has hepatocellular carcinoma (e.g., a diagnosis).

[00108] In non-limiting embodiments, provided herein is a computer-implemented method that includes training a logistic regression machine learning model for predicting a patient diagnosis, and applying the trained machine learning model. The machine learning model may be trained with a training dataset that comprises genomic data relating to the presence and/or amount of the fusion genes ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients, serum protein data relating to the presence and/or amount of AFP in a sample obtained from the plurality of historic patients, and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients (e.g., historical patient data), and such training may be based on the determination of an association between the genomic data, the serum protein data, and the diagnostic data (e.g., determining whether the genomic data and the serum protein data are associated with a particular diagnosis). A machine learning model trained according to such a method may generate an output. In non-limiting embodiments, the generated output is a prediction of whether a patient has hepatocellular carcinoma (e.g., a diagnosis). The trained logistic regression model may then be applied to data obtained from a patient, for example genomic data, for example data relating to the presence and/or amount of the fusion genes ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from the patient, and serum protein data, for example data relating to the presence and/or amount of AFP in a sample obtained from the patient. The application of the trained logistic regression model may generate a prediction of whether the patient has hepatocellular carcinoma (e.g., a diagnosis).
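
By way of non-limiting illustration, the train-then-apply sequence described above may be sketched in Python as follows. The sketch fits a logistic regression by plain batch gradient descent using only the standard library. The training rows, the learning rate, the number of epochs, and all function names are hypothetical choices made for the example; they are not patient data and not the fitting procedure used in the Examples, which relied on R.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit weights [bias, w1, ..., wk] by batch gradient descent on log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Predicted probability of hepatocellular carcinoma for one patient."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Hypothetical rows: [ZMPSTE24-ZMYM4 present, CCNH-C5orf30 present, AFP high]
X_train = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [0, 1, 1],
           [0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 1, 0]]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = HCC diagnosis, 0 = non-HCC

weights = train_logistic(X_train, y_train)
p_new = predict_proba(weights, [1, 1, 1])  # new patient, all markers positive
p_neg = predict_proba(weights, [0, 0, 0])  # new patient, all markers negative
```

In this sketch the association between the inputs and the diagnosis is captured by the fitted weights, and applying the model to a new patient reduces to one call to `predict_proba`.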

[00109] In non-limiting embodiments, the patient to whose data the trained machine learning model is applied may be a patient who has no history of hepatocellular carcinoma, a patient who has a history of hepatocellular carcinoma (e.g., a familial history, or a previous diagnosis), and/or a patient who is suspected of having hepatocellular carcinoma. In non-limiting embodiments, the patient has received a previous treatment for hepatocellular carcinoma.

[00110] In non-limiting embodiments, the method may include further steps. For example, following a prediction (generated by the trained machine learning model), one or more imaging techniques may be applied to the patient. For example, the patient’s liver may be imaged with positron emission tomography (PET), magnetic resonance imaging (MRI), ultrasound, and/or the like, to generate imaging data relating to a localization of a cancerous tissue within the liver. In non-limiting embodiments, the imaging data is utilized, in combination with the output of the machine learning model, to identify a stage of the diagnosis (e.g., a stage of hepatocellular carcinoma) and/or a treatment plan for the patient.

[00111] In non-limiting embodiments, provided herein is a method that includes training a logistic regression machine learning model for predicting a patient diagnosis, and applying the trained machine learning model. The machine learning model may be trained with a training dataset that comprises genomic data relating to the presence and/or amount of the fusion genes ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients, serum protein data relating to the presence and/or amount of AFP in a sample obtained from the plurality of historic patients, and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients (e.g., historical patient data), and such training may be based on the determination of an association between the genomic data, the serum protein data, and the diagnostic data (e.g., determining whether the genomic data and the serum protein data are associated with a particular diagnosis). A machine learning model trained according to such a method may generate an output. In non-limiting embodiments, the generated output is a prediction of whether a patient has hepatocellular carcinoma (e.g., a diagnosis). The trained logistic regression model may then be applied to data obtained from a patient, for example genomic data, for example data relating to the presence and/or amount of the fusion genes ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from the patient, and serum protein data, for example data relating to the presence and/or amount of AFP in a sample obtained from the patient. The application of the trained logistic regression model may generate a prediction of whether the patient has hepatocellular carcinoma (e.g., a diagnosis). The method may further include a step of treating the patient with a treatment specific to hepatocellular carcinoma. In non-limiting embodiments, the treatment may be one or more of a gene therapy (including, for example, a therapy based on CRISPR-Cas9 gene editing techniques), an immunotherapy, irradiation of cancerous tissue, administration of one or more chemotherapeutic agents (including targeted drug therapy and embolization therapy), transplantation, ablation, and/or surgical resection of cancerous tissue.

[00112] In non-limiting embodiments, the treatment that is specific to hepatocellular carcinoma is one or more of bevacizumab, cabozantinib, lenvatinib, ramucirumab, regorafenib, sorafenib, atezolizumab (with or without bevacizumab or cabozantinib), durvalumab (with or without tremelimumab), nivolumab (with or without ipilimumab), pembrolizumab, and combinations thereof.

[00113] In non-limiting embodiments, provided herein are devices and systems for training, validating, and/or applying a machine learning model as described herein. Devices and systems suitable for training, validating, and/or applying such machine learning model(s) may have one or more components shown in FIG. 2 and described above. Useful systems may include one or more devices 200 shown in FIG. 2, e.g., a device for training a machine learning model, a device (the same or different) for validating the machine learning model, and/or a device (the same or different) for applying the trained machine learning model.

Examples

[00114] In this study, nine fusion transcripts in serum samples from 136 individuals were analyzed, with serum fusion transcript detection being predictive of hepatocellular carcinoma.

Materials and Methods

Tissue Samples

[00115] The 197 serum specimens used in the study consisted of serum samples obtained from 61 patients with HCC both before and after treatment, and 75 serum samples obtained from individuals without HCC. These samples were obtained from the Pittsburgh Liver Research Center biospecimen core, University of Pittsburgh (Pittsburgh, PA), in compliance with institutional regulatory guidelines. The informed-consent exemptions and protocol were approved by the Institutional Review Board at the University of Pittsburgh. All serum samples were fresh-frozen and stored at -80 °C.

RNA Extraction, cDNA Synthesis, and Detection of Fusion Genes

[00116] The procedure was described previously. Total RNA was extracted using TRIzol (Invitrogen, Inc., Carlsbad, CA) to lyse the cancer tissues. First-strand cDNA was synthesized using approximately 5 ng of total RNA from each sample, random hexamers, and SuperScript II (Invitrogen) at 42 °C for 2 hours. One microliter of each cDNA sample was used for the TaqMan PCR reaction, with 50 heat cycles at 94 °C for 30 seconds, 61 °C for 30 seconds, and 72 °C for 30 seconds in a thermocycler (QuantStudio 3; Applied Biosystems Inc., Waltham, MA), using the following primers and probes (F, forward primer; R, reverse primer):

MAN2A1-FER: F, 5'-AGCGCAGTTTGGGATACAGCA-3' (SEQ ID NO: 1); R, 5'-CTTTAATGTGCCCTTATATACTTCACC-3' (SEQ ID NO: 2); TaqMan probe, 5'-/56-FAM/TCAGAAACA/ZEN/GCCTATGAGGGAAATT/3IABkFQ/-3' (SEQ ID NO: 3).

SLC45A2-AMACR: F, 5'-TTGATGTCTGCTCCCATCAGG-3' (SEQ ID NO: 4); R, 5'-CAGCTGGAGTTTCTCCATGAC-3' (SEQ ID NO: 5); TaqMan probe, 5'-/56-FAM/AAGAGGGCA/ZEN/TGGTAGTGGAGGC/3IABkFQ/-3' (SEQ ID NO: 6).

CCNH-C5orf30: F, 5'-AAAGTTATTTATCAGAGAGTCTGATGCTG-3' (SEQ ID NO: 7); R, 5'-CTGTTCTACTCCAGGTATTTTCATTATATC-3' (SEQ ID NO: 8); TaqMan probe, 5'-/56-FAM/ACAGGCAAG/ZEN/TTCTGTTCTCTTTCAGCA/3IABkFQ/-3' (SEQ ID NO: 9).

PCMTD1-SNTG1: F, 5'-CTGGAGAGCTTCATCAAAAATAG-3' (SEQ ID NO: 10); R, 5'-CACTTCTCGGGCAATCTCAACA-3' (SEQ ID NO: 11); TaqMan probe, 5'-/56-FAM/AGCTTTGAT/ZEN/AAACTGCTCTCCAGAATGTTG/3IABkFQ/-3' (SEQ ID NO: 12).

ZMPSTE24-ZMYM4: F, 5'-GAGGAAGAAGGGAACAGTGAAGA-3' (SEQ ID NO: 13); R, 5'-CTGGAATAGGGCTCAGTAAAAATGTTATC-3' (SEQ ID NO: 14); TaqMan probe, 5'-/56-FAM/AGACACAGC/ZEN/AGGATGCCAATG/3IABkFQ/-3' (SEQ ID NO: 15).

PTEN-NOLC1: F, 5'-AAGCCAACCGATACTTTTCTCCA-3' (SEQ ID NO: 16); R, 5'-ATAGATGTCTAAGAGGGAAGAGG-3' (SEQ ID NO: 17); TaqMan probe, 5'-/56-FAM/AGACACAGC/ZEN/AGGATGCCAATG/3IABkFQ/-3' (SEQ ID NO: 18).

STAMBPL1-FAS: F, 5'-TTCATCCACACACCAAGGAGC-3' (SEQ ID NO: 19); R, 5'-TGTGCCAGCCTTGTGCACACA-3' (SEQ ID NO: 20); TaqMan probe, 5'-/56-FAM/CAGGCTGTT/ZEN/CAGTATGCTCAG/3IABkFQ/-3' (SEQ ID NO: 21).

VAPB-GNAS: F, 5'-AAGGTGGAGCAGGTCCTGAG-3' (SEQ ID NO: 22); R, 5'-CTCATCTGCTTCACAATGGTGC-3' (SEQ ID NO: 23); TaqMan probe, 5'-/56-FAM/TCAAATTCC/ZEN/GAGGTGCTGGAG/3IABkFQ/-3' (SEQ ID NO: 24).

ZNF124-SMYD3: F, 5'-GATGTCGGGACACCCCGGAA-3' (SEQ ID NO: 25); R, 5'-GGCACTGAGAGCATCGCATC-3' (SEQ ID NO: 26); TaqMan probe, 5'-/56-FAM/CAGGCTGTT/ZEN/CAGTATGCTCAG/3IABkFQ/-3' (SEQ ID NO: 27).

β-actin: F, 5'-ACCCCACTTCTCTCTAAGGAG-3' (SEQ ID NO: 28); R, 5'-GCAATGCTATCACCTCCCCTG-3' (SEQ ID NO: 29); TaqMan probe, 5'-/56-FAM/CCAGTCCTC/ZEN/TCCCAAGTCCACAC/3IABkFQ/-3' (SEQ ID NO: 30).

A negative control and a synthetic positive control were included in each batch of reactions. For 10% of the positive samples, the PCR products were gel-purified and Sanger-sequenced. A sample was considered positive for a fusion transcript at a cycle threshold (CT) < 45.

Prediction Model on Fusion-Gene Profile

[00117] Machine-learning models were introduced to predict the non-HCC and HCC cases based on the status of the fusion genes. The CT values of nine fusions were measured by TaqMan real-time quantitative RT-PCR and served as the input features of the machine-learning algorithms. In the training cohort, the associations between individual fusions and non-HCC/HCC status at different CT cutoffs were calculated, and the cutoffs with the highest prediction accuracy, area under the receiver operating characteristic curve, and Youden index (sensitivity + specificity - 1) were selected. Then, in the multi-fusion models, different fusion combinations with their selected best cutoffs were used as the input features for the machine-learning model to predict the binomial outcome for the disease status (non-HCC versus HCC). Four machine-learning algorithms were employed: support vector machine, random forest, linear discriminant analysis, and logistic regression. Leave-one-out cross-validation (LOOCV) was performed on the training set to select the best model. Finally, the four fusion genes logistic regression model (MAN2A1-FER<40, CCNH-C5orf30<38, SLC45A2-AMACR<41, and PTEN-NOLC1<40) was selected among all of the parameter and algorithm combinations. In the next step, this best model was applied to the testing cohort for performance evaluation. Ultimately, LOOCV was applied to all of the data (pooled training and testing cohorts) to generate the best model for the prediction of new cases in the combined cohort. Biostatistical analysis was performed with the statistical software package R version 4.2.2 (https://cran.r-project.org/bin/windows/base) with the following packages: randomForest version 4.7-1.1, MASS version 7.3-60.0.1, and e1071 version 1.7-13.
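
The per-fusion cutoff selection described above can be illustrated with a short Python sketch. For brevity it ranks candidate CT cutoffs by the Youden index alone, whereas the study also considered prediction accuracy and area under the receiver operating characteristic curve. The CT values, labels, candidate range, and function names are hypothetical and are not data or code from the study.

```python
from typing import Iterable, Optional, Sequence, Tuple


def youden_index(calls: Sequence[bool], labels: Sequence[int]) -> float:
    """Youden index = sensitivity + specificity - 1.

    Assumes both classes (1 = HCC, 0 = non-HCC) occur in `labels`.
    """
    tp = sum(1 for c, y in zip(calls, labels) if c and y == 1)
    fn = sum(1 for c, y in zip(calls, labels) if not c and y == 1)
    tn = sum(1 for c, y in zip(calls, labels) if not c and y == 0)
    fp = sum(1 for c, y in zip(calls, labels) if c and y == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_ct_cutoff(ct_values: Sequence[Optional[float]],
                   labels: Sequence[int],
                   candidates: Iterable[int]) -> Tuple[int, float]:
    """Return the (cutoff, Youden index) pair with the highest index.

    A sample is called positive when its CT is below the cutoff; None means
    no amplification was observed. Ties keep the lowest cutoff tried first.
    """
    best: Optional[Tuple[int, float]] = None
    for cutoff in candidates:
        calls = [ct is not None and ct < cutoff for ct in ct_values]
        j = youden_index(calls, labels)
        if best is None or j > best[1]:
            best = (cutoff, j)
    if best is None:
        raise ValueError("no candidate cutoffs supplied")
    return best


# Hypothetical CT values for one fusion transcript across eight individuals.
cts = [30.1, 33.5, 36.0, 39.2, 41.0, 43.5, None, None]
status = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = HCC, 0 = non-HCC
cutoff, j_stat = best_ct_cutoff(cts, status, range(35, 46))
```

With these made-up values, a cutoff of 40 separates the two groups completely, so the selected pair is (40, 1.0).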

Prediction Model Integrating Fusion Genes and Serum AFP

[00118] When serum AFP was available for the prediction of HCC and benign samples, different AFP cutoffs were applied to generate the receiver operating characteristic curve. In the integration model, the fusion-gene profile was combined with the serum AFP to train the fusion + AFP machine-learning model. When the probability was <0.5, it was classified as non-HCC. When the probability was >0.5, it was deemed HCC. Similar to the models involving only fusion-gene data, the models integrating fusion gene + AFP were applied to the training cohort. The best parameters selected by LOOCV were used as the final model for the training cohort and then applied to the testing cohort for evaluation. Lastly, the data from both cohorts were pooled together to provide a final prediction model for the combined cohort. Biostatistical analysis was performed with the statistical software package R.

Results

[00119] To analyze the utility of serum fusion transcripts in patients with HCC, a cohort of 136 individuals, including 61 individuals with known HCC and 75 individuals with non-HCC medical conditions, was recruited. All 61 patients with HCC were treated with several types of therapies, including liver transplantation, transarterial chemoembolization, radiofrequency ablation, yttrium 90 isotope radiation, and/or surgical resection. The serum samples were collected before treatment. To monitor the impact of treatment on the serum fusion-transcript levels, additional samples were collected after treatment. A total of 59% of the HCC cases were Milan-IN status, while 41% were Milan-OUT. The overall mortality rate was 77% (47/61).

Fusion Transcripts Are Frequently Present in the Serum Samples of Patients with HCC

[00120] Nine fusion transcripts were analyzed in the serum samples of 136 individuals. In the pre-treatment serum samples from patients with HCC, the most frequent fusion transcript detected was MAN2A1-FER (100%, 61/61) (FIGS. 3A-3B), followed by SLC45A2-AMACR and ZMPSTE24-ZMYM4 (both, 62.3%, 38/61). The other frequently detected fusion transcripts in the pre-treatment HCC serum samples were PTEN-NOLC1 and CCNH-C5orf30, accounting for 57.4% (35/61) and 55.7% (34/61) of the total samples, respectively. The relatively low-frequency fusion transcripts were STAMBPL1-FAS (26.2%, 16/61), PCMTD1-SNTG1 (16.4%, 10/61), and ZNF124-SMYD3 (16.4%, 10/61). The frequencies of these fusion transcripts in the sera of patients with HCC dropped significantly after treatment (FIGS. 3A-3B). Interestingly, 30.7% (23/75) of individuals without HCC were also positive for MAN2A1-FER, albeit mostly in lower quantities (FIGS. 3A-3B). Other fusion transcripts were also detected in non-HCC individuals, but at relatively low frequencies. The Fisher exact test showed that seven fusion genes, namely MAN2A1-FER (P = 1.4 × 10^-19), CCNH-C5orf30 (P = 1.1 × 10^-5), SLC45A2-AMACR (P = 2.8 × 10^-9), ZMPSTE24-ZMYM4 (P = 9.3 × 10^-6), PTEN-NOLC1 (P = 1.8 × 10^-6), PCMTD1-SNTG1 (P = 0.006), and STAMBPL1-FAS (P = 0.01), had significantly greater frequencies of expression in the serum samples of patients with HCC, suggesting that the detection of these fusion transcripts in serum likely predicts the risk for HCC.
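
For readers who wish to see how a Fisher exact comparison of this kind is computed, the following Python sketch evaluates a two-sided test on a 2x2 table built from the MAN2A1-FER counts reported above (61 of 61 HCC sera positive versus 23 of 75 non-HCC sera positive). The function name and the two-sided convention (summing all tables no more probable than the observed one) are assumptions of the sketch; the study used R, whose settings are not stated, so the result need not match the reported P value digit for digit.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_obs * (1 + 1e-9))


# Rows: HCC, non-HCC. Columns: MAN2A1-FER positive, MAN2A1-FER negative.
p_man2a1_fer = fisher_exact_two_sided(61, 0, 23, 52)
```

The same function can be applied to the other fusion transcripts once their positive and negative counts in each group are tabulated.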

[00121] Association analyses indicated that the presence of PTEN-NOLC1 in the pre-treatment serum samples was associated with poor survival (P = 0.046) and an increased risk for death (P = 0.004). Strong expression of MAN2A1-FER (<36 cycles) in the pre-treatment serum samples was also associated with poor survival (P = 0.03). The persistent presence of the MAN2A1-FER fusion transcript (<41 cycles, P = 0.029) in the serum samples after treatment was associated with a greater risk for less favorable survival outcomes. In addition, the risk for cancer progression/recurrence increased significantly if CCNH-C5orf30 (P = 0.026) or ZMPSTE24-ZMYM4 (P = 0.024) was detected in the serum samples after treatment.

Prediction of HCC Occurrence by Serum Fusion Transcripts

[00122] To investigate whether serum fusion transcripts predict the occurrence of HCC, samples collected from 2009 to 2015 were utilized as a training cohort, while samples collected from 2016 to 2021 were designated as the testing cohort. There were 82 individuals in the training cohort, including 31 patients with HCC and 51 individuals without HCC. Different combinations of five fusion genes (MAN2A1-FER, CCNH-C5orf30, SLC45A2-AMACR, ZMPSTE24-ZMYM4, and PTEN-NOLC1) with different cutoff thresholds were used to construct 716 machine-learning models based on the pre-treatment serum fusion-gene data in the training cohort, using LOOCV analysis. A total of 97% (691/716) of these models had a prediction accuracy exceeding 70%, while 56% (404/716) had an accuracy exceeding 80%. One of these models, called four fusion genes logistic regression (MAN2A1-FER<40, CCNH-C5orf30<38, SLC45A2-AMACR<41, and PTEN-NOLC1<40), reached an accuracy of 91.5%, with a sensitivity of 87.1% and a specificity of 94.1% (FIG. 4). To test the validity of the model, this algorithm was then applied to the serum samples from the testing cohort, which were collected at a later time and included samples from 30 patients with HCC and 24 non-HCC individuals. The validation test showed that the four fusion genes logistic regression model correctly predicted 83.3% of the statuses of these cases, with a sensitivity of 73.3% and a specificity of 95.8% (FIG. 4). Combining the training and testing cohorts produced an accuracy of 86% with a sensitivity of 82% and a specificity of 89.3% (FIG. 4).
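
The relationship between accuracy, sensitivity, and specificity in the training cohort can be checked with a few lines of Python. The confusion counts below (27 true positives, 4 false negatives, 48 true negatives, 3 false positives) are not stated in the text; they are back-calculated here, as an assumption, from the reported percentages and the cohort sizes of 31 patients with HCC and 51 individuals without HCC. The function name is likewise illustrative.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # HCC cases called HCC
        "specificity": tn / (tn + fp),          # non-HCC cases called non-HCC
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts, inferred from the reported training-cohort percentages.
training = diagnostic_metrics(tp=27, fn=4, tn=48, fp=3)
as_percent = {name: round(100 * value, 1) for name, value in training.items()}
```

Under these assumed counts the three values round to the sensitivity of 87.1%, specificity of 94.1%, and accuracy of 91.5% reported for the training cohort.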

[00123] Survival analysis showed that 96% (49/51) of individuals survived >10 years when predicted as non-HCC by the four fusion genes logistic regression model in the training cohort, while only 19% (6/31) survived a similar period when predicted as HCC (FIG. 5). In the testing cohort, 84% (26/31) of individuals predicted as non-HCC survived >5 years, while only 17% (4/23) predicted as HCC survived a similar period (FIG. 5). When both cohorts were combined, the same four fusion genes logistic regression model produced a 5-year survival rate of 92% (71/77) in individuals predicted as non-HCC and a 5-year survival rate of 17% (10/58) in individuals predicted as HCC (FIG. 5).

Incorporation of Serum AFP with Fusion Gene Model

[00124] AFP is the primary screening tool for detecting HCC. However, a few non-HCC conditions, including hepatic cirrhosis, may lead to the elevation of AFP. There is a consensus that a level of >400 ng/mL may indicate the presence of HCC. Applying a >400 ng/mL AFP cutoff generated an accuracy of 67.2%, with a sensitivity of 38.7% and a specificity of 100% in the training cohort, and an accuracy of 53.5% with a sensitivity of 33.3% and a specificity of 100% in the testing cohort. The overall accuracy of the AFP threshold of >400 ng/mL was 61.4% when applied to the combined cohorts. Lowering the threshold of AFP to >200 ng/mL improved the prediction accuracy to 67%, with a sensitivity of 45.9% and specificity of 100%, in the combined cohorts. No individuals with an AFP of >400 or >200 ng/mL survived through 100 months (FIGS. 6-7), while 47.4% (37/78) of individuals with an AFP of <400 ng/mL and 55.6% (40/72) of individuals with an AFP of <200 ng/mL survived through a similar period. These results suggested that high serum AFP had a high specificity in predicting the occurrence of HCC and may complement the fusion-gene model.

[00125] To investigate the impact of serum AFP level on the fusion-gene model, 716 fusion-gene models were incorporated with serum AFP elements. LOOCV was performed on the serum samples from the training cohort. As shown in, 632 of these models had prediction accuracies exceeding 80% in the training cohort, while 101 of these models had accuracies exceeding 90%. One of these models, called the two-fusion gene (MAN2A1-FER<40, CCNH-C5orf30<38) + AFP logistic regression model, had an accuracy of 94.8%, with a sensitivity of 93.5% and a specificity of 96.3% (FIG. 8). Applying this algorithm to the testing data set resulted in similar accuracy: 95% accuracy, with a sensitivity of 96.7% and a specificity of 92.3% (FIG. 8). On combining both cohorts, the two-fusion gene + AFP logistic regression model produced an accuracy of 95%, a sensitivity of 95.1%, and a specificity of 95% (FIG. 8). The similar accuracies of this model throughout these cohorts suggest that the model is consistent and stable.
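
The leave-one-out procedure and the 0.5 probability cutoff described in the Materials and Methods can be sketched in Python as follows. To keep the example short, a simple frequency-based stand-in model replaces the logistic regression of the study. That substitution, the toy rows, the handling of a probability of exactly 0.5 (called non-HCC here, a case the text does not address), and all names are assumptions of this sketch.

```python
from typing import Callable, Dict, Sequence, Tuple

Row = Tuple[int, ...]
Model = Callable[[Row], float]


def fit_frequency_model(X: Sequence[Row], y: Sequence[int]) -> Model:
    """Stand-in model: probability of HCC is the share of HCC cases among
    training rows with the same feature pattern, falling back to the overall
    prevalence for patterns not seen in training."""
    prevalence = sum(y) / len(y)
    counts: Dict[Row, Tuple[int, int]] = {}
    for xi, yi in zip(X, y):
        pos, tot = counts.get(xi, (0, 0))
        counts[xi] = (pos + yi, tot + 1)

    def predict(x: Row) -> float:
        if x in counts:
            pos, tot = counts[x]
            return pos / tot
        return prevalence

    return predict


def classify(probability: float) -> int:
    """Probability above 0.5 is called HCC (1); otherwise non-HCC (0)."""
    return 1 if probability > 0.5 else 0


def loocv_accuracy(X: Sequence[Row], y: Sequence[int],
                   fit: Callable[[Sequence[Row], Sequence[int]], Model]) -> float:
    """Hold out each individual once, refit on the rest, and score the call."""
    correct = 0
    for i in range(len(X)):
        X_train = list(X[:i]) + list(X[i + 1:])
        y_train = list(y[:i]) + list(y[i + 1:])
        model = fit(X_train, y_train)
        correct += int(classify(model(X[i])) == y[i])
    return correct / len(X)


# Hypothetical rows: (fusion A below its CT cutoff, AFP above threshold)
X_toy = [(1, 1), (1, 1), (1, 1), (0, 0), (0, 0), (0, 0), (1, 0), (1, 0)]
y_toy = [1, 1, 1, 0, 0, 0, 1, 0]
accuracy = loocv_accuracy(X_toy, y_toy, fit_frequency_model)
```

Because each held-out individual is scored by a model that never saw that individual, the resulting accuracy estimates performance on unseen cases; any fitting routine with the same interface could be passed in place of the stand-in.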

[00126] Of those individuals predicted as non-HCC by the two-fusion gene + AFP model in the training cohort, 96% (27/28) experienced >10 years of survival (FIG. 9). On the other hand, of those predicted as HCC, only 17% (5/29) experienced a similar period of survival. In the testing set, individuals predicted as non-HCC had a 5-year survival rate of 92.3% (12/13), while individuals predicted as HCC had a 5-year survival rate of 23% (7/30) (FIG. 9). In the combined cohorts, individuals predicted as non-HCC had a survival rate of 95% for at least 5 years. Meanwhile, those individuals predicted as HCC had a survival rate of 15.2% for a similar period (FIG. 9). These results suggest that the two-fusion gene + AFP logistic regression model was effective in predicting the survival related to HCC.

Impact of Treatment on the Serum Level of Fusion Transcripts

[00127] To examine the potential utility of serum fusion transcripts in monitoring the effects of treatment and the progression of HCC, the post-treatment serum fusion transcripts were analyzed in patients with HCC who had recently undergone treatment for cancer to assess the impact of treatment on fusion-transcript levels. Forty-one percent (25/61) of patients with HCC experienced a complete negative conversion of fusion transcripts in their serum samples after the treatment. Most of the patients, however, experienced a partial reduction in fusion-transcript levels in the post-treatment sera or experienced only negative conversion of some fusion transcripts but not others. Some patients even had the emergence of new fusion transcripts not detected in the serum samples obtained before treatment. The different dynamics of serum transcripts for different fusion genes after treatment suggested significant heterogeneity of the liver cancers in these individuals and may reflect the dynamics of HCC tumor load impacted by treatment.

[00128] As shown in FIGS. 10A-10I, 95% (58/61) of patients with HCC experienced a drop in MAN2A1-FER transcript levels in the sera obtained after treatment. Three patients who had no decrease in MAN2A1-FER transcript showed distant metastases or HCC progression shortly after treatment. The treatment of HCC also resulted in decreases in SLC45A2-AMACR and ZMPSTE24-ZMYM4 in 92.1% and 86.8% of the patients with HCC, respectively. Interestingly, in 2 patients with HCC, SLC45A2-AMACR emerged as a new transcript in the post-treatment serum, while in 8, ZMPSTE24-ZMYM4 emerged as a new transcript after treatment. Seven of eight patients with HCC with serum resurgence of ZMPSTE24-ZMYM4 had either progression or recurrence. The impact of cancer treatments was also found in the serum levels of CCNH-C5orf30: 97.1% (33/34) of patients with HCC experienced a drop in the serum levels of the transcript after cancer treatment. However, 5 individuals had CCNH-C5orf30 emerge as a new transcript in the serum samples after treatment, and all 5 patients with HCC experienced progression, recurrence, or distant metastasis. When serum transcript dynamics of CCNH-C5orf30 before and after treatment were used as a factor to assess the risk for HCC progression/recurrence, a serum level drop of less than seven PCR cycles for the CCNH-C5orf30 transcript in sera after treatment was found to be associated with an increased risk for HCC recurrence/progression (P = 0.016). A similar association between ZMPSTE24-ZMYM4 dynamics and HCC progression was found: A serum level drop of four cycles or less for the ZMPSTE24-ZMYM4 transcript in sera after treatment signaled an increased risk for HCC recurrence/progression (P = 0.03).
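
The cycle-based associations reported above could be turned into a simple monitoring flag along the following lines. This Python sketch rests on several assumptions that are not stated in the text: a drop in serum transcript level is measured as the post-treatment CT minus the pre-treatment CT (a higher CT meaning less transcript), an undetected transcript is assigned the 50-cycle maximum of the assay, and a transcript undetected at both time points is not flagged. The function and constant names are illustrative only.

```python
from typing import Optional

MAX_CYCLES = 50.0  # assumed placeholder CT for samples with no amplification


def level_drop_in_cycles(ct_before: Optional[float],
                         ct_after: Optional[float]) -> float:
    """Drop in transcript level expressed as the increase in CT after treatment."""
    before = MAX_CYCLES if ct_before is None else ct_before
    after = MAX_CYCLES if ct_after is None else ct_after
    return after - before


def elevated_recurrence_risk(fusion: str,
                             ct_before: Optional[float],
                             ct_after: Optional[float]) -> bool:
    """Flag the reported patterns: a drop of fewer than seven cycles for
    CCNH-C5orf30, or of four cycles or fewer for ZMPSTE24-ZMYM4."""
    if ct_before is None and ct_after is None:
        return False  # never detected, so there are no dynamics to assess
    drop = level_drop_in_cycles(ct_before, ct_after)
    if fusion == "CCNH-C5orf30":
        return drop < 7
    if fusion == "ZMPSTE24-ZMYM4":
        return drop <= 4
    raise ValueError(f"no rule reported for fusion transcript {fusion!r}")


# Hypothetical CT pairs (before, after treatment), not data from the study.
flag_partial = elevated_recurrence_risk("CCNH-C5orf30", 36.0, 40.0)
flag_cleared = elevated_recurrence_risk("CCNH-C5orf30", 36.0, None)
flag_emerged = elevated_recurrence_risk("ZMPSTE24-ZMYM4", None, 39.0)
```

A transcript that newly emerges after treatment yields a negative drop and is therefore flagged, which is in line with the progression and recurrence observed in the patients in whom new transcripts appeared.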

Discussion

[00129] All five fusion genes in the prediction models resulted from the underlying chromosome abnormalities in the cancer genome. Their occurrences were the results of chromosome recombination in the cancer cells. They were the products of pathologic processes essential for cancer development. Normal tissues did not contain these chromosome features and, thus, did not express these transcripts. The detection of these fusion transcripts in the serum samples implies the presence of cancer cells. The widespread presence of these fusion transcripts in the serum samples from patients with HCC suggests that HCC cells shed these RNA fragments into the bloodstream. Possibly, the cancer cells also entered the blood circulation. Most patients with HCC achieved lower or even negative fusion-transcript levels after cancer treatment, suggesting that the cancer cells were the source of these fusion transcripts. Thus, detecting these fusion transcripts in the serum may have utility in screening for HCC and in assessing the cancer load. The machine-learning model tests developed from this study could be utilized in the following scenarios: i) to determine whether a liver biopsy is necessary in patients with a mass <LR5 or a nodule of unknown significance on liver imaging, ii) to evaluate cancer of the liver in patients with normal serum AFP levels, iii) to assess the efficacy of treatment and the presence of residual cancer in patients with HCC who have undergone treatment, and iv) as a routine screening test for early HCC in high-risk individuals before any expensive radiology imaging is performed.

[00130] The non-HCC cohort comprised individuals with other medical conditions. Most of them had liver disease. A large number of these individuals had pre-HCC conditions, including cirrhosis/fibrosis (29.3%, 22/75), nonalcoholic steatohepatitis/steatosis (32%, 24/75), hepatitis virus B/C infection (4%, 3/75), and alcoholic liver disease (2.7%, 2/75). A significant number of individuals from control groups were also positive for oncogenic fusion transcripts, such as MAN2A1-FER, SLC45A2-AMACR, and PTEN-NOLC1, albeit mostly at lower frequencies and not reaching the threshold of HCC determined by the machine-learning models. Many of these individuals had a history of other malignancies, suggesting that these fusion genes may come from other cancers. Yet, some individuals had no known history of malignancy but tested positive for these fusion transcripts. The cause of the presence of these fusion transcripts in these individuals' blood samples remained unclear. One speculation was that cancer cells shedding these fusion-transcript fragments were very dispersed and had not yet formed an imageable nodule. Alternatively, these fusion genes may be present in pre-malignant tissue, and the pre-malignant cells shed the fusion transcripts into the bloodstream. Under either interpretation, clinical follow-up of these individuals might be warranted given that positive fusion transcripts in the blood indicated genome alterations in some somatic cells.

[00131] One interesting finding of this study was the dynamic changes in serum fusion-transcript levels after treatment. While in most individuals, fusion transcripts dropped to undetectable levels after treatment, significant numbers of patients with HCC had only partially decreased serum fusion transcripts. In many cases, some fusion transcripts disappeared from the serum, while others remained unchanged. New fusion transcripts emerged in the serum after treatment in a few instances. All of these scenarios suggested the multiclonal and multifocal nature of the HCC. HCC may occur in multiple locations in the liver. Location-based treatment may eliminate most imageable cancer nodules but may leave the unidentified HCC nodules untreated. In addition, new cancer nodules might arise after treatment if a patient has a high-risk background for HCC. Given that each fusion gene results from a specific genome alteration, the fusion-gene patterns may reflect specific clonal origins of the cancers. As a result, active serum fusion-transcript surveillance may hold promise for assessing the impact of treatment and the heterogeneity of HCC in an accurate and timely manner.

Claims

THE INVENTION CLAIMED IS

1. A computer-implemented method comprising: receiving, with at least one processor, training data, the training data comprising genomic data and diagnostic data from a plurality of patients, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or more of CCNH-C5orf30, SLC45A2-AMACR, PTEN-NOLC1, ZMPSTE24-ZMYM4, STAMBPL1-FAS, and PCMTD1-SNTG1 in a sample obtained from the plurality of patients; determining, with at least one processor and based at least on the genomic data and the diagnostic data, an association between the genomic data and the diagnostic data; and training, with at least one processor and based on the association, a machine learning model, the machine learning model configured to generate an output, the output comprising a prediction of a patient diagnosis.

2. The computer-implemented method of claim 1, wherein the machine learning model is a support vector machine.

3. The computer-implemented method of claim 1, wherein the machine learning model is a random forest.

4. The computer-implemented method of claim 1, wherein the machine learning model is a linear discriminant analysis.

5. The computer-implemented method of claim 1, wherein the machine learning model is a logistic regression.

6. The computer-implemented method of any one of claims 1-5, wherein the genomic data comprises fusion gene data.

7. The computer-implemented method of claim 6, wherein the fusion gene data is selected based on a cycle threshold of less than or equal to 45.

8. The computer-implemented method of claim 7, wherein the cycle threshold is obtained using a real-time quantitative polymerase chain reaction (RT-PCR) assay.

9. The computer-implemented method of claim 7 or claim 8, wherein the RT-PCR assay utilizes at least one primer, wherein the at least one primer is at least one of SEQ ID NOS: 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25, 26, 28, and 29.

10. The computer-implemented method of any one of claims 1-9, wherein the training data further comprises serum protein data from the plurality of patients, and wherein: the determining step comprises determining, with at least one processor and based at least on the genomic data and the serum protein data, an association between the genomic data and the serum protein data and the diagnostic data.

11. The computer-implemented method of claim 10, wherein the serum protein data comprises serum α-fetoprotein (AFP) data.

12. The computer-implemented method of claim 11, wherein the serum AFP data comprises a comparison of an amount of AFP in a sample obtained from a patient compared to a threshold value.

13. The computer-implemented method of claim 12, wherein the threshold value is 200 ng/mL.

14. The computer-implemented method of claim 12, wherein the threshold value is 400 ng/mL.

15. The computer-implemented method of any one of claims 1-12, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or both of CCNH-C5orf30 and ZMPSTE24-ZMYM4 in the sample obtained from the plurality of patients.

16. A computer-implemented method comprising: receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based on the association, a logistic regression machine learning model, the machine learning model configured to generate an output comprising a prediction of whether a patient has hepatocellular carcinoma.

17. A computer-implemented method of determining that a patient has hepatocellular carcinoma comprising: applying a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output comprising a prediction of whether the patient has hepatocellular carcinoma, and generating, with at least one processor and based at least in part on application of the trained logistic regression machine learning model to the patient serum AFP data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

18. The computer-implemented method of claim 17, wherein the patient is suspected of having hepatocellular carcinoma.

19. The computer-implemented method of claim 17, wherein the patient has previously been diagnosed with hepatocellular carcinoma and has previously undergone treatment for hepatocellular carcinoma.

20. The computer-implemented method of claim 17, further comprising receiving, with at least one processor, imaging data relating to the patient’s liver.

21. The computer-implemented method of claim 20, further comprising generating, based on the patient serum protein data, the patient serum fusion gene data, and the imaging data, a treatment plan for the patient.

22. A method of treating a patient having hepatocellular carcinoma, comprising: diagnosing a patient with hepatocellular carcinoma, wherein the diagnosis is based at least in part on application of a trained logistic regression machine learning model to serum fusion gene data and serum protein data obtained from the patient, the machine learning model trained by receiving, with at least one processor, training data, the training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate an output comprising a prediction of whether the patient has hepatocellular carcinoma; and treating the patient, wherein treating the patient comprises at least one of gene therapy, immunotherapy, irradiation of cancerous tissue, administration of one or more chemotherapeutic agents, and/or surgical resection of cancerous tissue.

23. A system comprising at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to: receive training data, the training data comprising genomic data and diagnostic data from a plurality of patients; determine an association between the genomic data and the diagnostic data; and train, based at least on the association, a machine learning model, the machine learning model configured to generate an output comprising a prediction of a patient diagnosis.

24. The system of claim 23, wherein the machine learning model is a support vector machine.

25. The system of claim 23, wherein the machine learning model is a random forest.

26. The system of claim 23, wherein the machine learning model is a linear discriminant analysis.

27. The system of claim 23, wherein the machine learning model is a logistic regression.

28. The system of any one of claims 23-27, wherein the genomic data comprises fusion gene data.

29. The system of claim 28, wherein the fusion gene data is selected based on a cycle threshold of less than or equal to 45.

30. The system of any one of claims 23-29, wherein the training data further comprises serum protein data from the plurality of patients, and wherein: the determining step comprises determining, with at least one processor and based at least on the genomic data and the serum protein data, an association between the genomic data, the serum protein data, and the diagnostic data.

31. The system of claim 30, wherein the serum protein data comprises serum AFP data.

32. The system of claim 31, wherein the serum AFP data comprises a comparison of an amount of AFP in a sample obtained from a patient to a threshold value.

33. The system of claim 32, wherein the threshold value is 200 ng/mL.

34. The system of claim 32, wherein the threshold value is 400 ng/mL.

35. The system of any one of claims 23-34, wherein the genomic data comprises fusion gene data comprising data relating to presence of one or more of MAN2A1-FER, CCNH-C5orf30, SLC45A2-AMACR, PTEN-NOLC1, ZMPSTE24-ZMYM4, STAMBPL1-FAS, and PCMTD1-SNTG1 in a sample obtained from a patient.

36. A system comprising at least one processor and a non-transitory computer-readable medium storing programming instructions configured to cause the at least one processor to: apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output comprising a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.

37. The system of claim 36, wherein the patient serum protein data comprises a comparison of an amount of AFP in a sample obtained from the patient to a threshold value.

38. The system of claim 37, wherein the threshold value is 200 ng/mL.

39. The system of claim 37, wherein the threshold value is 400 ng/mL.

40. The system of any one of claims 36-39, wherein the patient genomic data comprises fusion gene data comprising data relating to presence of one or both of CCNH-C5orf30 and ZMPSTE24-ZMYM4 in a sample obtained from the patient.

41. A computer-readable, non-transitory medium comprising programming instructions that when executed by a processor cause the processor to: apply a trained logistic regression machine learning model to patient serum protein data and patient serum fusion gene data, the machine learning model trained by: receiving, with at least one processor, training data comprising: fusion gene data relating to presence of ZMPSTE24-ZMYM4 and/or CCNH-C5orf30 in a sample obtained from a plurality of historic patients; serum protein data relating to an amount of AFP in a sample obtained from the plurality of historic patients; and diagnostic data relating to a diagnosis of hepatocellular carcinoma in the plurality of historic patients; determining, with at least one processor and based at least on the training data, an association between the genomic data, the serum protein data, and the diagnostic data; training, with at least one processor and based at least in part on the association, the machine learning model, the machine learning model configured to generate output comprising a prediction of whether the patient has hepatocellular carcinoma; and generate, based on application of the trained logistic regression machine learning model to the patient serum protein data and the patient serum fusion gene data, a prediction of whether the patient has hepatocellular carcinoma.
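
Illustrative sketch (not part of the claims). The training and prediction steps recited in claims 16, 17, 36, and 41 can be pictured with a minimal Python example, offered only as one possible reading under stated assumptions. The assumptions are: each patient is encoded as three binary features (presence of ZMPSTE24-ZMYM4, presence of CCNH-C5orf30, and AFP at or above a threshold); the comparison is "greater than or equal to"; the model is fitted by plain batch gradient descent; and the function names `encode`, `train`, and `predict_proba` are hypothetical. The cohort below is synthetic toy data invented for the example and does not come from this application.

```python
import math

# One of the two thresholds recited in the claims (the other is 200 ng/mL).
AFP_THRESHOLD_NG_ML = 400.0


def encode(has_zmpste24_zmym4, has_ccnh_c5orf30, afp_ng_ml,
           threshold=AFP_THRESHOLD_NG_ML):
    """Encode one patient as a binary feature vector (assumed encoding)."""
    return [
        1.0 if has_zmpste24_zmym4 else 0.0,
        1.0 if has_ccnh_c5orf30 else 0.0,
        # Assumption: the "comparison to a threshold" is AFP >= threshold.
        1.0 if afp_ng_ml >= threshold else 0.0,
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of an HCC diagnosis for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Synthetic toy cohort (NOT data from the application):
# (ZMPSTE24-ZMYM4 present, CCNH-C5orf30 present, AFP in ng/mL, HCC diagnosis)
TOY_COHORT = [
    (True, True, 850.0, 1),
    (True, False, 120.0, 1),
    (False, True, 430.0, 1),
    (True, True, 35.0, 1),
    (False, False, 510.0, 1),
    (False, False, 12.0, 0),
    (False, False, 90.0, 0),
    (False, False, 5.0, 0),
    (False, True, 20.0, 0),
    (False, False, 150.0, 0),
]

X = [encode(a, b, afp) for a, b, afp, _ in TOY_COHORT]
y = [label for *_, label in TOY_COHORT]
weights, bias = train(X, y)

# Apply the trained model to a new (hypothetical) patient.
new_patient = encode(True, False, 250.0)
probability = predict_proba(weights, bias, new_patient)
prediction = probability >= 0.5  # assumed decision cutoff
```

In this sketch the "association" between the fusion gene data, the serum protein data, and the diagnostic data is simply the fitted weight vector. A practical implementation would more likely use an established statistics library and a held-out validation cohort.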