
WO2025129061A1 - Systems and methods for masking treatment-affected regions in the genome to improve classifier performance - Google Patents

Systems and methods for masking treatment-affected regions in the genome to improve classifier performance

Info

Publication number
WO2025129061A1
Authority
WO
WIPO (PCT)
Prior art keywords
treatment
feature
nucleic acid
sample
methylation data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2024/060122
Other languages
English (en)
Inventor
Yifan Zhou
Collin MELTON
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grail Inc
Original Assignee
Grail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grail Inc
Publication of WO2025129061A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/106 Pharmacogenomics, i.e. genetic variability in individual responses to drugs and drug metabolism
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Definitions

  • the present disclosure relates generally to the field of bioinformatics and genomics and, more specifically, to systems and methods for improving the performance of a disease state classifier.
  • Chemoradiotherapy (CRT) is a widely used approach that combines chemotherapy and radiation therapy to target and eradicate cancerous cells.
  • CRT can induce changes in cell-free DNA (cfDNA) and DNA methylation patterns in cancer subjects.
  • Such changes in DNA methylation patterns in cancer subjects may introduce confounding factors that can adversely affect the performance of cancer-detecting classifiers.
  • One or more aspects of this disclosure may address one or more of the issues described above.
  • systems and methods are described for masking treatment-affected genomic regions in training data to improve the performance of a machine learning classifier trained to predict a disease state in a sample.
  • a computer-implemented method may include: receiving, at a computing device, a first set of nucleic acid methylation data and a second set of nucleic acid methylation data, wherein the first set of nucleic acid methylation data is associated with a pre-treatment sample and wherein the second set of nucleic acid methylation data is associated with a post-treatment sample; comparing, using a processor of the computing device, a first feature set of the first set of nucleic acid methylation data against a second feature set of the second set of nucleic acid methylation data; determining, based on the comparing, at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data; and implementing, based on the determining, an exclusion process on the at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data.
  • a system may include: one or more processors; one or more computer readable media storing instructions that are executable by the one or more processors to perform operations to: receive, at a computing device associated with the system, a first set of nucleic acid methylation data and a second set of nucleic acid methylation data, wherein the first set of nucleic acid methylation data is associated with a pre-treatment sample and wherein the second set of nucleic acid methylation data is associated with a post-treatment sample; compare a first feature set of the first set of nucleic acid methylation data against a second feature set of the second set of nucleic acid methylation data; determine, based on the comparing, at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data; and implement, based on the determining, an exclusion process on the at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data.
  • a non-transitory computer-readable medium may store computer-executable instructions which, when executed by a system, may cause the system to perform operations comprising: receiving, at a computing device, a first set of nucleic acid methylation data and a second set of nucleic acid methylation data, wherein the first set of nucleic acid methylation data is associated with a pre-treatment sample and wherein the second set of nucleic acid methylation data is associated with a post-treatment sample; comparing, using a processor of the computing device, a first feature set of the first set of nucleic acid methylation data against a second feature set of the second set of nucleic acid methylation data; determining, based on the comparing, at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data; and implementing, based on the determining, an exclusion process on the at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data.
  • the computer-implemented method may include: receiving, at a computing device, a first set of nucleic acid methylation data and a second set of nucleic acid methylation data, wherein the first set of nucleic acid methylation data is associated with a first sample associated with a first treatment condition and wherein the second set of nucleic acid methylation data is associated with a second sample associated with a second treatment condition; comparing, using a processor of the computing device, a first feature set of the first set of nucleic acid methylation data against a second feature set of the second set of nucleic acid methylation data; determining, based on the comparing, at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data; and implementing, based on the determining, an exclusion process on the at least one treatment affected feature in the second feature set of the second set of nucleic acid methylation data.
  • FIG. 1A depicts an exemplary computer system for executing the methods described herein.
  • FIG. 1B depicts an exemplary software platform for executing the methods described herein.
  • FIG. 2 depicts an exemplary workflow for identifying and removing treatment-affected features, according to one or more embodiments of the present disclosure.
  • FIG. 3 depicts an exemplary diagram illustrating a process for removing and down-weighting treatment-affected features, according to one or more embodiments of the present disclosure.
  • FIG. 4 depicts a flowchart of an exemplary method of masking treatment-affected regions, according to one or more embodiments of the present disclosure.
  • FIG. 5 depicts an example computing system, according to one or more embodiments of the present disclosure.
  • Chemoradiotherapy (CRT) combines chemotherapy, which involves medications that target cancer cells throughout the body, and radiation therapy, which employs high-energy radiation beams to destroy cancerous cells. While CRT is effective in many cases, it may affect the subject’s body, including modifying cell-free DNA (cfDNA) and the release of cfDNA into the bloodstream.
  • cfDNA is genetic material that originates from cells and can be found circulating in the plasma of blood, among other biofluids.
  • Samples of cfDNA may carry information about the disease state of a subject from which the sample was extracted.
  • cfDNA can carry information about the presence of genetic mutations associated with cancer.
  • Detecting disease-related, e.g., cancer-related, signals in cfDNA has shown promise as a non-invasive method for disease, e.g., cancer, diagnosis and monitoring. For example, one observation in cancer subjects undergoing CRT is increased levels of cfDNA in their plasma. This phenomenon is attributed to the destruction of cancer cells and the release of their DNA into the bloodstream.
  • the treatment for a disease, or another treatment that a subject may undergo, may cause changes in cfDNA that confound the ability to use cfDNA to detect the disease.
  • One challenge in cancer detection lies in distinguishing the presence of cancer-specific signals in cfDNA from the effects induced by cancer treatment, such as CRT.
  • the quantity of cfDNA may change during cancer treatment and/or CRT may induce changes to the DNA methylation patterns within the human genome.
  • Changes in methylation patterns may affect cancer detection and diagnosis, as they can alter the genetic markers used by diagnostic classifiers trained to identify cancer. More particularly, current state-of-the-art classifiers may analyze cfDNA to identify the presence of cancer and estimate tumor fraction.
  • the altered methylation patterns in cfDNA resulting from cancer treatment may introduce confounding factors that can adversely affect the performance of these classifiers and may hinder the accurate estimation of tumor fractions in subjects undergoing treatment. These effects can limit the ability of cancer classifiers to predict and detect the progress of the subjects’ treatment. This in turn can limit the ability of practitioners and researchers to accurately tailor treatment to their patients’ specific disease progress and response to the treatment.
  • distinguishing methylation alterations caused by a disease itself (e.g., cancer) from those caused by treatment may be challenging.
  • confounding factors may be introduced into training data that a disease-detecting classifier is trained on, thereby reducing its accuracy and reliability in disease detection.
  • accurate estimation of tumor fractions in subjects undergoing treatment may facilitate monitoring disease progression and/or making better-informed treatment decisions. Inaccuracies in this estimation may impact disease detection.
  • the present disclosure is designed to address the impact of treatment-induced changes in methylation patterns of nucleic acids, including cfDNA, in biopsy samples on the performance of disease, e.g., cancer, detecting classifiers.
  • the concepts described herein may be utilized to identify methylation patterns specifically affected by disease treatment, such as cancer treatment, and then “mask” these regions during classifier training. By systematically identifying and masking these treatment-affected methylation regions, the described concepts enable the classifier to differentiate the genuine disease-related, e.g., cancer-related, signals in cfDNA from those influenced by treatment.
  • This process may be facilitated by first comparing cfDNA samples from subjects undergoing treatment (“on-treatment” subjects) with those who have not yet undergone treatment (“treatment-naive” subjects). One or more statistical tests may be performed to evaluate the methylation differences between these two groups of subjects. Through this comparison, differentially methylated regions may be identified (e.g., genomic regions with test statistics that pass a predefined significance threshold and fold-change cutoff may be classified as differentially methylated regions). The identified treatment-affected regions may thereafter be masked from the training input that the classifier is trained on.
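The comparison described above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the source does not name a specific statistical test, so a Welch t-statistic with a normal-approximation p-value is assumed here, and the function names, the default thresholds (`alpha`, `min_fold_change`), and the per-region data layout are all hypothetical.

```python
import math
from statistics import mean, variance


def welch_p_value(a, b):
    """Two-sided p-value for a Welch t-statistic (normal approximation; an assumption)."""
    # Each group needs at least two values for the sample variance.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 1.0 if mean(a) == mean(b) else 0.0
    t = (mean(a) - mean(b)) / se
    return math.erfc(abs(t) / math.sqrt(2.0))


def find_treatment_affected_regions(naive, on_treatment,
                                    alpha=0.01, min_fold_change=2.0, eps=1e-6):
    """Return regions passing both the significance and the fold-change cutoff.

    `naive` and `on_treatment` map a region ID to a list of per-subject
    methylation levels (hypothetical layout).
    """
    affected = set()
    for region, naive_vals in naive.items():
        treated_vals = on_treatment[region]
        p = welch_p_value(treated_vals, naive_vals)
        fold = (mean(treated_vals) + eps) / (mean(naive_vals) + eps)
        changed = fold >= min_fold_change or fold <= 1.0 / min_fold_change
        if p < alpha and changed:
            affected.add(region)
    return affected


def mask_regions(samples, affected):
    """Drop the affected regions from each training sample (a dict per sample)."""
    return [{r: v for r, v in s.items() if r not in affected} for s in samples]


# Illustrative, made-up methylation levels for two regions.
naive = {"region_a": [0.10, 0.12, 0.09, 0.11],
         "region_b": [0.50, 0.52, 0.48, 0.51]}
on_treatment = {"region_a": [0.40, 0.45, 0.38, 0.42],
                "region_b": [0.49, 0.53, 0.50, 0.47]}

affected = find_treatment_affected_regions(naive, on_treatment)
training_input = mask_regions([{"region_a": 0.40, "region_b": 0.50}], affected)
```

With these made-up values only `region_a` differs between the groups, so it is the only region removed from the training input.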
  • the concepts described herein integrate various technological improvements and correspondingly improve the functionality of a computing device as used in the detection and treatment of cancer in several ways.
  • the described concepts make the computer-based disease detection system more robust to the confounding effects of treatment on cfDNA methylation patterns.
  • the computer executing and implementing a classifier model trained and used according to the techniques described herein is better equipped to provide reliable disease, e.g., cancer, diagnoses and tumor fraction estimates, even when treatment has altered the genomic methylation landscape.
  • the described concepts improve the model’s pattern recognition capability, thereby enabling it to extract meaningful insights from biological data, a task that would be difficult, if not impossible, for humans to perform.
  • the improved model resulting from the implementation of the described concepts may be more suitable for clinical applications, such as minimal residual disease (MRD) detection and monitoring changes in ctDNA levels over time.
  • This improved functionality benefits healthcare professionals by providing them with more accurate and clinically relevant information for subject care.
  • the processes executed by the computer to improve the model involve complex calculations and data manipulations on a large amount of biological data that a human individual could not reasonably complete on their own or in their mind. Specifically, computationally intensive statistical tests are leveraged by the computer to evaluate methylation differences between sample sets, processes which cannot be completed by a human.
  • subject matter may be embodied as methods, devices, components, or systems. Accordingly, embodiments may, for example, take the form of hardware, software, firmware, or any combination thereof. The following detailed description is, therefore, not intended to be taken in a limiting sense.
  • Non-limiting cancer types that the concepts described herein may be applied to include, for example, breast cancer, lung cancer (e.g., non-small cell lung cancer (NSCLC)), prostate cancer, colorectal cancer, renal cancer, uterine cancer, pancreatic cancer, cancer of the esophagus, a lymphoma, head and neck cancer, ovarian cancer, a hepatobiliary cancer, a melanoma, cervical cancer, multiple myeloma, leukemia, thyroid cancer, bladder cancer, and gastric cancer.
  • the concepts described herein may be applicable to other disease types and other disease-detecting classifiers. More generally, the approaches described herein may be applied to scenarios in which subjects are subject to conditions other than cancer treatment, e.g., a treatment relevant for the disease state being detected, that may also introduce confounding methylation patterns in nucleic acids, including plasma cfDNA, and affect the performance of a relevant disease-detecting classifier.
  • FIG. 1A depicts an exemplary system for masking treatment-affected regions to enhance the performance of a classifier.
  • Exemplary system 100 includes a data collection component 10, a database 20, and device data intelligence component 30, operably connected to each other via network 40.
  • one or more of the components may be connected with another component locally without reliance on network connection; e.g., through a wired connection.
  • sequencing data of cell-free nucleic acids are used to illustrate the concepts.
  • data collection component 10 may include a device or machine with which sequencing data may be generated.
  • data collection component 10 may include one or more sequencing devices or a facility that uses one or more sequencing devices to generate nucleic acid (e.g., DNA or RNA) sequence data of biological samples.
  • data collection component 10 may be a database that receives sequencing information generated from one or more sequencing devices. Any suitable liquid or solid biological samples may be used for sequencing.
  • a biological sample may be cell-based, for example, one or more types of tissue.
  • a biological sample may be a sample that includes cell-free nucleic acid fragments.
  • biological samples include, but are not limited to, a blood sample (e.g., a cell-free DNA (cfDNA) sample, a cell-free RNA (cfRNA) sample, a serum sample, a plasma sample, a whole blood sample), a urine sample, a saliva sample, a tissue sample, a bone marrow sample, etc.
  • RNA from these samples may alternatively or additionally be sequenced.
  • sequencing data may include, but are not limited to, sequence read data of targeted genomic locations, partial or whole genome sequencing data of the genome represented by nucleic acid fragments in cell-free or cell-based samples, partial or whole genome sequencing data including one or more types of epigenetic modifications (e.g., methylation), or combinations thereof.
  • Data acquired by the data collection component 10 may be transferred to database 20 via network 40 or a local or network connection.
  • data collection component 10 may alternatively receive data from one or more sequencing devices.
  • the collected data may be analyzed by data intelligence component 30, via network 40 or a local or network connection.
  • FIG. 1B depicts exemplary functional modules that may be implemented to perform tasks of data intelligence component 30.
  • FIG. 1B depicts an exemplary computer system 110 for masking treatment-affected regions to enhance the performance of a classifier.
  • Exemplary system 110 achieves such functionalities by implementing, on one or more computer devices, user input and output (I/O) module 120, memory or database 130, data processing module 140, data analysis module 150, classification module 160, network communication module 170, and any other functional modules that may be needed for carrying out a particular task (e.g., an error correction or compensation module, a data compression module, etc.).
  • user I/O module 120 may further include an input sub-module, such as a keyboard, and an output sub-module, such as a display (e.g., a printer, a monitor, or a touchpad).
  • all functionalities may be performed by one computer system.
  • the functionalities are performed by more than one computer system.
  • a particular task may be performed by implementing one or more functional modules.
  • each of the enumerated modules itself may, in turn, include multiple sub-modules.
  • data processing module 140 may include a sub-module for data quality evaluation (e.g., for discarding very short sequence reads or sequence reads including obvious errors), a sub-module for normalizing numbers of sequence reads that align to different regions of a reference genome, a sub-module to compensate/correct guanine-cytosine (GC) biases, a sub-module for matching data associated with a cancer sample with other data associated with one or more non-cancer samples, etc.
  • a user may use I/O module 120 to manipulate data that is available either on a local device or can be obtained via a network connection from a remote service device or another user device.
  • I/O module 120 may allow a user, e.g., via a keyboard, a mouse, or a touchpad, to initiate or perform data analysis via a graphical user interface (GUI).
  • a user may manipulate data via voice control.
  • user authentication may be required before a user is granted access to the data being requested.
  • user I/O module 120 may be used to manage various functional modules. For example, a user may request via user I/O module 120 input data while an existing data processing session is in process.
  • a user may do so by selecting a menu option or typing in a command discreetly without interrupting the existing process.
  • a user may utilize user I/O module 120 to set various thresholds, configure sample matching settings, and/or provide other instructions to computer system 110 that dictate how treatment-affected regions are identified and/or masked.
  • a user may use any type of input to direct and control data processing and analysis via I/O module 120.
  • system 110 further comprises a memory or database 130.
  • database 130 comprises a local database that may be accessed via user I/O module 120.
  • database 130 comprises a remote database that may be accessed by user I/O module 120 via network connection.
  • database 130 is a local database that stores data retrieved from another device (e.g., a user device or a server).
  • memory or database 130 may store data retrieved in real-time from internet searches.
  • database 130 may send data to and receive data from one or more of the other functional modules, including, but not limited to, a data collection module (not shown), data processing module 140, data analysis module 150, classification module 160, network communication module 170, etc.
  • pre-treatment and post-treatment sample data may be stored on database 130.
  • database 130 may be a database local to the other functional modules.
  • database 130 may be a remote database that may be accessed by the other functional modules via wired or wireless network connection (e.g., via network communication module 170).
  • database 130 may include a local portion and a remote portion.
  • system 110 comprises a data processing module 140.
  • Data processing module 140 may receive data from I/O module 120 or database 130.
  • data processing module 140 may perform standard data processing algorithms, such as one or more of noise reduction, signal enhancement, normalization of counts of sequence reads, correction of GC bias, etc.
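Two of the standard processing steps listed above, normalization of read counts and GC-bias correction, might look like the sketch below. The source does not specify how either step is performed; the stratified-median correction, the function names, and the bin layout are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def normalize_depth(counts, scale=1_000_000):
    """Rescale per-bin read counts so they sum to `scale` (counts per million)."""
    total = sum(counts)
    return [c * scale / total for c in counts]


def correct_gc_bias(counts, gc_fractions, n_strata=10):
    """Rescale each bin by the median count of bins with similar GC content.

    Assumed approach: bins are grouped into `n_strata` equal-width GC strata,
    and each count is multiplied by (overall median / stratum median).
    """
    def stratum(gc):
        return min(int(gc * n_strata), n_strata - 1)

    by_stratum = defaultdict(list)
    for c, gc in zip(counts, gc_fractions):
        by_stratum[stratum(gc)].append(c)
    stratum_median = {k: median(v) for k, v in by_stratum.items()}
    overall = median(counts)

    corrected = []
    for c, gc in zip(counts, gc_fractions):
        m = stratum_median[stratum(gc)]
        corrected.append(c * overall / m if m > 0 else c)
    return corrected


# Made-up example: GC-rich bins receive roughly twice as many reads.
raw_counts = [100, 110, 200, 220]
gc = [0.32, 0.35, 0.61, 0.64]
corrected = correct_gc_bias(raw_counts, gc)
normalized = normalize_depth(raw_counts)
```

In this toy input the GC-rich bins are systematically deeper, and after correction the low-GC and high-GC bins land on the same scale.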
  • data processing module 140 may be configured to identify features in pre-treatment and post-treatment DNA methylation data.
  • computer system 110 may be able to identify one or more differentially methylated regions (DMRs), which are regions where DNA methylation varies significantly between different biological samples, e.g., a pre-treatment and a post-treatment sample. From these identified features, data processing module 140 may be configured to identify one or more treatment-affected features and subsequently remove them from consideration in one or more downstream processes (e.g., feature selection, classifier training, etc.).
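Removing treatment-affected features from downstream consideration can be sketched as a column mask over a feature matrix. The matrix layout, feature names, and the `weight` parameter are hypothetical; the down-weighting branch reflects the removing and down-weighting process that FIG. 3 is said to depict, but the actual weighting scheme is not given in the source.

```python
def mask_features(matrix, feature_names, affected, weight=0.0):
    """Remove (weight == 0) or down-weight (0 < weight < 1) affected columns.

    `matrix` is a list of rows, one per sample, aligned with `feature_names`.
    Returns the new matrix and the list of feature names it contains.
    """
    if weight == 0.0:
        keep = [i for i, name in enumerate(feature_names) if name not in affected]
        masked = [[row[i] for i in keep] for row in matrix]
        return masked, [feature_names[i] for i in keep]
    scaled = [
        [value * weight if name in affected else value
         for value, name in zip(row, feature_names)]
        for row in matrix
    ]
    return scaled, list(feature_names)


# Made-up feature matrix: two samples, three candidate DMR features.
names = ["dmr_1", "dmr_2", "dmr_3"]
X = [[0.2, 0.9, 0.4],
     [0.1, 0.8, 0.5]]

X_removed, kept_names = mask_features(X, names, {"dmr_2"})
X_downweighted, all_names = mask_features(X, names, {"dmr_2"}, weight=0.5)
```

Hard removal changes the feature dimension seen by the classifier, whereas down-weighting keeps the dimension fixed; which is preferable would depend on the downstream model.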
  • system 110 comprises a data analysis module 150.
  • the functionality of data analysis module 150 includes identifying and treating systematic errors in sequencing data, as described in connection with data processing module 140.
  • system 110 comprises a classification module 160, which may embody a “machine-learning model” or “trained classifier.”
  • a “machine-learning model” or “trained classifier” generally encompasses instructions, data, and/or a model configured to receive input, and apply one or more of a weight, bias, classification, or analysis on the input to generate an output.
  • the output may include, for example, a classification of the input, an analysis based on the input, a design, process, prediction, or recommendation associated with the input, or any other suitable type of output.
  • a machine-learning model is generally trained using training data, e.g., experiential data and/or samples of input data, which are fed into the model in order to establish, tune, or modify one or more aspects of the model, e.g., the weights, biases, criteria for forming classifications or clusters, or the like.
  • aspects of a machine-learning model may operate on an input linearly, in parallel, via a network (e.g., a neural network), or via any suitable configuration.
  • the machine-learning model may be trained on a combination of real and synthetic sample data.
  • the execution of the machine-learning model may include deployment of one or more machine-learning techniques, such as k-nearest neighbors, linear regression, logistic regression, random forest, gradient boosted machine (GBM), deep learning, a deep neural network, and/or any other suitable machine-learning technique that solves problems in the field of Natural Language Processing (NLP).
  • Supervised, semi-supervised, and/or unsupervised training may be employed.
  • supervised learning may include providing training data and labels corresponding to the training data, e.g., as ground truth.
  • Unsupervised approaches may include clustering, classification or the like.
  • K-means clustering or K-Nearest Neighbors may also be used, which may be supervised or unsupervised.
  • Combinations of K-Nearest Neighbors and an unsupervised cluster technique may also be used. Any suitable type of training may be used, e.g., stochastic, gradient boosted, random seeded, recursive, epoch or batch-based, etc.
  • a machine-learning model may be trained to analyze data from a test sample from a test subject whose status with respect to a medical condition is unknown and to subsequently classify the unknown test sample from the test subject based on the likelihood of the subject fitting into a particular category.
  • the one or more parameters may include a binomial probability score that is calculated based on logistic regression analysis.
  • the binomial probability score may correspond to the likelihood of a subject having a certain medical condition, such as cancer (e.g., NSCLC).
  • a score of over a predefined threshold may indicate that the subject associated with a test sample is more likely to have cancer than not have cancer.
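A logistic-regression probability score with a decision threshold, as described above, can be illustrated with the sketch below. The weights, bias, feature values, and the 0.5 default threshold are made up for illustration; a trained classifier would supply its own parameters and threshold.

```python
import math


def cancer_probability(features, weights, bias):
    """Logistic-regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def call_sample(features, weights, bias, threshold=0.5):
    """True when the score exceeds the predefined threshold."""
    return cancer_probability(features, weights, bias) > threshold


# Hypothetical model parameters and two hypothetical test samples.
weights = [1.5, 0.5]
bias = -1.0
high_signal = call_sample([2.0, 1.0], weights, bias)  # z = 2.5
low_signal = call_sample([0.0, 0.0], weights, bias)   # z = -1.0
```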
  • the one or more parameters may include a sequencing or methylation data distribution pattern correlating with the presence of cancer.
  • a subject associated with a test sample having sequencing or methylation data with a pattern resembling the cancer pattern to a sufficient degree may be predicted as having cancer.
  • a sequencing or methylation data distribution pattern may be identified in connection with a specific type of cancer, determining a tissue of origin or cancer signal origin, thus allowing a test sample to be classified as indicative of a certain cancer type.
  • network communication module 170 may be used to facilitate communications between a user device, one or more databases, and any other suitable system or device through a wired or wireless network connection.
  • Any communication protocol/device may be used, including, without limitation, a modem, an Ethernet connection, a network card (wireless or wired), an infrared communication device, a wireless communication device, and/or a chipset (such as a BluetoothTM device, an 802.11 device, a WiFi device, a WiMax device, cellular communication facilities, etc.), a near-field communication (NFC), a Zigbee communication, a radio frequency (RF) or radio-frequency identification (RFID) communication, a PLC protocol, a 3G/4G/5G/LTE based communication, and/or the like.
  • a user device having a user interface platform for processing/analyzing tumor fraction data may communicate with another user device with the same platform, a regular user device without the same platform (e.g., a regular smartphone), a remote server, a physical device of a remote IoT local network, a wearable device, a user device communicably connected to a remote server, etc.
  • an exemplary workflow 200 is provided for masking treatment-affected regions in training data that may be provided to a machine learning classifier. Aspects of the exemplary workflow 200 may be performed in accordance with some or all components described in FIGS. 1A and 1B.
  • predefined significance thresholds may be established. These thresholds may include p-values and adjusted p-values.
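One common way to obtain adjusted p-values is the Benjamini-Hochberg procedure. The source does not name a correction method, so its use here is an assumption, as are the function name and the example p-values. This is a sketch only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-feature p-values from a pre- vs. on-treatment comparison.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
# Features passing an assumed adjusted-p threshold of 0.05.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
```

Only the first feature passes here, because the correction raises the other small p-values above the assumed 0.05 cutoff.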
  • these treatment-affected features may exhibit a specific methylation pattern.
  • a single feature may be defined as a set of a predefined number of consecutive CpG sites (e.g., 5 consecutive CpG sites). If this predefined number of consecutive CpG sites exhibit methylation states associated with previously identified treatment-affected regions (e.g., all of the CpG sites in the set are methylated) then this may provide an indication that the relevant feature is treatment affected. In general, methyl variants active in on-treatment subjects but not pre-treatment subjects are sought.
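The window-based feature definition above can be sketched as a scan over per-site methylation calls. The binary encoding (1 = methylated, 0 = unmethylated), the function names, and the toy inputs are assumptions made for illustration.

```python
def fully_methylated_windows(states, window=5):
    """Start indices of every run of `window` consecutive CpG sites that are all methylated.

    `states` is a list of 0/1 calls ordered by genomic position.
    """
    return [
        i for i in range(len(states) - window + 1)
        if all(states[i:i + window])
    ]

def on_treatment_only(pre_states, on_states, window=5):
    """Windows that are fully methylated on treatment but not before treatment."""
    pre = set(fully_methylated_windows(pre_states, window))
    on = set(fully_methylated_windows(on_states, window))
    return sorted(on - pre)

# Hypothetical calls for the same seven CpG sites in two samples.
pre_sample = [1, 0, 1, 1, 0, 1, 1]
on_sample = [1, 1, 1, 1, 1, 1, 0]
candidates = on_treatment_only(pre_sample, on_sample)
```

The result lists windows that are active in the on-treatment sample only, which matches the kind of methyl variant the passage says is sought.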
  • tumor fraction monitoring may be utilized to help distinguish whether a genomic feature is the result of cancer treatment or is indicative of a new cancer.
  • a baseline tumor fraction (i.e., a reference point)
  • the tumor fraction may be monitored over time. If the tumor fraction remains relatively stable or decreases during or after treatment, but specific features emerge or change significantly in post-treatment samples, it may suggest that these features are related to the treatment process. Stated differently, treatment-affected features may be expected to appear at a frequency or level that is inconsistent with the tumor fraction.
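The consistency check described above can be sketched as a simple rule. The tolerance and minimum-increase values, the function name, and the example numbers are hypothetical; the source does not specify how "relatively stable" or "significantly" are quantified.

```python
def suggests_treatment_effect(tf_baseline, tf_followup,
                              feature_baseline, feature_followup,
                              tf_tolerance=0.1, min_feature_increase=0.05):
    """Flag a feature whose level rises while the tumor fraction does not.

    tf_* are tumor fractions; feature_* are feature levels (e.g., the fraction
    of fragments showing the methylation pattern). Thresholds are placeholders.
    """
    # "Stable or decreasing": follow-up is at most 10% above baseline (assumed).
    tumor_fraction_not_rising = tf_followup <= tf_baseline * (1 + tf_tolerance)
    # "Emerges or changes significantly": an absolute rise above a cutoff (assumed).
    feature_rose = (feature_followup - feature_baseline) >= min_feature_increase
    return tumor_fraction_not_rising and feature_rose

# Tumor fraction drops from 10% to 4%, yet the feature appears: likely treatment-related.
flag_a = suggests_treatment_effect(0.10, 0.04, 0.0, 0.2)
# Tumor fraction triples alongside the feature: consistent with tumor growth.
flag_b = suggests_treatment_effect(0.10, 0.30, 0.0, 0.2)
```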
  • computer system 110 may assign a score (e.g., represented by a numeral, a percentage, etc.) to each identified feature.
  • the score may be representative of the determinations made by computer system 110 that the feature is either resultant from a treatment or is cancer-derived.
  • computer system 110 may compare the score against a predetermined threshold (e.g., a treatment-affected feature determination threshold) that may be established to delineate between treatment-affected features and cancer-related features. More particularly, those features having a score above the threshold may be considered to be associated with cancer whereas those features having a score below the threshold may be considered to be treatment-derived, or vice versa.
  • each treatment-affected feature may be identified and masked before the formal feature selection process (i.e., in which the most informative features that contribute to accurate cancer detection are identified) to prevent their influence on a cancer-detecting classifier.
  • the target methyl variant regions may be excluded from the dataset used to train the classifier. These regions are effectively removed from the training data to allow the classifier to focus on the CpGs outside of the treatment-affected regions.
  • additional steps may be taken to correct this bias.
  • weights associated with these features may be minimized or set to zero. This may cause a classifier to focus more on correctly classifying non-treatment affected regions while giving less importance to the treatment-affected regions.
  • the weighting process may be sample based. For instance, if individual samples are associated with different degrees of treatment effect (i.e., some samples are more strongly affected by treatment than others), then different weights may be assigned to each sample. For example, samples from subjects with minimal treatment effects may receive higher weights, while those from subjects with significant treatment effects may receive lower weights. In other aspects, any samples from subjects with any treatment effects may receive zero weight. In still other aspects, all data from a sample showing treatment effects may be weighted less or given zero weight, or only the portion of the sample data showing treatment effects may be weighted less or given zero weight.
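The sample-based weighting options above can be sketched as follows. The per-sample "treatment effect" value in [0, 1], the linear weighting rule, and the names are assumptions; the source only states that more strongly affected samples may receive lower or zero weight.

```python
def sample_weights(effect_by_sample, zero_if_any_effect=False):
    """Map each sample ID to a training weight.

    `effect_by_sample` holds a hypothetical treatment-effect strength in [0, 1].
    By default the weight decreases linearly with the effect; with
    `zero_if_any_effect=True`, any sample showing an effect is dropped entirely.
    """
    weights = {}
    for sample_id, effect in effect_by_sample.items():
        if zero_if_any_effect and effect > 0:
            weights[sample_id] = 0.0
        else:
            weights[sample_id] = 1.0 - effect
    return weights

effects = {"s1": 0.0, "s2": 0.25, "s3": 1.0}
graded = sample_weights(effects)
strict = sample_weights(effects, zero_if_any_effect=True)
```

Weights of this kind could be passed to a training routine that accepts per-sample weights, so that strongly affected samples contribute less to the fitted model.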
  • the masking process may be initially implemented within a two-fold cross-validation approach.
  • in cross-validation, the dataset is divided into two subsets: a training subset and a validation subset.
  • the classifier may be trained on one subset and validated on the other.
  • the subsets may be swapped to promote comprehensive training and validation.
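The two-fold scheme with swapped subsets can be sketched as below. The function name is hypothetical, and the split is a plain positional halving with no shuffling or stratification, which a real pipeline would likely add.

```python
def two_fold_splits(n_samples):
    """Return two (train_indices, validation_indices) pairs with the halves swapped."""
    indices = list(range(n_samples))
    half = n_samples // 2
    first, second = indices[:half], indices[half:]
    # Fold 1 trains on the first half; fold 2 swaps the roles.
    return [(first, second), (second, first)]

for train_idx, val_idx in two_fold_splits(4):
    # Treatment-affected features would be identified and masked using the
    # training half only, then the classifier validated on the other half.
    assert not set(train_idx) & set(val_idx)
```

Running the masking step inside each fold, rather than once on the full dataset, keeps information from the validation half out of the masking decision.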
  • diagram 300 provides a schematic diagram of treatment-affected feature identification and removal.
  • Section 305 presents a first list of pre-treatment features and a second list of post-treatment features.
  • the pre-treatment features may be present in the samples of treatment-naive subjects whereas the post-treatment features may be present in the samples of on-treatment subjects.
  • each feature set may contain features that are associated with different diseases (e.g., adenocarcinoma and squamous cell carcinoma (SCC)). More particularly, different diseases may have certain features that are shared and some features that are distinct.
  • adenocarcinoma and SCC may each have a subset of features associated with their distinctive disease but may also have a plurality of shared features that are found in both diseases.
  • some features may differ between the pre-treatment features list and the post-treatment features list as a result of the therapy.
  • the list of post-treatment features may also contain a plurality of features associated with confounding treatment effects. Specifically, these are treatment-affected features that arise as a byproduct of a certain cancer treatment (e.g., CRT).
  • components of system 110 may be configured to identify and computationally mask these treatment-affected features, as previously described above, prior to the feature selection process at 310.
  • a subset of the total features presented in section 305 may remain, as presented in section 315. In some situations, each treatment-affected feature may be correctly identified and masked by system 110 prior to feature selection 310.
  • a subset of treatment-affected features may remain. These instances may arise, e.g., when computer system 100 retains a feature because it cannot confidently conclude that the feature is exclusively derived from a treatment effect, rather than being associated with cancer status (e.g., as a result of sample size limitations). For example, a score assigned to a particular feature may fall into a designated “buffer” range situated around a threshold (e.g., where features having scores above the threshold are considered to be cancer-related and features having scores below the threshold are considered to originate from a treatment). Computer system 100 may be configured to conclude that scores within this “buffer” range are too close to classify either way and may be configured to maintain these features in the selected feature pool.
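The threshold-with-buffer logic above can be sketched as a three-way decision. The threshold of 0.5, the buffer half-width of 0.1, the label strings, and the orientation (higher score means cancer-related) are assumptions chosen for illustration; the source notes the orientation could be reversed.

```python
def classify_feature(score, threshold=0.5, buffer=0.1):
    """Label a feature by its score, retaining it when the score is too close to call."""
    if abs(score - threshold) <= buffer:
        # Inside the buffer range: keep the feature in the selected pool.
        return "retained-uncertain"
    return "cancer-related" if score > threshold else "treatment-affected"

# Hypothetical scores for three features.
labels = {name: classify_feature(s) for name, s in
          {"f1": 0.9, "f2": 0.1, "f3": 0.55}.items()}
```

Features labeled `retained-uncertain` are the ones that would later be candidates for down-weighting rather than masking.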
  • components of system 110 may be configured to weigh some or all of the remaining treatment-affected features lower (e.g., assigned weights close to or at zero) than other features so that when a classifier model is trained on the selected features at 320, the impact of any treatment-affected feature is reduced.
  • section 325 illustrates that out of the two treatment-affected features that survived masking and feature selection (i.e., features 32 and 34), feature 32 was lower weighted than feature 34.
  • the majority of them were, which should help to improve classifier performance.
  • computer system 110 may further filter the selection of features that the classifier is trained on based on the identification that there is low noise associated with the feature. More particularly, the lack of noise may provide an indication that the methylation pattern exhibited by the CpG sites associated with the feature is rare in non-cancer cfDNA.
  • in FIG. 4, an exemplary workflow for identifying and masking treatment-affected regions in the genome is disclosed.
  • the exemplary workflow may be performed, e.g., by components of computer systems 100, 110 (shown in FIGS. 1A and 1B).
  • computer system 110 may receive a first and second set of sequencing data that is associated with a pre-treatment and a post-treatment sample, respectively.
  • each pre-treatment sample may be associated with a treatment-naive subject that has not undergone any type of therapy or treatment for a disease condition, or has not undergone any type of relevant therapy or treatment that may give rise to treatment-related features.
  • each post-treatment sample may be associated with an on-treatment subject that has received some type of therapy for their disease (e.g., CRT).
  • each of the first and second set of sequencing data may be DNA methylation data derived, e.g., from a WGBS or cfDNA approach.
  • computer system 110 may be configured to compare a first feature set in the first set of sequencing data to a second feature set in the second set of sequencing data. More particularly, in an aspect, methylation analysis may be conducted on each set of sequencing data. This analysis may be conducted to identify the methylation level at each CpG site (e.g., which may be represented by a “beta value” that ranges from 0 to 1). The results of the methylation analysis may be utilized to identify a plurality of features in the pre- and post-treatment methylation data sets that are representative of the genomic information conveyed therein and/or that most distinguish each data set from the other.
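For sequencing-based methylation data, a per-site beta value is commonly computed as the fraction of reads covering the site that are methylated. The sketch below assumes that definition; the function name and the read counts are illustrative.

```python
def beta_value(methylated_reads, unmethylated_reads):
    """Fraction of reads at a CpG site that are methylated, in [0, 1].

    Returns None when the site has no coverage, since the ratio is undefined.
    """
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total

# Hypothetical (methylated, unmethylated) read counts at three CpG sites.
counts = [(3, 1), (0, 8), (0, 0)]
betas = [beta_value(m, u) for m, u in counts]
```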
  • At step 415, at least one treatment-affected feature in the second feature set of the second set of sequencing data may be determined based on the comparison conducted at step 410. More particularly, in an aspect, the at least one treatment-affected feature may be identified as a genomic region that contains a specific methylation pattern (e.g., a methylation pattern that is emblematic of previously observed methylation patterns that originated in response to a treatment) and that was not also present in the first feature set. The coordinates of the genomic regions associated with the treatment-affected region(s) may be flagged and communicated to computer system 110. In another aspect, the treatment-affected feature may correspond to a genomic region containing a methylation pattern that is present in the first feature set but is absent in the second feature set.
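The two cases in step 415 (a pattern present only after treatment, or present only before treatment) amount to set differences between the feature sets. The sketch below assumes each feature is identified by a hypothetical `(chromosome, start, end)` tuple and ignores the statistical testing a real comparison would involve.

```python
def treatment_affected_features(pre_features, post_features):
    """Split the differences between two feature sets into gained and lost features."""
    pre, post = set(pre_features), set(post_features)
    return {
        "gained": sorted(post - pre),  # present post-treatment only
        "lost": sorted(pre - post),    # present pre-treatment only
    }

pre_set = [("chr1", 100, 200), ("chr2", 500, 600)]
post_set = [("chr1", 100, 200), ("chr3", 50, 150)]
flagged = treatment_affected_features(pre_set, post_set)
```

The coordinates collected in `flagged` correspond to the regions that would be flagged for the exclusion step.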
  • a methylation state of one or more CpG sites associated with a feature in the first feature set may be affected so that the feature no longer appears as such in the second feature set. For instance, this may result when an administered treatment eliminated a normally shedding population of non-cancer cells, thereby causing a feature associated with those non-cancer cells in the untreated population to be absent in the treated population.
  • computer system 110 may be configured to implement an exclusion process on the second feature set. More particularly, computer system 110 may leverage knowledge of the genomic locations associated with the treatment-affected regions to remove or computationally mask the treatment-affected region(s) from the second feature set so that they are not considered or utilized in any type of downstream processes (e.g., formal feature selection, classifier training, etc.).
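The exclusion process can be sketched as a coordinate-based filter that drops any feature overlapping a flagged region. The half-open `(chromosome, start, end)` convention, the function names, and the example coordinates are assumptions for illustration.

```python
def overlaps(region_a, region_b):
    """True when two half-open (chrom, start, end) intervals share at least one base."""
    chrom_a, start_a, end_a = region_a
    chrom_b, start_b, end_b = region_b
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a

def mask_features(features, flagged_regions):
    """Keep only the features that do not overlap any treatment-affected region."""
    return [
        f for f in features
        if not any(overlaps(f, r) for r in flagged_regions)
    ]

features = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
flagged_regions = [("chr1", 150, 250)]
kept = mask_features(features, flagged_regions)
```

Applying this filter before formal feature selection means the masked regions never reach classifier training.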
  • computer system 110 may determine whether any treatment-affected regions still remain.
  • computer system 110 may apply a weighting operation to the features wherein the treatment-affected features are down-weighted (e.g., to zero or close to zero) so that their subsequent impact on classifier training is minimized.
  • a classifier of the described embodiments may be leveraged to generate a quantitative estimate of the tumor fraction that is robust to other various biological processes occurring within a subject, e.g., inflammation, cells dying, etc.
  • a classifier of the described embodiments may be capable of improving the limit of detection (LOD) for MRD analysis. More particularly, by masking treatment-affected regions in cfDNA methylation patterns, the MRD analysis may be more specific to cancer-related methylation changes. This improved specificity may inhibit the occurrence of false-positive results, meaning that subjects are not incorrectly identified as having residual disease when they do not, thereby improving LOD in terms of precision.
  • computer system 110 may receive a first and second set of sequencing data that is associated with a first sample associated with a first treatment condition and a second sample associated with a second treatment condition, respectively.
  • the first sample associated with the first treatment condition may be associated with an on-treatment subject that has received a first type of therapy for their disease (e.g., CRT), and the second sample associated with the second treatment condition may also be associated with an on-treatment subject that has also received the first type of therapy for their disease, as well as a second type of therapy (e.g., immunotherapy, hormone therapy, etc.).
  • the first sample associated with the first treatment condition may be associated with an on-treatment subject that has received a therapy for their disease at a first time (e.g., X days prior to sample collection) and the second sample associated with the second treatment condition may be associated with an on-treatment subject that has received the same therapy for their disease at a second time (e.g., X+Y days prior to sample collection).
  • any process discussed in this disclosure that is understood to be computer-implementable may be performed by one or more processors of a computer system, such as system environment 110, as described above.
  • a process or process step performed by one or more processors may also be referred to as an operation.
  • the one or more processors may be configured to perform such processes by having access to instructions (e.g., software or computer-readable code) that, when executed by the one or more processors, cause the one or more processors to perform the processes.
  • the instructions may be stored in a memory of the computer server.
  • a processor may be a central processing unit (CPU), a graphics processing unit (GPU), or any other suitable type of processing unit.
  • a computer system such as system environment 110, may include one or more computing devices. If the one or more processors of the computer system are implemented as a plurality of processors, the plurality of processors may be included in a single computing device or distributed among a plurality of computing devices. If a system environment comprises a plurality of computing devices, the memory of the computer system may include the respective memory of each computing device of the plurality of computing devices.
  • FIG. 5 is a simplified functional block diagram of a computer system 500 that may be configured as a computing device for executing the processes described herein, according to exemplary embodiments of the present disclosure.
  • any of the systems herein may be an assembly of hardware including, for example, a data communication interface 520 for packet data communication.
  • the platform also may include a central processing unit (“CPU”) 502, in the form of one or more processors, for executing program instructions.
  • the platform may include an internal communication bus 508, and a storage unit 506 (such as ROM, HDD, SSD, etc.) that may store data on a computer readable medium 522, although the system 500 may receive programming and data via network communications via electronic network 525 (e.g., voice, video, audio, images, or any other data over the electronic network 525).
  • the system 500 may also have a memory 504 (such as RAM) storing instructions 524 for executing techniques presented herein, although the instructions 524 may be stored temporarily or permanently within other modules of system 500 (e.g., processor 502 and/or computer readable medium 522).
  • the system 500 also may include input and output ports 512 and/or a display 510 to connect with input and output devices such as keyboards, mice, touchscreens, monitors, displays, etc.
  • the various system functions may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load. Alternatively, the systems may be implemented by appropriate programming of one computer hardware platform.
  • the term “based on” means “based at least in part on.”
  • the singular forms “a,” “an,” and “the” include plural referents unless the context dictates otherwise.
  • the term “exemplary” is used in the sense of “example” rather than “ideal.”
  • the terms “comprises,” “comprising,” “includes,” “including,” or other variations thereof, are intended to cover a non-exclusive inclusion such that a process, method, or product that comprises a list of elements does not necessarily include only those elements, but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus.
  • the term “user” generally encompasses any person or entity, such as a researcher and/or a care provider (e.g., a doctor, etc.), that may desire information, resolution of an issue, or engage in any other type of interaction with a provider of the systems and methods described herein (e.g., via an application interface resident on their electronic device, etc.).
  • the term “electronic application” or “application” may be used interchangeably with other terms like “program,” or the like, and generally encompasses software that is configured to interact with, modify, override, supplement, or operate in conjunction with other software.
  • Storage type media include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks.
  • Such communications may enable loading of the software from one computer or processor into another, for example, from a management server or host computer of the mobile communication network into the computer platform of a server and/or from a server to the mobile device.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • the physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software.
  • terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Zoology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Wood Science & Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Hospice & Palliative Care (AREA)
  • Microbiology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Biochemistry (AREA)
  • Evolutionary Computation (AREA)
  • Oncology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Systems and methods of the disclosure may include receiving, at a computing device, a first set of nucleic acid methylation data and a second set of nucleic acid methylation data, the first set of nucleic acid methylation data being associated with a pre-treatment sample and the second set of nucleic acid methylation data being associated with a post-treatment sample. The method may include comparing, using a processor of the computing device, a first feature set of the first set of nucleic acid methylation data against a second feature set of the second set of nucleic acid methylation data; determining, based on the comparison, at least one treatment-affected feature in the second feature set of the second set of nucleic acid methylation data; and implementing, based on the determination, an exclusion process on the at least one treatment-affected feature in the second feature set.
PCT/US2024/060122 2023-12-14 2024-12-13 Systems and methods for masking treatment-affected regions in the genome to improve classifier performance Pending WO2025129061A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363610143P 2023-12-14 2023-12-14
US63/610,143 2023-12-14

Publications (1)

Publication Number Publication Date
WO2025129061A1 true WO2025129061A1 (fr) 2025-06-19

Family

ID=94117204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/060122 Pending WO2025129061A1 (fr) 2023-12-14 2024-12-13 Systems and methods for masking treatment-affected regions in the genome to improve classifier performance

Country Status (1)

Country Link
WO (1) WO2025129061A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2021245992A1 (en) * 2020-03-31 2022-11-10 Freenome Holdings, Inc. Methods and systems for detecting colorectal cancer via nucleic acid methylation analysis
US20230272486A1 (en) * 2022-02-17 2023-08-31 Grail, Llc Tumor fraction estimation using methylation variants

Similar Documents

Publication Publication Date Title
Quazi Artificial intelligence and machine learning in precision and genomic medicine
JP7689557B2 (ja) Integrated machine learning framework for estimating homologous recombination deficiency
CN112740239B (zh) Transcription factor analysis
US11972870B2 (en) Systems and methods for predicting patient outcome to cancer therapy
CN112289455A (zh) Artificial intelligence neural network learning model construction system and construction method
WO2018223066A1 (fr) Methods and systems for identifying or monitoring lung disease
WO2021258026A1 (fr) Detection of molecular response and progression from circulating cell-free DNA
US20240120096A1 (en) Computational Method And System For Diagnostic And Therapeutic Prediction From Multimodal Data
JP2024500881A (ja) Taxonomy-independent cancer diagnosis and classification using microbial nucleic acids and somatic mutations
Rajalaxmi et al. A systematic review of lung cancer prediction using machine learning algorithm
WO2025129061A1 (fr) Systems and methods for masking treatment-affected regions in the genome to improve classifier performance
Hobbs et al. Biostatistics and bioinformatics in clinical trials
Gupta et al. Genome Sequence Identification using Deep Learning for Lung Cancer Diagnosis
WO2023129687A1 (fr) Multiclass classification model and multi-tier classification scheme for comprehensive determination of cancer presence and type based on analysis of genetic information, and systems for implementing the same
Fardin et al. Identification of Multiple Hypoxia Signatures in Neuroblastoma Cell Lines by l1‐l2 Regularization and Data Reduction
Zhang et al. Radio-iBAG: Radiomics-based integrative Bayesian analysis of multiplatform genomic data
WO2025155784A1 (fr) Systems and methods for identifying methylation signatures associated with clonal hematopoiesis
US20250037876A1 (en) Systems and methods for developing and utilizing a hematologic prognostic classifier
US20240038335A1 (en) Systems and methods for detecting disease subtypes
US12500000B2 (en) Systems and methods for predicting patient outcome to cancer therapy
Shi et al. Omics-based evaluation of m6A modification patterns in uveal melanoma and their prognostic implications
Alex et al. Cancer diagnosis using deep learning
WO2025217381A1 (fr) Systems and methods for building and using a plasma cell disorder classifier to perform informed feature analysis
Vincentina Mary et al. Integrating HRPGW optimization with ALSTM for enhanced cancer diagnosis in gene expression microarray data analysis.
Wankhede et al. Deep Neural Network-Based Risk Prediction of Glioblastoma Multiforme Recurrence

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 24829001

Country of ref document: EP

Kind code of ref document: A1