WO2025235365A1 - Methods and systems for improved methylation sequencing - Google Patents
Methods and systems for improved methylation sequencing
- Publication number
- WO2025235365A1 (PCT/US2025/027719)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cancer
- conversion
- nucleic acid
- methylation
- resistant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Definitions
- Nucleic acid methylation can represent tumor characteristics and phenotypic states and has potential for use in early disease detection and/or diagnosis as well as personalized medicine.
- deoxyribonucleic acid (DNA) methylation abnormalities may be associated with various stages of cancer, from tumor initiation to cancer progression and metastasis. Aberrant DNA methylation patterns occur early in the pathogenesis of cancer and can provide a mechanism for early cancer detection. These properties enable the use of DNA methylation patterns for cancer diagnosis.
- the present disclosure provides methods and systems for nucleic acid library preparation for methylation sequencing which increases library yield and reduces undesired PCR products thereby minimizing signal loss and reducing biases that can be introduced in standard library preparation methods. Further, the presently disclosed library preparation methods and systems may improve the quality and accuracy of nucleic acid methylation sequencing and uses thereof, for example, in detection of disease. More accurate and complete information regarding methylation state permits higher quality feature generation for use in machine learning models and classifier generation.
- the present disclosure provides methods of preparing a sequencing library for methylation sequencing of one or more nucleic acid molecules of a biological sample or a derivative thereof, comprising: (a) obtaining a nucleic acid composition, wherein the nucleic acid composition comprises a plurality of double-stranded nucleic acid molecules obtained or derived from the biological sample; (b) ligating a conversion-resistant nucleic acid adapter to a double-stranded nucleic acid molecule of the plurality of double-stranded nucleic acid molecules to generate a conversion-resistant adapter-ligated nucleic acid molecule, wherein the nucleic acid adapter comprises modified cytosines that are resistant to base conversion by a methylation conversion method; and (c) subjecting the conversion-resistant adapter-ligated nucleic acid molecule to conditions sufficient to convert unmethylated cytosines to uracils using the methylation conversion method, thereby generating a converted conversion-resistant adapter-ligated nucleic acid molecule.
- the ligating in (b) further comprises treating with a deoxyribonucleic acid (DNA) ligase.
- the conversion-resistant nucleic acid adapter comprises one or more modified cytosines that are deamination-resistant.
- the conversion-resistant nucleic acid adapter comprises one or more modified cytosine bases selected from a group consisting of propynyl-C, pyrrolo-C, and 5-methylcytosine (5mC).
- the one or more modified cytosines that are deamination-resistant are propynyl-C residues.
- the one or more modified cytosines that are deamination-resistant are pyrrolo-C residues.
- the conversion-resistant nucleic acid adapter comprises one or more 5mC residues.
- the conversion-resistant nucleic acid adapter does not comprise methylated cytosine bases or unmethylated cytosine bases.
- the method further comprises amplifying the converted conversion-resistant adapter-ligated nucleic acid molecule.
- the amplifying comprises polymerase chain reaction (PCR).
- the method further comprises determining a nucleic acid sequence of the amplified adapter-ligated nucleic acid molecule or derivative thereof.
- the method further comprises sequencing the amplified adapter-ligated nucleic acid molecule or derivative thereof to generate sequencing data.
- the method further comprises analyzing the sequencing data to generate a methylation profile of the nucleic acid molecule of the biological sample or derivative thereof.
- the analyzing further comprises comparing the sequencing data to a reference sequence.
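By way of illustration only, the comparison of converted sequencing data to a reference sequence can be sketched as follows. The sketch assumes a gap-free, already aligned forward-strand read in which an unmethylated cytosine reads as T after conversion and amplification, and a protected cytosine still reads as C. The function name `call_methylation` is hypothetical and is not part of the disclosure.

```python
def call_methylation(reference: str, read: str) -> dict:
    """Classify each reference cytosine covered by an aligned read.

    Assumes the read is gap-free, has the same length as the reference
    window, and derives from the forward strand after conversion and PCR.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue  # only reference cytosines carry methylation information here
        if read_base == "C":
            calls[pos] = "methylated"      # cytosine survived conversion
        elif read_base == "T":
            calls[pos] = "unmethylated"    # cytosine was converted
        else:
            calls[pos] = "ambiguous"       # sequencing error or variant
    return calls


calls = call_methylation("ACGTCCGA", "ACGTTTGA")
```

Here `calls` maps zero-based reference positions to a call; real pipelines would also need to handle reverse-strand reads, gaps, and base qualities, which this sketch omits.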
- the nucleic acid molecule comprises deoxyribonucleic acid (DNA).
- the DNA comprises cell-free DNA.
- the biological sample is a cell-free biological sample.
- the cell-free biological sample is a plasma sample.
- the methylation conversion method comprises a minimally-destructive conversion method comprising treatment with one or more enzymes.
- the minimally-destructive conversion method comprises treatment with a ten eleven translocation (TET) enzyme.
- the treatment with TET enzyme comprises providing reaction conditions comprising 1X-18X freshly-diluted Fe2+.
- the minimally-destructive conversion method does not comprise treatment with bisulfite.
- the present disclosure provides methods for performing methylation sequencing of a cell-free deoxyribonucleic acid (cfDNA) sample from a subject, comprising: (a) extracting the cfDNA sample from the subject, wherein said cfDNA sample comprises double-stranded nucleic acid (dsDNA) molecules; (b) ligating conversion-resistant adapters comprising one or more deamination-resistant modified cytosines to the dsDNA molecules, wherein the dsDNA molecules comprise one or more unmethylated cytosine residues; (c) enzymatically converting the one or more unmethylated cytosine residues to uracils in the dsDNA molecules ligated to the conversion-resistant adapters; (d) amplifying the converted dsDNA molecules comprising the conversion-resistant adapters of (c) to produce amplified dsDNA molecules; and (e) determining a nucleic acid sequence of the amplified dsDNA molecules comprising the conversion-resistant adapters.
- the method further comprises: (f) determining a methylation profile of the cfDNA sample from the subject based at least in part on (e); (g) classifying by a trained machine learning algorithm the methylation profile of the cfDNA sample as indicative of a presence of a cancer in the subject; and (h) outputting a report that identifies the cfDNA sample as negative for the cancer if the trained machine learning algorithm classifies the cfDNA sample as negative for the cancer at a specified confidence level.
- the determining the methylation profile comprises performing a hypermethylation analysis.
- the cancer comprises two or more of colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
- the cancer comprises colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
- the cancer comprises the colorectal cancer.
- the cancer comprises the lung cancer.
- the cancer comprises the liver cancer.
- the cancer comprises the pancreatic cancer.
- the method further comprises: (f) determining a baseline methylation profile of the cfDNA sample of the subject at a baseline methylation state based at least in part on (e); (g) determining a test methylation profile of a biological sample of the subject at one or more time points following the baseline methylation state of (f); and (h) determining a change in the test methylation profile as compared to the baseline methylation profile, wherein the change is indicative of a change in a minimal residual disease status of the subject.
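As a minimal sketch of step (h), a change in the test methylation profile relative to the baseline can be computed region by region. The representation of a profile as a mapping from region name to methylation fraction, the region names, and the function name `methylation_change` are assumptions made for illustration. How a given change maps to a minimal residual disease status is not specified here.

```python
def methylation_change(baseline: dict, test: dict) -> dict:
    """Return test minus baseline methylation fraction for shared regions."""
    shared = baseline.keys() & test.keys()
    return {region: test[region] - baseline[region] for region in sorted(shared)}


# Hypothetical profiles: region name -> fraction of methylated fragments.
baseline = {"region_A": 0.10, "region_B": 0.50}
follow_up = {"region_A": 0.40, "region_B": 0.50, "region_C": 0.20}

delta = methylation_change(baseline, follow_up)
```

Regions present in only one profile (such as `region_C`) are skipped rather than treated as zero, so that missing coverage is not mistaken for a methylation change.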
- the minimal residual disease status is selected from the group consisting of response to a treatment, tumor load, residual tumor post-surgery, relapse, secondary screen, primary screen, and cancer progression.
- determining the baseline methylation profile comprises a hypermethylation analysis.
- FIG. 1 shows a hypothetical single-stranded deoxyribonucleic acid (ssDNA) template region (primer binding site) of a Y-adapter comprising methylated cytosines ligated to a cfDNA insert.
- FIGs. 2A-2E provide conversion-resistant modified cytosines (FIGs. 2A-2C) which can be used in the conversion-resistant adapter design strategies tested and disclosed herein.
- Specific deamination-resistant modified cytosines can include propynyl-C, pyrrolo-C, and 5mC.
- FIG. 2E shows an adapter region comprising a minimal number of unmethylated cytosines being carried through oxidation and deamination, followed by PCR using a primer that anneals to the fully converted sequence.
- FIG. 3 provides a comparison of dsDNA library concentration (library yields) based upon differing types of conversion-resistant adapters and alternative adapter design types used during library preparation.
- 10 nanograms (ng) (Top graphs) or 3.5 ng (Bottom graphs) of contrived cfDNA was ligated to the appropriate adapters. Material was then subjected to TET2-mediated oxidation with DTT (SOP; Left graphs) or without DTT (Right graphs) included in the reaction buffer prior to deamination by APOBEC.
- FIG. 4 provides a comparison of dsDNA library concentrations (library yields) between differing oxidation reactions (TET2-mediated reactions) comprising: (i) 1X (final reaction concentration) pre-plated thawed Fe2+ (SOP) or (ii) 6X (final reaction concentration) freshly-diluted Fe2+ with 10 ng of input DNA.
- FIG. 5 provides a comparison of CpH protection from EM-seq reactions under differing oxidation conditions (e.g., final reaction Fe2+ concentrations): (i) 1X concentrated pre-plated thawed Fe2+ (SOP); (ii) 1X concentrated freshly-diluted Fe2+; (iii) 3X concentrated freshly-diluted Fe2+; (iv) 6X concentrated freshly-diluted Fe2+; (v) 9X concentrated freshly-diluted Fe2+; (vi) 12X concentrated freshly-diluted Fe2+; (vii) 18X concentrated freshly-diluted Fe2+; (viii) 36X concentrated freshly-diluted Fe2+; and (ix) 216X concentrated freshly-diluted Fe2+.
- FIG. 6 provides a comparison of CpH protection rates from EM-seq reactions under differing reaction conditions (e.g., final Fe2+ concentrations and master mix pH levels) with different reagent lots: (i) 1X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Left graph); (ii) 3X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Middle graph); (iii) 6X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Right graph).
- FIG. 7 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
- Plasma cell-free DNA may generally refer to DNA molecules that circulate in the acellular portion of blood. Circulating nucleic acids in blood may arise from necrotic or apoptotic cells indicative of disease, such as cancer. In cancer, circulating DNA bears hallmark signs of the disease, including mutations in oncogenes and microsatellite alterations. This circulating DNA may be referred to as circulating tumor DNA (ctDNA). Viral genomic sequences, DNA, or RNA in plasma are potential biomarkers for disease.
- the cell-free fraction of blood may be blood serum or blood plasma.
- the term “cell-free fraction” of a biological sample as used herein generally refers to a fraction of the biological sample that is substantially free of cells.
- the term “substantially free of cells” may generally refer to a preparation from the biological sample comprising fewer than about 20,000 cells per ml, fewer than about 2,000 cells per ml, fewer than about 200 cells per ml, or fewer than about 20 cells per ml.
- Genomic DNA (gDNA) refers to non-fragmented DNA that is released from white blood cells contaminating the blood cell-free fraction. To mitigate gDNA contamination of samples, a highly controlled sample processing workflow may be implemented, and specimens may be screened for the presence of gDNA.
- nucleic acid generally refers to a polynucleotide comprising two or more nucleotides. It may be DNA or RNA. Nucleic acid may be genomic or derived from the genome of a eukaryotic or prokaryotic cell, or synthetic, cloned, amplified, or reverse transcribed. In certain embodiments of the methods and compositions, nucleic acid may refer to genomic DNA as the context requires. The nucleic acid may be a polymeric form of nucleotides of any length, either deoxyribonucleotides (dNTPs) or ribonucleotides (rNTPs), or analogs thereof.
- Nucleic acids may have any three-dimensional structure, and may perform any function, known or unknown.
- Non-limiting examples of nucleic acids include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA, ribosomal RNA, short interfering RNA (siRNA), short-hairpin RNA (shRNA), micro-RNA (miRNA), ribozymes, cDNA, recombinant nucleic acids, branched nucleic acids, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers.
- a nucleic acid may comprise one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs. If present, modifications to the nucleotide structure may be made before or after assembly of the nucleic acid.
- the sequence of nucleotides of a nucleic acid may be interrupted by non-nucleotide components.
- a nucleic acid may be further modified after polymerization, such as by conjugation or binding with a reporter agent.
- a “variant” nucleic acid is a polynucleotide having a nucleotide sequence identical to that of its original nucleic acid except having at least one nucleotide modified, for example, deleted, inserted, or replaced. The variant may have a nucleotide sequence with at least about 80%, 90%, 95%, or 99% identity to the nucleotide sequence of the original nucleic acid.
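For illustration, percent identity between an original sequence and a variant can be computed over a pairwise alignment. The sketch below assumes the two sequences have already been aligned to equal length with `-` marking gaps, and it divides by the alignment length. Other definitions of identity (for example, dividing by the shorter sequence) exist, so this choice is an assumption rather than the definition used in the disclosure.

```python
def percent_identity(original: str, variant: str) -> float:
    """Percent of matching positions between two pre-aligned sequences.

    Both strings must have the same length; '-' marks an alignment gap
    and always counts as a mismatch.
    """
    if not original or len(original) != len(variant):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    matches = sum(
        a == b and a != "-"
        for a, b in zip(original.upper(), variant.upper())
    )
    return 100.0 * matches / len(original)


# One substitution in ten aligned positions.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
```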
- “methylation conversion methods”, “methylation enrichment methods”, or “methylation conversion agents”, as used herein, generally refer to methods in which a nucleic acid molecule is subjected to conditions sufficient to convert unmethylated cytosines in the nucleic acid molecule to uracils. The methods are useful for differentiating methylated cytosines from unmethylated cytosines in a nucleic acid molecule.
- Methylation conversion methods or methylation conversion agents can include bisulfite conversion, and bisulfite sequencing may be used for DNA methylation analysis. Bisulfite sequencing is a convenient and effective method of mapping DNA methylation to individual bases.
- methylation conversion methods or methylation conversion agents can include enzymatic methylation (EM) conversion.
- Enzymatic methylation conversion is mediated by non-destructive enzymatic reactions involving a ten-eleven translocation (TET) enzyme and a cytosine-deaminating enzyme (e.g., APOBEC) to convert unmethylated (but not methylated) cytosines to uracils.
- Other embodiments such as Tet-assisted pyridine borane sequencing (TAPS) combine enzymatic reactions such as TET together with chemical treatments (e.g., pyridine borane).
- the terms “enzymatic methylation” or “enzymatic methyl” or “EM conversion” or “EM-seq” generally refer to a method in which a nucleic acid molecule is subjected to conditions sufficient to convert unmethylated cytosines in the nucleic acid molecule to uracils by treatment with one or more enzymes. In some cases, the method does not comprise treatment with bisulfite (e.g., chemical treatment).
- conversion-resistant adapter generally refers to oligonucleotides used as adapters for nucleic acid library preparation in which one or more of the cytosine (C) residues in the adapter have been replaced with a modified cytosine that is resistant to methylation conversion or alteration by a methylation enrichment method (e.g., a “deamination-resistant modified cytosine” or a “methylation conversion agent resistant nucleotides”).
- the terms “deamination-resistant modified cytosine” or “methylation conversion agent resistant nucleotides” generally refer to one or more modified cytosine nucleotides in a conversion-resistant adapter that are not chemically or enzymatically altered by treatment with a methylation conversion agent to change the base pairing specificity of the nucleotide base.
- propynyl-C, pyrrolo-C, and 5-methylcytosine (5mC) are deamination-resistant modified cytosines which are conversion-resistant nucleotides and can be included in the conversion-resistant adapters used in the library preparation methods disclosed herein in conjunction with sodium bisulfite or enzymatic methylation conversion.
- these deamination-resistant modified cytosines within the conversion-resistant adapters are not deaminated when exposed to sodium bisulfite or methylation conversion enzymes.
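The effect of deamination-resistant cytosines in the adapter can be pictured with a simplified in silico conversion. In this sketch, which is an idealized model and not a description of the actual chemistry, every cytosine is converted to uracil unless its position is marked as protected. The adapter and insert sequences and the function name `convert` are hypothetical.

```python
def convert(sequence: str, protected: set) -> str:
    """Convert every C to U unless its index is in `protected`."""
    return "".join(
        "U" if base == "C" and i not in protected else base
        for i, base in enumerate(sequence)
    )


adapter = "ACCGT"  # hypothetical adapter; its cytosines are modified (e.g., 5mC)
insert = "TCGCA"   # hypothetical insert carrying only unmethylated cytosines
molecule = adapter + insert

# Positions of the modified, conversion-resistant adapter cytosines.
adapter_cytosines = {i for i, base in enumerate(adapter) if base == "C"}

converted = convert(molecule, adapter_cytosines)
```

Because the adapter portion of `converted` is unchanged, a primer designed against the original adapter sequence would still match it after conversion, while the insert cytosines are converted.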
- methylcytosine dioxygenase generally refers to an enzyme that converts 5mC to 5hmC.
- methylcytosine dioxygenases include, e.g., ten eleven translocation (TET) enzymes, e.g., TET1, TET2, TET3, Naegleria TET, and genetically engineered versions and/or variants thereof.
- TET2 is an example of a methylcytosine dioxygenase that oxidizes at least 90%, at least 92%, at least 94%, at least 96%, at least 98%, or at least 99% of all 5mC.
- the methylcytosine dioxygenase enzyme (e.g., TET2) requires cofactors for optimal performance of the oxidation reaction (e.g., conversion of 5mC to 5hmC, 5hmC to 5fC, and 5fC to 5caC).
- the cofactor is iron (e.g., Fe, Fe2+, or Fe3+), which can be added to the oxidation reaction alone or together with other reagents that can control the pH and the redox state of the oxidation reaction.
- the iron is Fe2+.
- the Fe2+ is provided as ferrous salt (e.g., ferrous sulfate or ferrous gluconate).
- the Fe2+ can be stored frozen at up to 10X the reaction concentration (e.g., 400 µM if the final concentration is 40 µM) for various times ranging from 1 hour to 1 year prior to use.
- the concentrated frozen Fe2+ is allowed to thaw, then freshly diluted to a predetermined stock concentration. The freshly diluted Fe2+ can then be added to the reaction site (e.g., well) in an amount to achieve the desired final reaction concentration of Fe2+ together with the oxidation reaction reagents and sample to initiate the oxidation reaction.
- the Fe2+ is freshly prepared at a desired stock concentration. The freshly prepared Fe2+ can then be added to the reaction site (e.g., well) in an amount to achieve the desired final reaction concentration of Fe2+ together with oxidation reaction reagents and sample to initiate the oxidation reaction.
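The amount of diluted Fe2+ stock to add for a desired final reaction concentration follows from the usual dilution relation C1 x V1 = C2 x V2. The sketch below is unit-agnostic: both concentrations must be given in the same unit. The 50 microliter reaction volume and the function name `stock_volume_ul` are assumptions for illustration only, and the 400 and 40 values mirror the 10X storage example above.

```python
def stock_volume_ul(stock_conc: float, final_conc: float, reaction_volume_ul: float) -> float:
    """Volume of stock to add so the reaction reaches `final_conc`.

    Uses C1 * V1 = C2 * V2; both concentrations must share one unit.
    """
    if stock_conc <= 0 or final_conc < 0 or reaction_volume_ul <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_conc > stock_conc:
        raise ValueError("stock is too dilute for the requested final concentration")
    return final_conc * reaction_volume_ul / stock_conc


# A stock at ten times the final concentration, in an assumed 50 uL reaction.
volume = stock_volume_ul(stock_conc=400.0, final_conc=40.0, reaction_volume_ul=50.0)
```

With a 10X stock, the volume added is one tenth of the final reaction volume, which is the expected result for a tenfold dilution.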
- Fe2+ is plated in the reaction site (e.g., well) at a desired final reaction concentration (“pre-plated Fe2+”) and then frozen.
- the plated Fe2+ can be stored frozen for various times ranging from 1 hour to 1 year prior to use. Prior to use, the pre-plated Fe2+ is allowed to thaw and the oxidation reaction reagents and sample are then added to the reaction site to initiate the oxidation reaction.
- the Fe2+ is added in an amount to have a final reaction concentration from (1X) to about (36X) the concentration recited in the standard protocol (SOP). In certain embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (18X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (12X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration of from about (1X) to about (9X) the concentration recited in the standard protocol (SOP).
- the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (6X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (3X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (18X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (12X) the concentration recited in the standard protocol (SOP).
- the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (9X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (6X) to about (9X) the concentration recited in the standard protocol (SOP).
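The fold values recited above can be translated into final reaction concentrations once a 1X value is fixed. In the sketch below the 1X value of 40 (in the same unit as the storage example) is an assumption borrowed from that example; the disclosure does not state that it is the SOP concentration. The word "about" in the recited ranges is ignored, and the helper names are hypothetical.

```python
ASSUMED_1X_CONC = 40.0  # assumption: taken from the storage example, not a stated SOP value


def final_concentration(fold: float, one_x: float = ASSUMED_1X_CONC) -> float:
    """Final reaction concentration for a given fold over the 1X value."""
    return fold * one_x


def in_range(fold: float, low: float, high: float) -> bool:
    """True if `fold` lies within an inclusive recited range such as 3X to 9X."""
    return low <= fold <= high


folds = [1, 3, 6, 9, 12, 18, 36]
table = {fold: final_concentration(fold) for fold in folds}

six_x_in_3_to_9 = in_range(6, 3, 9)
thirty_six_x_in_1_to_18 = in_range(36, 1, 18)
```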
- the pH of the oxidation buffer is assessed, then that pH is used to vary the concentration of Fe2+ used in the reaction. In some embodiments, the oxidation state of the other reaction buffers is assessed to vary the concentration of Fe2+ used in the reaction. In some embodiments, the oxidation state of the iron is assessed to vary the concentration of iron used in the reaction.
- the pH of the oxidation reaction is between about 7.5 to about 8.0. In some embodiments, the pH of the oxidation reaction is between about 7.6 to about 7.9. In some embodiments, the pH of the oxidation reaction is between about 7.75 to about 7.85. In other embodiments, the pH of the oxidation reaction is between about 7.5 and about 7.8.
- cytidine deaminase generally refers to an enzyme that deaminates cytosine (C) to form uracil (U).
- Non-limiting examples of cytidine deaminases include the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC) family of cytidine deaminases, such as APOBEC3A.
- a cytidine deaminase described herein may have an amino acid sequence that is at least 90% identical to (e.g., at least 95% identical to) the amino acid sequence of GenBank accession number AKE33285.1, which is the sequence of human APOBEC3A.
- a cytidine deaminase described herein converts unmodified cytosine to uracil with an efficiency of at least 95%, 98% or 99%, or at least 99%.
- glucosyltransferase (GT) generally refers to an enzyme that catalyzes the transfer of a beta-D-glucosyl or alpha-D-glucosyl residue from UDP-glucose to a 5hmC residue to form 5ghmC.
- APOBEC can convert 5hmC to U at a low rate relative to converting C or 5mC to U.
- An example of a GT is T4-betaGT (βGT).
- GT may be used concurrently with a dioxygenase.
- GT may be used together with dioxygenase in the same reaction mix with DNA such that the dioxygenase converts 5mC to 5hmC and 5caC, and the GT converts any residual 5hmC to 5ghmC to ensure only cytosine is deaminated.
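The interplay of the dioxygenase, the GT, and the deaminase can be pictured as a small state model. This is a deterministic, all-or-nothing idealization offered only as a sketch: real enzymatic steps are incomplete and stochastic, and the function names are hypothetical. The model treats deaminated 5mC as thymine, which is what deamination of 5-methylcytosine yields, so an unprotected 5mC would read the same as a converted unmethylated cytosine.

```python
# Stepwise oxidation by a dioxygenase such as TET2: 5mC -> 5hmC -> 5fC -> 5caC.
OXIDATION_STEP = {"5mC": "5hmC", "5hmC": "5fC", "5fC": "5caC"}

# Idealized deaminase: acts only on C and on unprotected 5mC.
DEAMINATION = {"C": "U", "5mC": "T"}


def oxidize(base: str, steps: int) -> str:
    """Apply up to `steps` oxidation steps; other bases are unchanged."""
    for _ in range(steps):
        base = OXIDATION_STEP.get(base, base)
    return base


def glucosylate(base: str) -> str:
    """GT converts residual 5hmC to 5ghmC."""
    return "5ghmC" if base == "5hmC" else base


def em_convert(bases: list, oxidation_steps: int = 3) -> list:
    """Oxidize, glucosylate, then deaminate each base in order."""
    converted = []
    for base in bases:
        base = glucosylate(oxidize(base, oxidation_steps))
        converted.append(DEAMINATION.get(base, base))
    return converted


full = em_convert(["C", "5mC", "5hmC", "G"])
partial = em_convert(["C", "5mC", "5hmC", "G"], oxidation_steps=1)
no_oxidation = em_convert(["C", "5mC", "5hmC", "G"], oxidation_steps=0)
```

In the `partial` case the GT rescues the 5mC that was oxidized only as far as 5hmC, whereas in the `no_oxidation` case the 5mC is deaminated and its methylation signal is lost.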
- comparing generally refers to analyzing two or more sequences relative to one another. In some cases, comparing may be performed by aligning two or more sequences with one another such that correspondingly positioned nucleotides are aligned with one another.
- reference sequence generally refers to the sequence of a fragment that is being analyzed.
- a reference sequence may be obtained from a public database or may be separately sequenced as part of an experiment. In some cases, the reference sequence may be hypothetical such that the reference sequence may be computationally deaminated (e.g., to change Cs into Us or Ts etc.) to allow a sequence comparison to be made.
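Computational deamination of a reference sequence can be sketched in a few lines. The function below replaces C with T; the optional `keep_cpg` flag, which leaves cytosines in a CpG context unchanged to model fully methylated CpG sites, is an added assumption and is not described in the disclosure. The function name is hypothetical.

```python
def deaminate_reference(reference: str, keep_cpg: bool = False) -> str:
    """Return the reference with C replaced by T.

    If `keep_cpg` is True, a C immediately followed by G is left as C.
    """
    ref = reference.upper()
    out = []
    for i, base in enumerate(ref):
        is_cpg = base == "C" and i + 1 < len(ref) and ref[i + 1] == "G"
        if base == "C" and not (keep_cpg and is_cpg):
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


fully_converted = deaminate_reference("ACGTCCA")
cpg_protected = deaminate_reference("ACGTCCA", keep_cpg=True)
```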
- As used herein, the terms “G”, “A”, “T”, “U”, “C”, “5mC”, “5fC”, “5caC”, “5hmC”, and “5ghmC” generally refer to nucleotides that contain guanine (G), adenine (A), thymine (T), uracil (U), cytosine (C), 5-methylcytosine, 5-formylcytosine, 5-carboxylcytosine, 5-hydroxymethylcytosine, and 5-glucosylhydroxymethylcytosine, respectively.
- MRD minimal residual disease
- MRD testing may be performed to determine whether the cancer treatment is working and to guide further treatment plans.
- Various metrics can be used to assess MRD, including, but not limited to, response to treatment, tumor load, residual tumor post-surgery, relapse, secondary screen, primary screen, and cancer progression.
- next-generation sequencing (NGS) generally applies to sequencing libraries of genomic fragments of a size of less than 1 kb.
- the terms “detect”, “detecting”, or “detection” of a status or outcome generally includes detecting the presence of an indication (such as cancer), detecting status or outcome, or detecting predisposition to a status or outcome.
- the terms “diagnose” or “diagnosis” of a status or outcome generally includes predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of patient, diagnosing a therapeutic response of a patient, prognosis of status or outcome, progression, and response to particular treatment.
- the terms “healthy” or “normal”, as used herein, generally refer to a subject not having a disease, or a sample derived therefrom. While health is a dynamic state, the term may refer to the pathological state of a subject that lacks a referenced disease state, for example, cancer. In one example, when referring to a methylation profile that classifies subjects with cancer, the term “healthy” generally refers to an individual lacking cancer, such as CRC. While other diseases or states of health may be present in that subject, the term “healthy” may generally indicate the lack of a stated disease for comparison or classification purposes between subjects having and lacking a disease state, and samples derived therefrom.
- the term “threshold” generally refers to a value that is selected to discriminate, separate, or distinguish between two populations of subjects.
- the threshold discriminates methylation status between a disease (e.g., malignant) state, and a non-disease (e.g., healthy) state.
- the threshold discriminates between stages of disease (e.g., stage 1, stage 2, stage 3, or stage 4). Thresholds may be set according to the disease in question, and may be based on earlier analysis, e.g., of a training set or determined computationally on a set of inputs having a known characteristic (e.g., healthy, disease, or stage of disease).
- Thresholds may also be set for a gene region according to the predictive value of methylation at a particular site. Thresholds may be different for each methylation site, and data from multiple sites may be combined in the end analysis.
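One way to picture per-site thresholds combined in an end analysis is shown below. The site names, threshold values, and the rule that a sample is flagged when at least half of the evaluated sites meet their thresholds are all assumptions for illustration; the disclosure does not specify how data from multiple sites are combined.

```python
def sites_at_or_above(values: dict, thresholds: dict) -> dict:
    """Flag each site whose methylation value meets its own threshold."""
    return {
        site: values[site] >= thresholds[site]
        for site in thresholds
        if site in values
    }


def combined_call(values: dict, thresholds: dict, min_fraction: float = 0.5) -> bool:
    """Flag the sample when enough sites meet their thresholds (assumed rule)."""
    flags = sites_at_or_above(values, thresholds)
    if not flags:
        raise ValueError("no sites in common between values and thresholds")
    return sum(flags.values()) / len(flags) >= min_fraction


# Hypothetical per-site thresholds and one sample's methylation fractions.
thresholds = {"site_1": 0.30, "site_2": 0.60, "site_3": 0.50}
sample = {"site_1": 0.45, "site_2": 0.20, "site_3": 0.90}

flags = sites_at_or_above(sample, thresholds)
flagged = combined_call(sample, thresholds)
```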
- the term “subject” generally refers to an individual, entity or a medium that has or is suspected of having testable or detectable genetic information or material.
- a subject can be a person, individual, or patient.
- the subject can be a vertebrate, such as, for example, a mammal.
- Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets.
- the subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a cancer or a stage of a cancer of the subject.
- the subject can be asymptomatic with respect to such health or physiological state or condition.
- sample generally refers to a biological sample obtained from or derived from one or more subjects.
- Biological samples may be cell-free biological samples or substantially cell-free biological samples, or may be processed or fractionated to produce cell-free biological samples.
- cell-free biological samples may include cell-free ribonucleic acid (cfRNA), cell-free deoxyribonucleic acid (cfDNA), cell-free protein and/or cell-free polypeptides.
- a biological sample may be tissue (e.g., tissue obtained by biopsy), blood (e.g., whole blood), plasma, serum, sweat, urine, saliva, or a derivative thereof.
- Cell-free biological samples may be obtained or derived from subjects using an ethylenediaminetetraacetic acid (EDTA) collection tube, a cell-free RNA collection tube (e.g., Streck), or a cell-free DNA collection tube (e.g., Streck).
- Cell-free biological samples may be derived from whole blood samples by fractionation.
- Biological samples or derivatives thereof may comprise cells.
- a biological sample may be a blood sample or a derivative thereof (e.g., blood collected by a collection tube or blood drops), a tumor sample, a tissue sample, a urine sample, or a cell (e.g., tissue) sample.
- nucleic acid library preparation and sequencing of methylated regions for methylation profiling of nucleic acid such as cell-free deoxyribonucleic acid (cfDNA).
- the methods and systems may be directed to double-stranded DNA (dsDNA) library preparation and subsequent methylation sequencing and profiling of nucleic acids from a biological sample which can improve DNA library yield, minimize signal loss, reduce biases, improve coverage, uniformity of coverage, resolution, and accuracy of methylation data to support practical applications such as those described herein.
- the resulting sequencing data obtained from methods provided herein may be useful for applications that utilize methylation profiling data for classifying or stratifying a population of individuals.
- classifying or stratifying of a population of individuals may include identifying and/or detecting individuals as having a disease, staging disease progression (including detection of minimal residual disease (MRD)), or determining an individual’s response to a particular treatment for a disease.
- methylation analysis may be coupled with DNA sequencing to determine a likelihood that a sample is normal, tumor-derived, or disease-positive. For example, a relative abundance of methylated or unmethylated DNA fragments that map or align to specific genomic regions may be used to detect or determine disease likelihood.
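The relative abundance of methylated fragments per genomic region can be tallied as sketched below. The input format, a sequence of `(region, is_methylated)` pairs in which each fragment has already been mapped and classified, and the region labels are assumptions made for illustration.

```python
from collections import Counter


def methylated_fraction(fragments) -> dict:
    """Fraction of methylated fragments per region.

    `fragments` is an iterable of (region, is_methylated) pairs.
    """
    total = Counter()
    methylated = Counter()
    for region, is_methylated in fragments:
        total[region] += 1
        if is_methylated:
            methylated[region] += 1
    return {region: methylated[region] / total[region] for region in total}


# Hypothetical fragments already assigned to regions.
fragments = [
    ("chr1:100-200", True),
    ("chr1:100-200", False),
    ("chr1:100-200", True),
    ("chr2:50-150", False),
]

fractions = methylated_fraction(fragments)
```

These per-region fractions are the kind of quantity that could serve as an input feature when estimating disease likelihood, although the downstream model is outside the scope of this sketch.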
- Such sequencing methods may include DNA library preparation in which enzymatic processes are used to attach sequencing adapters to double-stranded DNA (dsDNA) fragments.
- the dsDNA library preparation methods comprise attaching conversion-resistant adapters to the dsDNA prior to methyl conversion, amplification, and sequencing.
- conversion-resistant adapters may comprise modified cytosines that are deamination-resistant and resistant to methylation conversion or alteration by a methylation enrichment method.
- the conversion-resistant adapters described herein can comprise one or more conversion-resistant nucleotides (e.g., deamination-resistant modified cytosines) that are not chemically altered by treatment with a methylation conversion agent to change the base pairing specificity of the nucleotide base.
- conversion-resistant adapters can comprise one or more deamination-resistant modified cytosines comprising propynyl-C, pyrrolo-C, or 5-methylcytosine (5mC), or a combination thereof, for use in the library preparation methods disclosed herein in conjunction with sodium bisulfite or enzymatic methylation conversion.
- the conversion-resistant deamination-resistant modified cytosines within the adapters are not deaminated when exposed to sodium bisulfite or a methylation conversion enzyme (e.g., a deamination enzyme).
- the conversion-resistant adapters can comprise one or more propynyl-C residues.
- the conversion-resistant adapters can comprise one or more pyrrolo-C residues.
- the conversion-resistant adapters can comprise one or more 5mC residues.
- the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues.
- the conversion-resistant adapters can comprise a combination of propynyl-C and pyrrolo-C residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of pyrrolo-C and 5mC residues.
- a prepared cell-free nucleic acid library sequence comprising conversion-resistant adapters can further comprise sequence tags, index barcodes, unique molecular identifiers (UMIs), or combinations thereof that are further ligated to cell-free nucleic acid sample molecules for use in library preparation for NGS approaches including fields such as epigenetics.
- the methods and systems set forth herein can further comprise preparing a double-stranded DNA (dsDNA) sequencing library comprising conversion-resistant adapters and subjecting a sample from the dsDNA library to a methylation interrogation treatment method such that unmethylated cytosine bases in the dsDNA are converted to uracil bases.
- dsDNA double-stranded DNA
- upon attaching the conversion-resistant nucleic acid adapters to the dsDNA, the adapter-ligated dsDNA can be subjected to enzymatic methyl (EM) conversion or bisulfite treatment methods, and amplified (e.g., via polymerase chain reaction (“PCR”)) in order to provide a base-level resolution of DNA methylation via sequencing.
- EM enzymatic methyl
- PCR polymerase chain reaction
- the adapter ligation is performed before methylation base conversion and amplification.
- methylation interrogating methods are performed prior to adapter ligation.
- performing adapter ligation before methylation base conversion and amplification may be used rather than performing adapter ligation after methylation base conversion.
- enzymatic conversion methods may be used rather than chemical conversion methods (e.g., bisulfite treatment), as chemical treatment may lead to more extensive DNA damage and molecular loss than enzymatic methods.
- the conversion-resistant adapter-ligated dsDNA sequences are enzymatically converted, amplified and sequenced after conversion.
- the conversion-resistant adapter sequences can comprise one or more modified cytosine residues such that they are resistant to conversion by either chemical or enzymatic methylation conversion methods.
- These conversion-resistant nucleic acid adapters can comprise deamination-resistant modified cytosines comprising propynyl-C, pyrrolo-C, or 5-methylcytosine (5mC), or a combination thereof.
- the conversion-resistant adapters can comprise one or more propynyl-C residues.
- the conversion-resistant adapters can comprise one or more pyrrolo-C residues. In still another embodiment, the conversion-resistant adapters can comprise one or more 5mC residues. In still another embodiment, the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and pyrrolo-C residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of pyrrolo-C and 5mC residues.
- DNA methylation analysis may be coupled with sequencing to determine whether a portion of cfDNA is likely to be pre-cancerous or tumor-derived.
- DNA methylation is a covalent modification of DNA and a stable inherited mark that can play an important role in repressing gene expression and regulating chromatin architecture.
- DNA methylation primarily occurs at cytosine residues in CpG dinucleotides. Unlike other dinucleotides, CpGs are not evenly distributed across the genome and can be concentrated in short CpG-rich DNA regions called CpG islands. In general, the majority of the CpG sites in the genome are ~70-75% methylated.
- methylation patterns differ from cell type to cell type, reflecting their role in regulating cell type-specific gene expression.
- a cell’s methylome can program the cell’s terminal differentiation state to be, for instance, a neuron, a muscle cell, an immune cell, etc.
- CpG methylation can be deregulated, and aberrations in methylation patterns are some of the earliest events that occur in tumorigenesis. Methylation profiles in a given cancer type most closely resemble those of the tissue of origin of the cancer. Thus, aberrant methylation marks on a cfDNA fragment can be used to differentiate a cancer cell from a normal cell, and to determine tissue type origin.
- global CpG methylation levels decrease in cancer cells, but at specific loci, mean methylation levels (or % methylation) can vary at specific CpG sites in cancer cells relative to matched normal cells.
- DMCs differentially methylated CpGs
- DMRs differentially methylated regions
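As an illustration of the mean methylation level (% methylation) discussed above, the following minimal Python sketch computes per-site and per-region values from read counts. It assumes the common convention that, after conversion and sequencing, a cytosine call at a CpG indicates a methylated base and a thymine call an unmethylated one. The function names and the example counts are hypothetical and are not taken from the disclosure.

```python
def percent_methylation(methylated_calls: int, unmethylated_calls: int) -> float:
    """Percent methylation at one CpG site from read-level calls."""
    total = methylated_calls + unmethylated_calls
    if total == 0:
        raise ValueError("no coverage at this CpG site")
    return 100.0 * methylated_calls / total


def mean_methylation(sites) -> float:
    """Mean percent methylation over (methylated, unmethylated) count pairs."""
    levels = [percent_methylation(m, u) for m, u in sites]
    return sum(levels) / len(levels)


# Hypothetical counts for the same two CpG sites in two samples.
tumor_region = [(18, 2), (15, 5)]
normal_region = [(2, 18), (5, 15)]

# A large difference in mean methylation is what would flag the region
# as a candidate differentially methylated region (DMR).
difference = mean_methylation(tumor_region) - mean_methylation(normal_region)
print(difference)
```

Averaging per-site percentages weights every CpG equally regardless of coverage; pooling counts across the region before dividing is an alternative when coverage is uneven.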
- Bisulfite conversion or bisulfite sequencing may be used for DNA methylation analysis.
- Bisulfite sequencing is a convenient and effective method of mapping DNA methylation to individual bases.
- bisulfite conversion is a harsh and destructive process for cfDNA that leads to degradation of >90% of the sample DNA.
- enzymatic methylation (EM) conversion may be used for DNA methylation analysis and sequencing.
- methylation conversion is mediated by non-destructive enzymatic reactions involving a ten-eleven translocation (TET) enzyme and a cytosine-deaminating enzyme (e.g., APOBEC) to convert unmethylated (but not methylated) cytosines to uracils.
- TET ten-eleven translocation
- APOBEC cytosine-deaminating enzyme
- TAPS Tet-assisted pyridine borane sequencing
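The effect of conversion on an adapter-ligated molecule can be sketched with a toy in-silico model. This is an illustrative simplification, not the disclosed chemistry: the single-letter encoding ('C' for unmethylated cytosine, 'M' for 5mC, 'R' for a conversion-resistant modified cytosine in the adapter), the function names, and the example sequences are all assumptions made for this sketch.

```python
# Assumed encoding (hypothetical, for illustration only):
#   'C' = unmethylated cytosine, 'M' = 5-methylcytosine,
#   'R' = conversion-resistant modified cytosine (e.g., propynyl-C).

def convert(strand: str) -> str:
    """Convert unmethylated cytosines to uracil; all other bases are untouched."""
    return strand.replace("C", "U")


def read_after_pcr(converted: str) -> str:
    """Base calls expected after amplification: U reads as T; M and R read as C."""
    return converted.translate(str.maketrans({"U": "T", "M": "C", "R": "C"}))


protected_adapter = "GRTRA"    # adapter cytosines replaced by resistant analogs
unprotected_adapter = "GCTCA"  # same adapter built with ordinary cytosines
insert = "ACMGTCA"             # sample fragment with one methylated CpG

# The protected adapter keeps its designed sequence through conversion,
# so amplification primers designed against "GCTCA" still match it.
print(read_after_pcr(convert(protected_adapter)))

# The unprotected adapter loses its cytosines and no longer matches.
print(read_after_pcr(convert(unprotected_adapter)))

# In the insert, only the methylated cytosine is still read as C.
print(read_after_pcr(convert(protected_adapter + insert)))
```

The sketch shows why the adapter must survive conversion: the insert is expected to change wherever cytosines are unmethylated, whereas the adapter serves as the priming template and should read the same before and after treatment.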
- Methods are provided herein for the preparation of a sequencing library for methylation sequencing.
- the methods described herein utilize conversion-resistant adapters to provide a dsDNA library that is acceptable for next generation methylation sequencing applications.
- the resulting raw sequencing data may be used for methylation state analysis, as well as more conventional cfDNA analysis, such as copy number alterations, germline variant detection, somatic variant detection, nucleosome positioning, transcription factor profiling, chromatin immunoprecipitation, and the like.
- one or more of the cytosines present in adapters are replaced with a deamination-resistant modified cytosine selected from propynyl-C, pyrrolo-C, or 5mC to provide conversion-resistant adapters used in dsDNA library preparation.
- the conversion-resistant adapters can comprise one or more deamination-resistant modified cytosines.
- the conversion-resistant adapters can comprise one or more propynyl-C residues.
- the conversion-resistant adapters can comprise one or more pyrrolo-C residues.
- the conversion-resistant adapters can comprise one or more 5mC residues.
- the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues.
- the conversion-resistant nucleic acid adapters are ligated to the 5’ and 3’ ends of a population of nucleic acid fragments in a biological sample to produce a sequencing library.
- a collection of conversion-tolerant nucleic acid adapters is ligated to the nucleic acid fragments in a sample.
- unique dual indexes (UDIs) are additional sequences that may be added to the adapters during library preparation.
- the UDI sequences can be of any length.
- enzymatic methyl conversion can be used in methylation sequencing workflows.
- Examples of enzymatic methyl conversion workflows include enzymatic methyl-seq (EM-seq) and Tet-assisted pyridine borane sequencing (TAPS).
- the methods for dsDNA library preparation for methylation analysis comprise conversion-resistant adapters wherein one or more of the cytosines present in adapters are replaced with a deamination-resistant modified cytosine selected from propynyl-C, pyrrolo-C, or 5mC to provide conversion-resistant adapters used in dsDNA library preparation.
- downstream library amplification yields can be increased, and more uniform library amplification yields can be achieved as a result of decreased aberrant conversion of methylated cytosines occurring in adapter regions that are then used as template strands for subsequent library amplification.
- EM-seq is a minimally destructive methylation conversion sequencing method for converting cytosines to uracil in nucleic acid. This bisulfite-free method preserves the length of nucleic acid molecules while achieving conversion rates similar to bisulfite sequencing. Further, EM-seq can result in higher sequencing quality scores for cytosine and guanine base pairs and can provide more even coverage of various genomic features, such as CpG islands. EM-seq comprises two sets of enzymatic reactions.
- a ten-eleven translocation (TET) enzyme (e.g., TET1, TET2, TET3, Naegleria TET, and genetically engineered versions and/or variants thereof)
- β-glucosyltransferase (e.g., T4-BGT)
- a cytosine-deaminating enzyme (e.g., APOBEC)
- Tet-assisted pyridine borane sequencing can be used in enzymatic methylation sequencing workflows.
- in TAPS, a ten-eleven translocation enzyme (TET1) is used to oxidize both 5mC and 5hmC to 5caC.
- Pyridine borane is used to reduce 5caC to dihydrouracil, a uracil derivative that is then converted to thymine after PCR.
- TAPS can be performed in two other ways: TAPSβ and chemical-assisted pyridine borane sequencing (CAPS).
- in TAPSβ, β-glucosyltransferase is used to label 5hmC with glucose, which protects 5hmC from the oxidation and reduction reactions and allows for specific detection of 5mC.
- in CAPS, potassium perruthenate acts as the chemical replacement for TET1 and specifically oxidizes 5hmC, thus allowing for direct detection of 5hmC.
- the combination of enzymatic conversion of unmodified C to U, and staggering UMI adapters in line with the library insert can be useful for targeted sequencing of methylation libraries.
- this combination may permit reduced volume inputs of plasma or mass inputs of cfDNA as compared to bisulfite conversion sequencing because sample cfDNA is not degraded to the same extent.
- a method for performing methylation sequencing of a cell-free DNA (cfDNA) sample from a subject comprising: a) extracting a cfDNA sample from the subject, wherein said cfDNA comprises double-stranded nucleic acid (dsDNA) molecules; b) ligating conversion-resistant adapters comprising one or more deamination-resistant modified cytosines to said dsDNA molecules, wherein the dsDNA molecules comprise one or more unmethylated cytosine residues; c) enzymatically converting the one or more unmethylated cytosine residues to uracils in the dsDNA molecules ligated to the conversion-resistant adapters; d) amplifying the converted dsDNA molecules comprising the conversion-resistant adapters of c); and e) determining the nucleic acid sequence of the amplified dsDNA molecules comprising the conversion-resistant adapters of d) at a depth of >50x
- the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of between about 50x to about 500x, about 25x to about 1000x, about 50x to about 500x, about 250x to about 750x, about 500x to about 200x, about 750x to about 1500x, or about 100x to about 2000x. In some embodiments, a nucleic acid sequence is sequenced at a depth of >100x or >500x.
- the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 50x, about 60x, about 70x, about 80x, about 90x, about 100x, about 110x, about 120x, about 130x, about 140x, about 150x, about 160x, about 170x, about 180x, about 190x, about 200x, about 210x, about 220x, about 230x, about 240x, about 250x, about 275x, about 300x, about 325x, about 350x, about 375x, about 400x, about 425x, about 450x, about 475x, about 500x, about 525x, about 550x, about 575x, about 600x, about 625x, about 650x, about 675x, about 700x, about 725x, about 750x, about 775x, about 800x, about 825x, about 850x, about 875x, about 900x, about 925x, about 950x, about 975
- the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 1100x, about 1200x, about 1300x, about 1400x, about 1500x, about 1600x, about 1700x, about 1800x, about 1900x, or about 2000x. In one example, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 500x, about 1000x, about 2000x, about 3000x, about 4000x, about 5000x, about 6000x, about 7000x, about 8000x, about 9000x, about 10000x, or greater than 5000x.
- the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 300x unique, about 400x unique, about 500x unique, about 600x unique, about 700x unique, about 800x unique, about 900x unique, or about 1000x unique, or greater than 500x unique.
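For orientation, the depth figures above can be related to read counts with the standard approximation that mean depth equals total sequenced bases divided by the size of the targeted region. The Python sketch below is a rough illustration only: the function names, the example numbers, and the treatment of "unique" depth as raw depth scaled by the non-duplicate fraction are assumptions, not values or methods from the disclosure.

```python
def mean_depth(num_reads: int, read_length_bp: int, target_size_bp: int) -> float:
    """Approximate mean depth: total sequenced bases / target size."""
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    return num_reads * read_length_bp / target_size_bp


def unique_depth(raw_depth: float, duplicate_fraction: float) -> float:
    """Depth remaining after discarding duplicate reads (simplified assumption)."""
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate fraction must be between 0 and 1")
    return raw_depth * (1.0 - duplicate_fraction)


# Hypothetical run: 2 million 150 bp reads over a 1 Mb target panel.
raw = mean_depth(2_000_000, 150, 1_000_000)
unique = unique_depth(raw, 0.25)
print(raw, unique)
```

The estimate ignores off-target reads and uneven coverage, so a real run would need more reads than this calculation suggests to reach a given depth.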
- enzymatic methylation sequencing results generated using the dsDNA library preparation methods described herein are used to analyze the methylation state of nucleic acids in a biological sample.
- whole genome enzymatic methyl sequencing (“WG EM-seq”) provides high resolution sequencing by characterizing DNA methylation of nearly every cytidine nucleotide in the genome.
- targeted methods, such as targeted enzymatic methyl sequencing (“TEM-seq”), may be useful for methylation analysis.
- assays that have conventionally been used for bisulfite conversion can be employed for minimally-destructive conversion methods, such as enzymatic conversion, TAPS, and CAPS.
- assays used for methylation analysis may be mass spectrometry, methylation-specific PCR (MSP), reduced representation bisulfite sequencing (RRBS), HELP assay, GLAD-PCR assay, ChIP-on-chip assays, restriction landmark genomic scanning, methylated DNA immunoprecipitation (MeDIP), pyrosequencing of bisulfite treated DNA, molecular break light assay, methyl sensitive Southern Blotting, High Resolution Melt Analysis (HRM or HRMA), ancient DNA methylation reconstruction, or Methylation Sensitive Single Nucleotide Primer Extension Assay (msSNuPE).
- MSP methylation-specific PCR
- RRBS reduced representation bisulfite sequencing
- sequence alignment methods include bwa-meth, bismark, Last, GSNAP, BSMAP, NovoAlign, Bison, Metagenomic Phylogenetic Analysis (for example, MetaPhlAn2), BLAT, Burrows-Wheeler Aligner (BWA), Bowtie, Bowtie2, Bfast, BioScope, CLC bio, Cloudburst, Eland/Eland2, GenomeMapper, GnuMap, Karma, MAQ, MOM, Mosaik, MrFAST/MrsFAST, PASS, PerM, RazerS, RMAP, SSAHA2, Segemehl, SeqMap, SHRiMP, Slider/SliderII, Srprism, Stampy, vmatch, ZOOM, and the SOAP/
- feature may generally refer to an individual measurable property or characteristic of a phenomenon being observed.
- Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition.
- the concept of “feature” is related to that of explanatory variable used in statistical techniques such as linear regression.
- the features are inputted into a feature matrix for machine learning analysis.
- For a plurality of assays, the system identifies feature sets to input to a machine learning model. The system performs an assay on each molecule class and forms a feature vector from the measured values. The system inputs the feature vector into the machine learning model and obtains an output classification of whether the biological sample has a specified property.
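The step of forming feature vectors and collecting them into a feature matrix can be sketched as follows. This is a minimal Python illustration under stated assumptions: the sample identifiers, region names, values, and the choice to fill unmeasured features with a default value are all hypothetical and are not specified by the disclosure.

```python
def build_feature_matrix(samples, fill=0.0):
    """Arrange per-sample feature dictionaries into a row-per-sample matrix.

    `samples` maps a sample ID to a {feature name: value} dictionary.
    Columns are the sorted union of feature names, so every row uses the
    same column order; features missing from a sample are set to `fill`.
    """
    columns = sorted({name for feats in samples.values() for name in feats})
    sample_ids = sorted(samples)
    matrix = [[samples[s].get(c, fill) for c in columns] for s in sample_ids]
    return sample_ids, columns, matrix


# Hypothetical mean methylation levels per genomic region.
samples = {
    "S2": {"region_A": 0.8, "region_B": 0.1},
    "S1": {"region_A": 0.2},
}

sample_ids, columns, matrix = build_feature_matrix(samples)
for sample_id, row in zip(sample_ids, matrix):
    print(sample_id, dict(zip(columns, row)))
```

Filling a missing feature with a constant is only one option; dropping the feature or imputing from other samples may suit a given model better.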
- the machine learning model outputs a classifier that distinguishes between two groups or classes of individuals or features in a population of individuals or features of the population.
- the classifier is a trained machine learning classifier.
- the informative loci or features of biomarkers in a cancer tissue are assayed to form a profile.
- Receiver Operating Characteristic (ROC) curves are useful for plotting the performance of a particular feature (e.g., any of the biomarkers described herein and/or any item of additional biomedical information) in distinguishing between two populations (e.g., individuals responding and not responding to a therapeutic agent).
- the feature data across the entire population (e.g., the cases and controls)
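The area under a ROC curve summarizes how well a single feature separates two populations. One standard way to compute it, sketched below in Python, uses the equivalence between the area under the curve and the probability that a randomly chosen case scores higher than a randomly chosen control, with ties counted as one half. The function name and the example scores are hypothetical.

```python
def roc_auc(case_scores, control_scores) -> float:
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0   # case ranked above control
            elif case == control:
                wins += 0.5   # tie counts as half
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical feature values for cases and controls.
cases = [0.9, 0.8, 0.4]
controls = [0.5, 0.3, 0.2]
auc = roc_auc(cases, controls)
print(auc)
```

A value of 0.5 corresponds to a feature with no discriminating power and 1.0 to perfect separation. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch but not for large cohorts.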
- the condition is advanced adenoma (AA), colorectal cancer (CRC), colorectal carcinoma, or inflammatory bowel disease.
- input features generally refers to variables that are used by the model to predict an output classification (label) of a sample, e.g., a condition, sequence content (e.g., mutations), suggested data collection operations, or suggested treatments. Values of the variables can be determined for a sample and used to determine a classification.
- Examples of input features of genetic data include: aligned variables that relate to alignment of sequence data (e.g., sequence reads) to a genome and non-aligned variables, e.g., that relate to the sequence content of a sequence read, a measurement of protein or autoantibody, or the mean methylation level at a genomic region.
- genetic features, such as V-plot measures, FREE-C, the cfDNA measurement over a transcription start site, and DNA methylation levels over cfDNA fragments, are used as input features for machine learning methods and models.
- the sequencing information includes information regarding a plurality of genetic features such as, but not limited to, transcription start sites, transcription factor binding sites, chromatin open and closed states, nucleosomal positioning or occupancy, and the like.
- the present disclosure provides a system, method, or kit having data analysis realized in software applications, computing hardware, or both.
- the analysis application or system includes at least a data receiving module, a data pre-processing module, a data analysis module (which can operate on one or more types of genomic data), a data interpretation module, or a data visualization module.
- the data receiving module can comprise computer systems that connect laboratory hardware or instrumentation with computer systems that process laboratory data.
- the data pre-processing module can comprise hardware systems or computer software that performs operations on the data in preparation for analysis. Examples of operations that can be applied to the data in the pre-processing module include affine transformations, denoising operations, data cleaning, reformatting, or subsampling.
- a data analysis module which can be specialized for analyzing genomic data from one or more genomic materials, can, for example, take assembled genomic sequences and perform probabilistic and statistical analysis to identify abnormal patterns related to a disease, pathology, state, risk, condition, or phenotype.
- a data interpretation module can use analysis methods, for example, drawn from statistics, mathematics, or biology, to support understanding of the relation between the identified abnormal patterns and health conditions, functional states, prognoses, or risks.
- a data visualization module can use methods of mathematical modeling, computer graphics, or rendering to create visual representations of data that can facilitate the understanding or interpretation of results.
- machine learning methods are applied to distinguish samples in a population of samples. In one embodiment, machine learning methods are applied to distinguish samples between healthy and advanced adenoma samples.
- the one or more machine learning operations used to train the methylation-based prediction engine include one or more of: a generalized linear model, a generalized additive model, a non-parametric regression operation, a random forest classifier, a spatial regression operation, a Bayesian regression model, a time series analysis, a Bayesian network, a Gaussian network, a decision tree learning operation, an artificial neural network, a recurrent neural network, a reinforcement learning operation, linear/non-linear regression operations, a support vector machine, a clustering operation, and a genetic algorithm operation.
- computer processing methods are selected from logistic regression, multiple linear regression (MLR), dimension reduction, partial least squares (PLS) regression, principal component regression, autoencoders, variational autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, decision tree, classification and regression trees (CART), tree-based methods, random forest, gradient boost tree, logistic regression, matrix factorization, multidimensional scaling (MDS), dimensionality reduction methods, t-distributed stochastic neighbor embedding (t-SNE), multilayer perceptron (MLP), network clustering, neuro-fuzzy, and artificial neural networks.
- MLR multiple linear regression
- PLS partial least squares
- classification and regression trees CART
- the methods disclosed herein can include computational analysis on nucleic acid sequencing data of samples from an individual or from a plurality of individuals.
- An analysis can identify a variant inferred from sequence data to identify sequence variants based on probabilistic modeling, statistical modeling, mechanistic modeling, network modeling, or statistical inferences.
- Non-limiting examples of analysis methods include principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, regression, support vector machines, tree-based methods, networks, matrix factorization, and clustering.
- Non-limiting examples of variants include a germline variation or a somatic mutation.
- a variant can refer to an already-known variant. The already-known variant can be scientifically confirmed or reported in literature.
- a variant can refer to a putative variant associated with a biological change. A biological change can be known or unknown. In some embodiments, a putative variant can be reported in literature, but not yet biologically confirmed.
- germline variants can refer to nucleic acids that induce natural or normal variations.
- Natural or normal variations can include, for example, skin color, hair color, and normal weight.
- somatic mutations can refer to nucleic acids that induce acquired or abnormal variations.
- Acquired or abnormal variations can include, for example, cancer, obesity, conditions, symptoms, diseases, and disorders.
- the analysis can include distinguishing between germline variants.
- Germline variants can include, for example, private variants and somatic mutations.
- the identified variants can be used by clinicians or other health professionals to improve health care methodologies, accuracy of diagnoses, and cost reduction.
- Samples obtained from subjects other than the patient can also be used. Other samples can also be collected from subjects previously analyzed by a sequencing assay or a targeted sequencing assay (e.g., a targeted resequencing assay).
- Methods, computing systems, or software media disclosed herein can improve identification and accuracy of variations or mutations (e.g., germline or somatic, including copy number variations, single nucleotide variations, indels, and gene fusions), and lower limits of detection by reducing the number of false positive and false negative identifications.
- the present systems and methods provide a classifier generated based on feature information derived from methylation sequence analysis from biological samples of cfDNA prepared by the dsDNA library preparation methods described herein.
- the classifier forms part of a predictive engine for distinguishing groups in a population based on methylation sequence features identified in biological samples such as cfDNA.
- a classifier is created by normalizing the methylation information by formatting similar portions of the methylation information into a unified format and a unified scale; storing the normalized methylation information in a columnar database; training a methylation prediction engine by applying one or more machine learning operations to the stored normalized methylation information, the methylation prediction engine mapping, for a particular population, a combination of one or more features; applying the methylation prediction engine to the accessed field information to identify a methylation associated with a group; and classifying the individual into a group.
- Specificity may be defined as the probability of a negative test among those who are free from the disease. Specificity is equal to the number of disease-free persons who tested negative divided by the total number of disease-free individuals.
- the model, classifier, or predictive test has a specificity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
- Sensitivity may be defined as the probability of a positive test among those who have the disease. Sensitivity is equal to the number of diseased individuals who tested positive divided by the total number of diseased individuals.
- the model, classifier, or predictive test has a sensitivity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
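The definitions of specificity and sensitivity given above translate directly into code. The Python sketch below follows those definitions; the function names and the example counts are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Diseased individuals who tested positive / all diseased individuals."""
    diseased = true_positives + false_negatives
    if diseased == 0:
        raise ValueError("no diseased individuals in the sample")
    return true_positives / diseased


def specificity(true_negatives: int, false_positives: int) -> float:
    """Disease-free individuals who tested negative / all disease-free individuals."""
    disease_free = true_negatives + false_positives
    if disease_free == 0:
        raise ValueError("no disease-free individuals in the sample")
    return true_negatives / disease_free


# Hypothetical cohort: 100 diseased (90 test positive) and
# 100 disease-free (95 test negative).
print(sensitivity(true_positives=90, false_negatives=10))
print(specificity(true_negatives=95, false_positives=5))
```

With these counts the test would meet the "at least 90%" sensitivity and "at least 95%" specificity thresholds listed above.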
- the group is selected from healthy (asymptomatic), cancer, gut-associated diseases, immune-mediated inflammatory diseases, neurological diseases, kidney diseases, prenatal diseases, and metabolic diseases.
- the subject matter described herein can include a digital processing device or use of the same.
- the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that carry out the device's functions.
- the digital processing device can include an operating system configured to perform executable instructions.
- the digital processing device can optionally be connected to a computer network.
- the digital processing device can be optionally connected to the Internet such that it accesses the World Wide Web.
- the digital processing device can be optionally connected to a cloud computing infrastructure.
- the digital processing device can be optionally connected to an intranet.
- the digital processing device can be optionally connected to a data storage device.
- Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers.
- Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.
- the operating system can include software, including programs and data, which manages the device's hardware and provides services for execution of applications.
- Non-limiting examples of operating systems include FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®.
- Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®.
- the operating system can be provided by cloud computing, and cloud computing resources can be provided by one or more service providers.
- the device can include a storage and/or memory device.
- the storage and/or memory device can be one or more physical apparatuses used to store data or programs on a temporary or permanent basis.
- the device can be volatile memory and require power to maintain stored information.
- the device can be non-volatile memory and retain stored information when the digital processing device is not powered.
- the non-volatile memory can include flash memory.
- the volatile memory can include dynamic random-access memory (DRAM).
- the non-volatile memory can include ferroelectric random access memory (FRAM).
- the non-volatile memory can include phase-change random access memory (PRAM).
- the device can be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage.
- the storage and/or memory device can be a combination of devices such as those disclosed herein.
- the digital processing device can include a display to send visual information to a user.
- the display can be a cathode ray tube (CRT).
- the display can be a liquid crystal display (LCD).
- the display can be a thin film transistor liquid crystal display (TFT-LCD).
- the display can be an organic light emitting diode (OLED) display.
- OLED organic light emitting diode
- an OLED display can be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display.
- the display can be a plasma display.
- the display can be a video projector.
- the display can be a combination of devices such as those disclosed herein.
- the digital processing device can include an input device to receive information from a user.
- the input device can be a keyboard.
- the input device can be a pointing device including, for example, a mouse, trackball, track pad, joystick, game controller, or stylus.
- the input device can be a touch screen or a multi-touch screen.
- the input device can be a microphone to capture voice or other sound input.
- the input device can be a video camera to capture motion or visual input.
- the input device can be a combination of devices such as those disclosed herein.
- Non-transitory computer-readable storage medium
- the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device.
- a computer-readable storage medium can be a tangible component of a digital processing device.
- a computer-readable storage medium can be optionally removable from a digital processing device.
- a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like.
- the program and instructions can be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
- FIG. 7 shows a computer system 701 that is programmed or otherwise configured to store, process, identify, or interpret patient data, biological data, biological sequences, or reference sequences.
- the computer system 701 can process various aspects of patient data, biological data, biological sequences, or reference sequences of the present disclosure.
- the computer system 701 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
- the electronic device can be a mobile electronic device.
- the computer system 701 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 705, which can be a single-core or multi-core processor, or a plurality of processors for parallel processing.
- the computer system 701 also includes memory or memory location 710 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 715 (e.g., hard disk), communication interface 720 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 725, such as cache, other memory, data storage and/or electronic display adapters.
- the memory 710, storage unit 715, interface 720, and peripheral devices 725 are in communication with the CPU 705 through a communication bus (solid lines), such as a motherboard.
- the storage unit 715 can be a data storage unit (or data repository) for storing data.
- the computer system 701 can be operatively coupled to a computer network ("network") 730 with the aid of the communication interface 720.
- the network 730 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network 730 in some embodiments is a telecommunication and/or data network.
- the network 730 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network 730 in some embodiments with the aid of the computer system 701, can implement a peer-to-peer network, which may enable devices coupled to the computer system 701 to behave as a client or a server.
- the CPU 705 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
- the instructions may be stored in a memory location, such as the memory 710.
- the instructions can be directed to the CPU 705, which can subsequently program or otherwise configure the CPU 705 to implement methods of the present disclosure. Examples of operations performed by the CPU 705 can include fetch, decode, execute, and writeback.
- the CPU 705 can be part of a circuit, such as an integrated circuit.
- One or more other components of the system 701 can be included in the circuit.
- the circuit is an application specific integrated circuit (ASIC).
- the storage unit 715 can store files, such as drivers, libraries, and saved programs.
- the storage unit 715 can store user data, e.g., user preferences and user programs.
- the computer system 701 in some embodiments can include one or more additional data storage units that are external to the computer system 701, such as located on a remote server that is in communication with the computer system 701 through an intranet or the Internet.
- the computer system 701 can communicate with one or more remote computer systems through the network 730.
- the computer system 701 can communicate with a remote computer system of a user.
- remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants.
- the user can access the computer system 701 via the network 730.
- Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 701, such as, for example, on the memory 710 or electronic storage unit 715.
- the machine executable or machine-readable code can be provided in the form of software.
- the code can be executed by the processor 705.
- the code can be retrieved from the storage unit 715 and stored on the memory 710 for ready access by the processor 705.
- the electronic storage unit 715 can be precluded, and machine-executable instructions are stored on memory 710.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be interpreted or compiled during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.
- aspects of the systems and methods provided herein can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” and may be in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
- Storage type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
- Volatile storage media include dynamic memory, such as the main memory of such a computer platform.
- Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
- Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
- Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
- Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
- the computer system 701 can include or be in communication with an electronic display 735 that comprises a user interface (UI) 740 for providing, for example, a nucleic acid sequence, an enriched nucleic acid sample, an expression profile, and an analysis of an expression profile.
- Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.
- Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
- An algorithm can be implemented by way of software upon execution by the central processing unit 705.
- the algorithm can, for example, probe a plurality of regulatory elements, sequence a nucleic acid sample, enrich a nucleic acid sample, determine an expression profile of a nucleic acid sample, analyze an expression profile of a nucleic acid sample, and archive or disseminate results of analysis of an expression profile.
- the subject matter disclosed herein can include at least one computer program or use of the same.
- a computer program can be a sequence of instructions, executable in the digital processing device's CPU, GPU, or TPU, written to perform a specified task.
- Computer-readable instructions can be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types.
- a computer program can be written in various versions of various languages.
- a computer program can include one sequence of instructions. In some embodiments, a computer program can include a plurality of sequences of instructions. In some embodiments, a computer program can be provided from one location. In some embodiments, a computer program can be provided from a plurality of locations. In some embodiments, a computer program can include one or more software modules. In some embodiments, a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins or add-ons, or combinations thereof.
- computer processing can be a method of statistics, mathematics, biology, or any combination thereof.
- the computer processing method includes a dimension reduction method including, for example, logistic regression, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, matrix factorization, network clustering, and neural network.
- the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network.
- the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.
- the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, biological data, biological sequences, or reference sequences. Reference sequences can be derived from a database.
- suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases.
- a database can be internet-based.
- a database can be web-based.
- a database can be cloud computing-based.
- a database can be based on one or more local computer storage devices.
- the trained machine learning methods, models, and discriminate classifiers described herein are useful for various medical applications including cancer detection, diagnosis, and treatment responsiveness.
- Where models are trained with individual metadata and analyte-derived features, the applications may be tailored to stratify individuals in a population and guide treatment decisions accordingly.
- Methods and systems provided herein may perform predictive analytics using artificial intelligence-based approaches to analyze acquired data from a subject (patient) to generate an output of the detection and/or diagnosis of a subject having a cancer (e.g., CRC).
- the application may apply a prediction algorithm to the acquired data to generate the detection of cancer, thereby providing a diagnosis of the subject having the cancer.
- the prediction algorithm may comprise an artificial intelligence-based predictor, such as a machine learning-based predictor, configured to process the acquired data to generate the diagnosis of the subject having the cancer.
- the machine learning predictor may be trained using datasets, e.g., datasets generated by performing multi-analyte assays of biological samples of individuals, from one or more sets of cohorts of patients having cancer as inputs and known diagnosis (e.g., staging and/or tumor fraction) outcomes of the subjects as outputs to the machine learning predictor.
- Training datasets may be generated from, for example, one or more sets of subjects having common characteristics (features) and outcomes (labels). Training datasets may comprise a set of features and labels corresponding to the features relating to diagnosis. Features may comprise characteristics such as, for example, certain ranges or categories of cfDNA assay measurements, such as counts of cfDNA fragments in biological samples obtained from healthy and diseased subjects that overlap or fall within each of a set of bins (genomic windows) of a reference genome.
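- The bin-count features described above can be illustrated with a minimal sketch. The function name `bin_fragment_counts`, the 1 Mb bin size, the fragment coordinates, and the choice to assign each fragment to the bin containing its midpoint are all assumptions made for illustration; the disclosure does not specify them.

```python
from collections import Counter

def bin_fragment_counts(fragments, bin_size=1_000_000):
    """Count fragments per (chromosome, bin index).

    Assumption: each fragment is assigned to the single bin that contains
    its midpoint, so no fragment is counted twice.
    """
    counts = Counter()
    for chrom, start, end in fragments:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts

# Hypothetical aligned cfDNA fragments as (chromosome, start, end).
fragments = [
    ("chr1", 100, 267),
    ("chr1", 950_000, 950_160),
    ("chr1", 1_000_050, 1_000_220),
    ("chr2", 5_000, 5_170),
]
features = bin_fragment_counts(fragments)
for (chrom, index), n in sorted(features.items()):
    print(chrom, index, n)
```

- Counting fragments that merely overlap a bin, rather than those whose midpoint falls in it, would be an equally reasonable reading of the text and would only change the assignment rule inside the loop.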
- a set of features collected from a given subject at a given time point may collectively serve as a diagnostic signature, which may be indicative of an identified cancer of the subject at the given time point.
- Characteristics may also include labels indicating the subject's diagnostic outcome, such as for one or more cancers.
- Labels may comprise outcomes such as, for example, known diagnosis (e.g., staging and/or tumor fraction) outcomes of the subject.
- Outcomes may include a characteristic associated with the cancers in the subject. For example, characteristics may be indicative of the subject having one or more cancers.
- Training sets may be selected by random sampling of a set of data corresponding to one or more sets of subjects (e.g., retrospective and/or prospective cohorts of patients having or not having one or more cancers).
- training sets may be selected by proportionate sampling of a set of data corresponding to one or more sets of subjects (e.g., retrospective and/or prospective cohorts of patients having or not having one or more cancers).
- Training sets may be balanced across sets of data corresponding to one or more sets of subjects (e.g., patients from different clinical sites or trials).
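- As a minimal sketch of proportionate sampling, the following draws the same fraction of subjects from each label group so that the training set preserves the cohort's class proportions. The helper `proportionate_sample`, the cohort records, and the 50% sampling fraction are hypothetical and are not taken from the disclosure.

```python
import random
from collections import defaultdict

def proportionate_sample(records, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical cohort: 20 subjects with cancer and 80 without.
cohort = [
    {"id": i, "label": "cancer" if i < 20 else "non-cancer"}
    for i in range(100)
]
training_set = proportionate_sample(cohort, key=lambda r: r["label"], fraction=0.5)
n_cancer = sum(1 for r in training_set if r["label"] == "cancer")
print(len(training_set), n_cancer)
```

- Balancing across clinical sites or trials can use the same function with a different `key`, for example one that returns a site identifier.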
- the machine learning predictor may be trained until certain pre-determined conditions for accuracy or performance are satisfied, such as having minimum desired values corresponding to diagnostic accuracy measures.
- the diagnostic accuracy measure may correspond to prediction of a diagnosis, staging, or tumor fraction of one or more cancers in the subject.
- Examples of detection and diagnostic accuracy measures may include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and area under the curve (AUC) of a ROC curve corresponding to the diagnostic accuracy of detecting or predicting the cancer (e.g., colorectal cancer).
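- The accuracy measures listed above follow from the counts of true and false positives and negatives. The sketch below uses hypothetical labels and scores and a 0.5 decision threshold chosen only for illustration; it assumes both classes are present among the true and predicted labels so that no denominator is zero. The AUC is computed as the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, and accuracy from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth (1 = cancer) and classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = diagnostic_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```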
- the present disclosure provides a method for detecting or identifying a cancer in a subject, comprising: (a) providing a biological sample comprising ssDNA molecules of a cell-free DNA sample derived from said subject; (b) methylation sequencing said ssDNA molecules from said subject to generate a plurality of sequencing reads; (c) aligning said sequencing reads to a reference genome; (d) generating a quantitative measure of said sequencing reads at each of a first plurality of genomic regions of said reference genome to generate a first feature set, wherein said first plurality of genomic regions of said reference genome comprises at least about 10 distinct regions, each of said at least about 10 distinct regions; and (e) applying a trained algorithm to said first feature set to generate a likelihood of said subject having said cancer.
- such a pre-determined condition may be that the sensitivity of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a pre-determined condition may be that the specificity of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a pre-determined condition may be that the positive predictive value (PPV) of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a pre-determined condition may be that the negative predictive value (NPV) of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
- such a pre-determined condition may be that the AUC of a ROC curve of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
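- A pre-determined stopping condition of the kind described above can be expressed as a set of minimum values that every tracked measure must reach. The following is a minimal sketch; the thresholds, the per-round metric values, and the helper names `meets_conditions` and `unmet_conditions` are hypothetical.

```python
def meets_conditions(metrics, minimums):
    """True when every required metric is at or above its minimum value."""
    return all(metrics[name] >= floor for name, floor in minimums.items())

def unmet_conditions(metrics, minimums):
    """Names of the metrics that are still below their minimum value."""
    return sorted(name for name, floor in minimums.items() if metrics[name] < floor)

# Hypothetical minimum values chosen before training begins.
minimums = {"sensitivity": 0.90, "specificity": 0.90, "auc": 0.95}

# Hypothetical validation metrics after successive training rounds.
round_metrics = [
    {"sensitivity": 0.82, "specificity": 0.93, "auc": 0.91},
    {"sensitivity": 0.91, "specificity": 0.92, "auc": 0.94},
    {"sensitivity": 0.93, "specificity": 0.94, "auc": 0.96},
]

# Training stops at the first round in which all conditions are satisfied.
stop_round = next(
    (i for i, m in enumerate(round_metrics) if meets_conditions(m, minimums)),
    None,
)
print(stop_round, unmet_conditions(round_metrics[1], minimums))
```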
- a method further comprises monitoring a progression of a disease in the subject, wherein the monitoring is based at least in part on the genetic sequence feature.
- the disease is a cancer.
- a method further comprises determining the tissue-of-origin of a cancer in the subject, wherein the determining is based at least in part on the genetic sequence feature.
- a method further comprises estimating a tumor burden in the subject, wherein the estimating is based at least in part on the genetic sequence feature.
- the predictive classifiers, systems and methods described herein are useful for classifying populations of individuals for a number of clinical applications (e.g., based on performing multi-analyte assays of biological samples of individuals). Examples of such clinical applications include detecting early-stage cancer, diagnosing cancer, classifying cancer to a particular stage of disease, or determining responsiveness or resistance to a therapeutic agent for treating cancer.
- the methods and systems described herein are applicable to various cancer types, as well as to grade and stage, and as such are not limited to a single cancer disease type. Therefore, combinations of analytes and assays may be used in the present systems and methods to predict responsiveness to cancer therapeutics across different cancer types in different tissues and to classify individuals based on treatment responsiveness. In one example, the classifiers described herein stratify a group of individuals into treatment responders and non-responders.
- the present disclosure also provides a method for determining a drug target of a condition or disease of interest (e.g., genes that are relevant/important for a particular class), comprising assessing a sample obtained from an individual for the level of gene expression for at least one gene; using a neighborhood analysis routine to determine genes that are relevant for classification of the sample, thereby ascertaining one or more drug targets relevant to the classification.
- the present disclosure also provides a method for determining the efficacy of a drug designed to treat a disease class, comprising obtaining a sample from an individual having the disease class; subjecting the sample to the drug; assessing the drug-exposed sample for the level of gene expression for at least one gene; and using a computer model built with a weighted voting scheme to classify the drug-exposed sample into a class of the disease as a function of relative gene expression level of the sample with respect to that of the model.
- the present disclosure also provides a method for determining the efficacy of a drug designed to treat a disease class, wherein an individual has been subjected to the drug, comprising obtaining a sample from the individual subjected to the drug; assessing the sample for the level of gene expression for at least one gene; and using a model built with a weighted voting scheme to classify the sample into a class of the disease including evaluating the gene expression level of the sample as compared to gene expression level of the model.
- Yet another application is a method of determining whether an individual belongs to a phenotypic class (e.g., intelligence, response to a treatment, length of life, likelihood of viral infection or obesity) that comprises obtaining a sample from the individual; assessing the sample for the level of gene expression for at least one gene; and using a model built with a weighted voting scheme, classifying the sample into a class of the disease including evaluating the gene expression level of the sample as compared to gene expression level of the model.
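- The disclosure does not specify how its weighted voting scheme is constructed. As one possible reading, the sketch below uses a signal-to-noise formulation known from the gene expression classification literature: each gene receives a weight equal to the difference of the class means divided by the sum of the class standard deviations, and votes in proportion to how far the sample's expression lies from the midpoint of the class means. The gene names, expression values, and class labels are hypothetical, and the use of the population standard deviation is a simplifying choice.

```python
from statistics import mean, pstdev

def fit_weighted_voting(class_a, class_b):
    """Learn a (weight, decision boundary) pair for each gene.

    class_a and class_b are lists of samples; each sample maps gene -> level.
    """
    model = {}
    for gene in class_a[0]:
        a = [sample[gene] for sample in class_a]
        b = [sample[gene] for sample in class_b]
        spread = pstdev(a) + pstdev(b)
        if spread == 0:
            continue  # gene carries no usable signal-to-noise estimate
        weight = (mean(a) - mean(b)) / spread
        boundary = (mean(a) + mean(b)) / 2
        model[gene] = (weight, boundary)
    return model

def classify(model, sample, labels=("A", "B")):
    """Sum the weighted votes; a positive total favors the first label."""
    total = sum(w * (sample[g] - b) for g, (w, b) in model.items())
    return (labels[0] if total > 0 else labels[1]), total

# Hypothetical training expression levels for two genes.
responders = [
    {"GENE1": 8.0, "GENE2": 2.0},
    {"GENE1": 9.0, "GENE2": 3.0},
    {"GENE1": 10.0, "GENE2": 1.0},
]
non_responders = [
    {"GENE1": 3.0, "GENE2": 7.0},
    {"GENE1": 4.0, "GENE2": 8.0},
    {"GENE1": 2.0, "GENE2": 6.0},
]
model = fit_weighted_voting(responders, non_responders)
labels = ("responder", "non-responder")
call_1, _ = classify(model, {"GENE1": 8.5, "GENE2": 2.5}, labels)
call_2, _ = classify(model, {"GENE1": 3.5, "GENE2": 7.5}, labels)
print(call_1, call_2)
```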
- Biomarkers may be useful for predicting prognosis of patients with colon cancer.
- the ability to classify patients as high-risk (poor prognosis) or low-risk (favorable prognosis) may enable selection of appropriate therapies for these patients. For example, high-risk patients are likely to benefit from aggressive therapy, whereas therapy may have no significant advantage for low-risk patients.
- the systems and methods described herein that relate to classifying a population based on treatment responsiveness refer to cancers that are treated with chemotherapeutic agents of the classes DNA damaging agents, DNA repair target therapies, inhibitors of DNA damage signaling, inhibitors of DNA damage induced cell cycle arrest, and inhibition of processes indirectly leading to DNA damage, but not limited to these classes.
- chemotherapeutic agents may be considered a "DNA-damage therapeutic agent".
- the patient's analyte data are classified into high-risk and low-risk patient groups, such as patients with a high or low risk of clinical relapse, and the results may be used to determine a course of treatment.
- a patient determined to be a high-risk patient may be treated with adjuvant chemotherapy after surgery.
- for a patient determined to be a low-risk patient, adjuvant chemotherapy may be withheld after surgery.
- the present disclosure provides, in certain aspects, a method for preparing a gene expression profile of a colon cancer tumor that is indicative of risk of recurrence.
- the classifiers described herein stratify a population of individuals between responders and non-responders to treatment.
- the treatment is selected from alkylating agents, plant alkaloids, antitumor antibiotics, antimetabolites, topoisomerase inhibitors, retinoids, checkpoint inhibitor therapy, and VEGF inhibitors.
- Examples of treatments for which a population may be stratified into responders and non-responders include but are not limited to: chemotherapeutic agents including sorafenib, regorafenib, imatinib, eribulin, gemcitabine, capecitabine, pazopanib, lapatinib, dabrafenib, sunitinib, crizotinib, everolimus, torisirolimus, sirolimus, axitinib, gefitinib, anastrozole, bicalutamide, fulvestrant, raltitrexed, pemetrexed, goserelin acetate, erlotinib, vemurafenib, vismodegib, tamoxifen citrate, paclitaxel, docetaxel, cabazitaxel, oxaliplatin, ziv-aflibercept, bevacizumab, trastuzumab, trast
- a population may be stratified into responders and non-responders for checkpoint inhibitor therapies such as compounds that bind to PD-1 or CTLA4.
- a population may be stratified into responders and non-responders for anti-VEGF therapies that bind to VEGF pathway targets.
- a biological condition can include a disease.
- a biological condition can be a stage of a disease.
- a biological condition can be a gradual change of a biological state.
- a biological condition can be a treatment effect.
- a biological condition can be a drug effect.
- a biological condition can be a surgical effect.
- a biological condition can be a biological state after a lifestyle modification. Non-limiting examples of lifestyle modifications include a diet change, a smoking change, and a sleeping pattern change.
- a biological condition is unknown. The analysis described herein can include machine learning to infer an unknown biological condition or to interpret the unknown biological condition.
- the present systems and methods are particularly useful for applications related to colon cancer, i.e., cancer that forms in the tissues of the colon (the longest part of the large intestine). Most colon cancers are adenocarcinomas (cancers that begin in cells that line internal organs and have gland-like properties). Cancer progression is characterized by stages, or the extent of cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes contain cancer, and whether the cancer has spread from the original site to other parts of the body. Stages of colon cancer include stage I, stage II, stage III, and stage IV.
- colon cancer refers to colon cancer at Stage 0, Stage I, Stage II (including Stage IIA or IIB), Stage III (including Stage IIIA, IIIB, or IIIC), or Stage IV.
- the colon cancer is from any stage.
- the colon cancer is a stage I colorectal cancer.
- the colon cancer is a stage II colorectal cancer.
- the colon cancer is a stage III colorectal cancer.
- the colon cancer is a stage IV colorectal cancer.
- Conditions that can be inferred by the disclosed methods include, for example, cancer, gut-associated diseases, immune-mediated inflammatory diseases, neurological diseases, kidney diseases, prenatal diseases, and metabolic diseases.
- a method of the present disclosure can be used to diagnose a cancer.
- cancers include adenoma (adenomatous polyps), sessile serrated adenoma (SSA), advanced adenoma, colorectal dysplasia, colorectal adenoma, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal carcinoid tumors, gastrointestinal stromal tumors (GISTs), lymphomas, and sarcomas.
- Non-limiting examples of cancers that can be inferred by the disclosed methods and systems include acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), adrenocortical carcinoma, Kaposi Sarcoma, anal cancer, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancer, osteosarcoma, malignant fibrous histiocytoma, brain stem glioma, brain cancer, craniopharyngioma, ependymoblastoma, ependymoma, medulloblastoma, medulloepithelioma, pineal parenchymal tumor, breast cancer, bronchial tumor, Burkitt lymphoma, Non-Hodgkin lymphoma, carcinoid tumor, cervical cancer, chordoma, chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), colon cancer, colorectal cancer, cutaneous T-cell lymph
- Non-limiting examples of gut-associated diseases that can be inferred by the disclosed methods and systems include Crohn's disease, colitis, ulcerative colitis (UC), inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), and celiac disease.
- the disease is inflammatory bowel disease, colitis, ulcerative colitis, Crohn's disease, microscopic colitis, collagenous colitis, lymphocytic colitis, diversion colitis, Behcet's disease, or indeterminate colitis.
- Non-limiting examples of immune-mediated inflammatory diseases that can be inferred by the disclosed methods and systems include psoriasis, sarcoidosis, rheumatoid arthritis, asthma, rhinitis (hay fever), food allergy, eczema, lupus, multiple sclerosis, fibromyalgia, type 1 diabetes, and Lyme disease.
- Non-limiting examples of neurological diseases that can be inferred by the disclosed methods and systems include Parkinson's disease, Huntington's disease, multiple sclerosis, Alzheimer's disease, stroke, epilepsy, neurodegeneration, and neuropathy.
- Non-limiting examples of kidney diseases that can be inferred by the disclosed methods and systems include interstitial nephritis, acute kidney failure, and nephropathy.
- Non-limiting examples of prenatal diseases that can be inferred by the disclosed methods and systems include Down syndrome, aneuploidy, spina bifida, trisomy, Edwards syndrome, teratomas, sacrococcygeal teratoma (SCT), ventriculomegaly, renal agenesis, cystic fibrosis, and hydrops fetalis.
- Non-limiting examples of metabolic diseases that can be inferred by the disclosed methods and systems include cystinosis, Fabry disease, Gaucher disease, Lesch-Nyhan syndrome, Niemann-Pick disease, phenylketonuria, Pompe disease, and Tay-Sachs disease.
- EXAMPLE 1 DNA Extraction and Library Preparation for Methylation Conversion with Conversion-Resistant Sequencing Adapters.
- EM-seq workflows may include one or more methylated cytosine nucleotides in adapter molecules to facilitate downstream library amplification and sequencing after deamination.
- these adapter regions, which act as template strands for library amplification, can comprise mismatches in the primer binding sites, thereby leading to low library yields and undesirable PCR products (Figure 1).
- modified cytosines include propynyl-C, pyrrolo-C, and 5mC.
- the conversion-resistant adapters described herein may be used during library preparation, which is a process that takes a DNA molecule and adds the components needed for sequencing. More specifically, contrived cfDNA comprising ~35% methylated DNA was used as target insert molecules for conversion-resistant adapter ligation. cfDNA was prepared for ligation by first performing end repair and A-tailing (ERAT), which digests 3’ overhangs, fills in 5’ overhangs, and adds a single adenosine (A) overhang to the 3’ end. The resulting product was then ligated to Y-adapters (conversion-resistant) appropriate for the indicated method in order to facilitate library PCR and sequencing.
- ERAT end repair and A-tailing
- Y-adapters, also known as stubby adapters, can comprise a partial dsDNA duplex comprising a T overhang at one end of the molecule and ssDNA overhangs on the other end of the molecule, resulting in a Y shape.
- Some of the adapter sequence serves as a template for downstream PCR amplification.
- cytosines were either left unmodified, or modified by the addition of deamination resistant cytosines (e.g., methyl, pyrrolo, or propynyl group).
- the molecules underwent EM-conversion which comprises two operations: oxidation and deamination.
- during EM-conversion, methylated cytosines (5mC) may be oxidized by the TET2 (ten-eleven translocation-2) enzyme.
- This converts methylated cytosines to further states of oxidation (to hydroxymethylcytosine [5hmC], then further to formylcytosine [5fC] and carboxylcytosine [5caC]).
- deamination may then be carried out using the enzyme APOBEC, which can convert cytosines, as well as 5mC and 5hmC, to uracil, thymine, and 5-hydroxymethyluracil (uracil species), respectively.
- the cytosines that were fully oxidized for example to 5fC and 5caC
- the adapters comprise cytosines with modifications such as pyrrolo-C or propynyl-C, those modifications are resistant to deamination and will also not be converted.
- modifications do not require protection during oxidation.
- the adapter ligated cfDNA molecules were amplified by PCR using primers matching the expected final sequence of adapters after EM-conversion.
- Some adapter sequences required for PCR amplification rely on cognate cytosines to be resistant to enzymatic methyl conversion (EM-conversion), while in other modalities primers rely on matching the fully converted sequence.
- Amplification primers also harbor sequences to distinguish between different library prep reactions (index sequences) as well as those required for NGS sequencing.
- Figure 4 provides a graph illustrating a comparison of dsDNA library concentrations (library yields) between differing oxidation reactions (TET2-mediated reactions) comprising: (i) 1X (final reaction concentration) pre-plated thawed Fe2+ (SOP) or (ii) 6X (final reaction concentration) freshly-diluted Fe2+ with 10 ng of input DNA.
- Figure 5 provides graphs illustrating a comparison of CpH protection from EM-seq reactions under differing oxidation conditions (e.g., final reaction Fe2+ concentrations): (i) 1X concentrated pre-plated thawed Fe2+ (SOP); (ii) 1X concentrated freshly-diluted Fe2+; (iii) 3X concentrated freshly-diluted Fe2+; (iv) 6X concentrated freshly-diluted Fe2+; (v) 9X concentrated freshly-diluted Fe2+; (vi) 12X concentrated freshly-diluted Fe2+; (vii) 18X concentrated freshly-diluted Fe2+; (viii) 36X concentrated freshly-diluted Fe2+; and (ix) 216X concentrated freshly-diluted Fe2+.
- SOP 1X concentrated pre-plated thawed Fe2+
- Figure 6 provides graphs illustrating a comparison of CpH protection rates from EM-seq reactions under differing reaction conditions (e.g., final Fe2+ concentrations and master mix pH levels) with different reagent lots: (i) 1X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Left graph); (ii) 3X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Middle graph); (iii) 6X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Right graph).
- EXAMPLE 2 DNA Extraction and Library Preparation for Methylation Conversion with Conversion-Resistant Sequencing Adapters.
- cfDNA extraction is performed by incubating in the presence of carboxyl-coated magnetic beads. Following the incubation, the beads are washed and cfDNA is eluted. The eluted cfDNA is transferred to a plate and frozen at -20°C.
- the cfDNA is thawed and library preparation is performed using the cfDNA from operation 1 (cfDNA extraction). End preparation buffer and end preparation enzyme mix are added to each sample, and the samples are incubated at 20°C for 30 minutes and 65°C for 30 minutes. Following this, the end-prepped cfDNA is incubated with ligation buffer, ligase, and adapters for 15 min at 20°C.
- the adapters are Y-shaped adapters synthesized with 5-methylcytosines, pyrrolo-C and/or propynyl-C in place of cytosines to enable protection of the adapter sequences during conversion operations. Following end repair, A-tailing, and ligation, a SPRI bead cleanup is performed.
- Enzymatic conversion is performed on the eluted DNA by treating the eluted DNA for 1 hour at between 37°C and 43°C with TET2 enzyme and β-glucosyltransferase. Additionally, ferrous iron is added to the reaction at twice the recommended concentration to stabilize TET2 activity. The TET2 reaction is stopped via addition of an oxidation stop buffer and a bead cleanup is performed. At this point, eluted DNA is frozen at -20°C.
- the oxidation-treated cfDNA library is thawed and the samples are heat denatured and treated for 3 hours at 37°C with APOBEC to deaminate unprotected cytosines, thereby converting them to uracils. This is followed with a proteinase K treatment to denature the APOBEC and a 10-minute treatment at 65°C to inactivate the proteinase.
- the samples are amplified via a universal PCR reaction with indexing primers. After a second proteinase K treatment, a bead purification cleanup is performed.
- the library DNA is eluted into elution buffer and is stored at -20°C.
- the cfDNA libraries are used as input for target capture.
- the samples are pooled and concentrated. These library pools are subjected to a bead cleanup, then are eluted in an elution buffer comprising blocker solution and blockers.
- Target enrichment is then performed with biotinylated probes designed to target both fully converted and fully unconverted methyl-cytosines (Twist Fast Hybridization Target Capture kit) following the manufacturer’s instructions wherein the biotinylated panel and hybridization buffer are added to the library pools, which are then incubated at 60°C for 4 hours.
- the probe-library pool mixtures are bound to streptavidin beads, then washed. After washing, the captured library pools and streptavidin beads are resuspended in elution buffer. Following hybridization, the captured DNA fragments are amplified by universal PCR.
- Raw data files are used for alignment and methylation calling to permit targeted methylation analysis for pre-identified regions of the genome.
- FASTQ files are mapped to a reference genome, and methylation scores are calculated for disease classification.
- Featurized data comprising a set of CpG sites associated with healthy, disease, disease state, and treatment responsiveness is processed using machine learning models to identify classifiers that stratify individuals in a population based upon hypermethylation model scores.
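The methylation scoring step mentioned above can be illustrated with a minimal sketch. It assumes reads have already been aligned and reduced to per-CpG base calls, where a read base of C at a reference CpG is taken as protected (methylated) and T as converted (unmethylated). The function names, the input layout, and the region-level averaging are hypothetical choices for illustration, not the scoring method of the disclosure.

```python
from collections import defaultdict


def cpg_methylation_fractions(calls):
    """Return {cpg_position: methylated fraction} from (position, base) calls.

    Assumption: 'C' means the cytosine survived conversion (methylated) and
    'T' means it was converted (unmethylated). Other bases are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [methylated, informative]
    for pos, base in calls:
        if base not in ("C", "T"):
            continue  # uninformative call (e.g., 'N' or a sequencing error)
        counts[pos][1] += 1
        if base == "C":
            counts[pos][0] += 1
    return {pos: meth / total for pos, (meth, total) in counts.items()}


def region_score(fractions):
    """Hypothetical region-level score: mean of the per-CpG fractions."""
    return sum(fractions.values()) / len(fractions) if fractions else 0.0


# Toy input: three reads cover CpG 100, three cover CpG 150 (one uninformative).
calls = [(100, "C"), (100, "C"), (100, "T"), (150, "T"), (150, "T"), (150, "N")]
fractions = cpg_methylation_fractions(calls)
score = region_score(fractions)
```

A score of this kind could serve as one feature per target region; any thresholds or classifier built on it would need to be established separately.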
Abstract
Methods and systems provided herein address limitations in standard nucleic acid library preparation used in conjunction with methylation sequencing. They can increase library yield, achieve more uniform library amplification yields, and reduce undesired PCR products, thereby minimizing signal loss and reducing biases that can be introduced in standard library preparation methods, and can improve the quality and accuracy of nucleic acid methylation sequencing and uses thereof, for example, in detection of disease. More accurate and complete information regarding methylation state permits higher quality feature generation for use in machine learning models and classifier generation, improving the quality, sensitivity, and accuracy of disease detection.
Description
METHODS AND SYSTEMS FOR IMPROVED METHYLATION SEQUENCING
CROSS-REFERENCE
[1] This application claims the benefit of U.S. Provisional Application No. 63/644,266, filed May 8, 2024, U.S. Provisional Application No. 63/697,746, filed September 23, 2024, and U.S. Provisional Application No. 63/703,374, filed October 4, 2024, each of which is incorporated by reference herein in its entirety.
BACKGROUND
[2] Nucleic acid methylation can represent tumor characteristics and phenotypic states and has potential for use in early disease detection and/or diagnosis as well as personalized medicine. For example, deoxyribonucleic acid (DNA) methylation abnormalities may be associated with various stages of cancer, from tumor initiation to cancer progression and metastasis. Aberrant DNA methylation patterns occur early in the pathogenesis of cancer and can provide a mechanism for early cancer detection. These properties enable the use of DNA methylation patterns for cancer diagnosis.
SUMMARY
[3] The present disclosure provides methods and systems for nucleic acid library preparation for methylation sequencing which increase library yield and reduce undesired PCR products, thereby minimizing signal loss and reducing biases that can be introduced in standard library preparation methods. Further, the presently disclosed library preparation methods and systems may improve the quality and accuracy of nucleic acid methylation sequencing and uses thereof, for example, in detection of disease. More accurate and complete information regarding methylation state permits higher quality feature generation for use in machine learning models and classifier generation.
[4] In an aspect, the present disclosure provides methods of preparing a sequencing library for methylation sequencing of one or more nucleic acid molecules of a biological sample or a derivative thereof, comprising: (a) obtaining a nucleic acid composition, wherein the nucleic acid composition comprises a plurality of double-stranded nucleic acid molecules obtained or derived from the biological sample; (b) ligating a conversion-resistant nucleic acid adapter to a double-stranded nucleic acid molecule of the plurality of double-stranded nucleic acid molecules to generate a conversion-resistant adapter-ligated nucleic acid molecule, wherein the nucleic acid adapter comprises modified cytosines that are resistant to base conversion by a methylation conversion method; and (c) subjecting the conversion-resistant adapter-ligated nucleic acid molecule to conditions sufficient to convert unmethylated cytosines to uracils using the methylation conversion method, thereby generating a converted conversion-resistant adapter-ligated nucleic acid molecule.
[5] In some embodiments, the ligating in (b) further comprises treating with a deoxyribonucleic acid (DNA) ligase.
[6] In some embodiments, the conversion-resistant nucleic acid adapter comprises one or more modified cytosines that are deamination-resistant.
[7] In some embodiments, the conversion-resistant nucleic acid adapter comprises one or more modified cytosine bases selected from a group consisting of propynyl-C, pyrrolo-C, and 5-methylcytosine (5mC).
[8] In some embodiments, the one or more modified cytosines that are deamination-resistant are propynyl-C residues.
[9] In some embodiments, the one or more modified cytosines that are deamination-resistant are pyrrolo-C residues.
[10] In some embodiments, the conversion-resistant nucleic acid adapter comprises one or more 5mC residues.
[11] In some embodiments, the conversion-resistant nucleic acid adapter does not comprise methylated cytosine bases or unmethylated cytosine bases.
[12] In some embodiments, the method further comprises amplifying the converted conversion-resistant adapter-ligated nucleic acid molecule.
[13] In some embodiments, the amplifying comprises polymerase chain reaction (PCR).
[14] In some embodiments, the method further comprises determining a nucleic acid sequence of the amplified adapter-ligated nucleic acid molecule or derivative thereof.
[15] In some embodiments, the method further comprises sequencing the amplified adapter-ligated nucleic acid molecule or derivative thereof to generate sequencing data.
[16] In some embodiments, the method further comprises analyzing the sequencing data to generate a methylation profile of the nucleic acid molecule of the biological sample or derivative thereof.
[17] In some embodiments, the analyzing further comprises comparing the sequencing data to a reference sequence.
[18] In some embodiments, the nucleic acid molecule comprises deoxyribonucleic acid (DNA).
[19] In some embodiments, the DNA comprises cell-free DNA.
[20] In some embodiments, the biological sample is a cell-free biological sample.
[21] In some embodiments, the cell-free biological sample is a plasma sample.
[22] In some embodiments, the methylation conversion method comprises a minimally-destructive conversion method treatment with one or more enzymes.
[23] In some embodiments, the minimally-destructive conversion method comprises treatment with a ten eleven translocation (TET) enzyme.
[24] In some embodiments, the treatment with TET enzyme comprises providing reaction conditions comprising 1X-18X freshly-diluted Fe2+.
[25] In some embodiments, the minimally-destructive conversion method does not comprise treatment with bisulfite.
[26] In an aspect, the present disclosure provides methods for performing methylation sequencing of a cell-free deoxyribonucleic acid (cfDNA) sample from a subject, comprising: (a) extracting the cfDNA sample from the subject, wherein said cfDNA sample comprises double-stranded nucleic acid (dsDNA) molecules; (b) ligating conversion-resistant adapters comprising one or more deamination-resistant modified cytosines to the dsDNA molecules, wherein the dsDNA molecules comprise one or more unmethylated cytosine residues; (c) enzymatically converting the one or more unmethylated cytosine residues to uracils in the dsDNA molecules ligated to the conversion-resistant adapters; (d) amplifying the converted dsDNA molecules comprising the conversion-resistant adapters of (c) to produce amplified dsDNA molecules; and (e) determining a nucleic acid sequence of the amplified dsDNA molecules comprising the conversion-resistant adapters of (d).
[27] In some embodiments, the method further comprises: (f) determining a methylation profile of the cfDNA sample from the subject based at least in part on (e); (g) classifying by a trained machine learning algorithm the methylation profile of the cfDNA sample as indicative of a presence of a cancer in the subject; and (h) outputting a report that identifies the cfDNA sample as negative for the cancer if the trained machine learning algorithm classifies the cfDNA sample as negative for the cancer at a specified confidence level.
[28] In some embodiments, the determining the methylation profile comprises performing a hypermethylation analysis.
[29] In some embodiments, the cancer comprises two or more of colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
[30] In some embodiments, the cancer comprises colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
[31] In some embodiments, the cancer comprises the colorectal cancer.
[32] In some embodiments, the cancer comprises the lung cancer.
[33] In some embodiments, the cancer comprises the liver cancer.
[34] In some embodiments, the cancer comprises the pancreatic cancer.
[35] In some embodiments, the method further comprises: (f) determining a baseline methylation profile of the cfDNA sample of the subject at a baseline methylation state based at least in part on (e); (g) determining a test methylation profile of a biological sample of the subject at one or more time points following the baseline methylation state of (f); and (h) determining a change in the test methylation profile as compared to the baseline methylation profile, wherein the change is indicative of a change in a minimal residual disease status of the subject.
[36] In some embodiments, the minimal residual disease status is selected from the group consisting of response to a treatment, tumor load, residual tumor post-surgery, relapse, secondary screen, primary screen, and cancer progression.
[37] In some embodiments, determining the baseline methylation profile comprises a hypermethylation analysis.
[38] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative examples of the present disclosure are shown and disclosed. As will be realized, the present disclosure is capable of other and different examples, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
INCORPORATION BY REFERENCE
[39] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
BRIEF DESCRIPTION OF THE DRAWINGS
[40] The features of the inventive concepts are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present methods and systems will be obtained by reference to the following detailed description that sets forth illustrative examples, in which the principles of the methods and systems are utilized, and the accompanying drawings (also “figure” and “FIG.” herein), of which:
[41] FIG. 1 shows a hypothetical single-stranded deoxyribonucleic acid (ssDNA) template region (primer binding site) of a Y-adapter comprising methylated cytosines ligated to a cfDNA insert. Top: If oxidation of methylated cytosines is complete, cytosine will not be converted after deamination resulting in a perfect match between template and primer. Bottom: If oxidation of methylated cytosine is incomplete, cytosine is converted to thymine or uracil species resulting in template/primer mismatches.
[42] FIGs. 2A-2E provide conversion-resistant modified cytosines (FIGs. 2A-2C) which can be used in the conversion-resistant adapter design strategies disclosed herein. Specific deamination-resistant modified cytosines can include propynyl-C, pyrrolo-C, and 5mC. In FIG. 2D, ssDNA adapter overhangs are annealed with short complementary oligos comprising a 3' blocker (star) to prevent extension by a polymerase. FIG. 2E shows an adapter region comprising a minimal number of unmethylated cytosines being carried through oxidation and deamination, followed by PCR using a primer that anneals to the fully converted sequence.
[43] FIG. 3 provides a comparison of dsDNA library concentration (library yields) based upon differing types of conversion-resistant adapters and alternative adapter design types used during library preparation. 10 nanograms (ng) (Top graphs) or 3.5 ng (Bottom graphs) of contrived cfDNA was ligated to the appropriate adapters. Material was then subjected to TET2-mediated oxidation with DTT (SOP; Left graphs) or without DTT (Right graphs) included in the reaction buffer prior to deamination by APOBEC.
[44] FIG. 4 provides a comparison of dsDNA library concentrations (library yields) between differing oxidation reactions (TET2-mediated reactions) comprising: (i) 1X (final reaction concentration) pre-plated thawed Fe2+ (SOP) or (ii) 6X (final reaction concentration) freshly-diluted Fe2+ with 10 ng of input DNA.
[45] FIG. 5 provides a comparison of CpH protection from EM-seq reactions under differing oxidation conditions (e.g., final reaction Fe2+ concentrations): (i) 1X concentrated pre-plated thawed Fe2+ (SOP); (ii) 1X concentrated freshly-diluted Fe2+; (iii) 3X concentrated freshly-diluted Fe2+; (iv) 6X concentrated freshly-diluted Fe2+; (v) 9X concentrated freshly-diluted Fe2+; (vi) 12X concentrated freshly-diluted Fe2+; (vii) 18X concentrated freshly-diluted Fe2+; (viii) 36X concentrated freshly-diluted Fe2+; and (ix) 216X concentrated freshly-diluted Fe2+.
[46] FIG. 6 provides a comparison of CpH protection rates from EM-seq reactions under differing reaction conditions (e.g., final Fe2+ concentrations and master mix pH levels) with different reagent lots: (i) 1X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Left graph); (ii) 3X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Middle graph); (iii) 6X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Right graph).
[47] FIG. 7 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
Terms
[48] As used herein, singular terms, e.g., “a”, “an”, and “the” include both singular and plural referents unless the context clearly dictates otherwise.
[49] The terms “plasma cell-free DNA”, “circulating free DNA”, “cell-free DNA”, or “cfDNA”, as used herein, may generally refer to DNA molecules that circulate in the acellular portion of blood. Circulating nucleic acids in blood may arise from necrotic or apoptotic cells indicative of disease, such as cancer. In cancer, circulating DNA bears hallmark signs of the disease, including mutations in oncogenes and microsatellite alterations. This circulating DNA may be referred to as circulating tumor DNA (ctDNA). Viral genomic sequences, DNA, or RNA in plasma are potential biomarkers for disease.
[50] In some embodiments, the cell-free fraction of blood may be blood serum or blood plasma. The term “cell-free fraction” of a biological sample as used herein generally refers to a fraction of the biological sample that is substantially free of cells. As used herein, the term “substantially free of cells” may generally refer to a preparation from the biological sample comprising fewer than about 20,000 cells per ml, fewer than about 2,000 cells per ml, fewer than about 200 cells per ml, or fewer than about 20 cells per ml. Genomic DNA (gDNA) refers to non-fragmented DNA that is released from white blood cells contaminating the blood cell-free fraction. To mitigate gDNA from contaminating samples, a highly controlled sample processing workflow may be implemented, and specimens may be screened against the presence of gDNA.
[51] As used herein, the term “nucleic acid” generally refers to a polynucleotide comprising two or more nucleotides. It may be DNA or RNA. Nucleic acid may be genomic or derived from the genome of a eukaryotic or prokaryotic cell, or synthetic, cloned, amplified, or reverse transcribed. In certain embodiments of the methods and compositions, nucleic acid may refer to genomic DNA as the context requires. The nucleic acid may be a polymeric form of nucleotides of any length, either deoxyribonucleotides (dNTPs) or ribonucleotides (rNTPs), or analogs thereof. Nucleic acids may have any three-dimensional structure, and may perform any function, known or unknown. Non-limiting examples of nucleic acids include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA, ribosomal RNA, short interfering RNA (siRNA), short-hairpin RNA (shRNA), micro-RNA (miRNA), ribozymes, cDNA, recombinant nucleic acids, branched nucleic acids, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. A nucleic acid may comprise one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs. If present, modifications to the nucleotide structure may be made before or after assembly of the nucleic acid. The sequence of nucleotides of a nucleic acid may be interrupted by non-nucleotide components. A nucleic acid may be further modified after polymerization, such as by conjugation or binding with a reporter agent. A “variant” nucleic acid is a polynucleotide having a nucleotide sequence identical to that of its original nucleic acid except having at least one nucleotide modified, for example, deleted, inserted, or replaced, respectively. The variant may have a nucleotide sequence with at least about 80%, 90%, 95%, or 99% identity to the nucleotide sequence of the original nucleic acid.
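As a rough illustration of the percent-identity thresholds recited for a “variant” nucleic acid, the sketch below computes identity over two sequences that are assumed to be already aligned to equal length, with "-" marking a gap. The function name and the choice of denominator (alignment length) are assumptions made for illustration; the disclosure does not specify how identity is calculated.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over two pre-aligned, equal-length sequences.

    Assumptions: '-' marks a gap (deletion or insertion) and never counts as
    a match; the denominator is the alignment length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(
        x == y and x != "-"
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


original = "ACGTACGTAC"
variant = "ACGTACGTAA"  # one replaced nucleotide at the last position
identity = percent_identity(original, variant)
meets_90_percent = identity >= 90.0
```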
[52] As used herein, the terms “methylation conversion methods”, “methylation enrichment methods”, or “methylation conversion agents” generally refer to a method in which a nucleic acid molecule is subjected to conditions sufficient to convert unmethylated cytosines in the nucleic acid molecule to uracils. The methods are useful for differentiating methylated cytosines from unmethylated cytosines in a nucleic acid molecule. Methylation conversion methods or methylation conversion agents can include bisulfite conversion, and bisulfite sequencing may be used for DNA methylation analysis. Bisulfite sequencing is a convenient and effective method of mapping DNA methylation to individual bases. Additionally, methylation conversion methods or methylation conversion agents can include enzymatic methylation (EM) conversion. Enzymatic methylation conversion is mediated by non-destructive enzymatic reactions involving a ten-eleven translocation (TET) enzyme and a cytosine-deaminating enzyme (e.g., APOBEC) to convert unmethylated (but not methylated) cytosines to uracils. Other embodiments, such as Tet-assisted pyridine borane sequencing (TAPS), combine enzymatic reactions (e.g., TET oxidation) with chemical treatments (e.g., pyridine borane).
[53] As used herein, the terms “enzymatic methylation” or “enzymatic methyl” or “EM conversion” or “EM-seq” generally refer to a method in which a nucleic acid molecule is subjected to conditions sufficient to convert unmethylated cytosines in the nucleic acid molecule to uracils by treatment with one or more enzymes. In some cases, the method does not comprise treatment with bisulfite (e.g., chemical treatment).
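The effect of such a conversion on sequence data can be sketched in silico. The example below assumes an idealized reaction in which every unmethylated cytosine is converted to uracil (read as T after amplification) and every methylated cytosine is fully protected; the function names and the 0-based position convention are hypothetical and only for illustration.

```python
def em_convert(seq, methylated_positions):
    """Idealized conversion of a single top-strand sequence.

    Assumption: unmethylated C -> U, which is read as T after PCR, while any
    C at a 0-based index in `methylated_positions` is left unchanged.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, read):
    """Return reference C positions that still read as C (called methylated)."""
    return [
        i for i, (ref, obs) in enumerate(zip(reference, read))
        if ref == "C" and obs == "C"
    ]


reference = "ACGTCCGA"
converted = em_convert(reference, methylated_positions={1})
called = call_methylation(reference, converted)
```

Comparing the converted read with the reference recovers the methylated position, which is the basis on which converted reads are interpreted after alignment.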
[54] As used herein, the term “conversion-resistant adapter” generally refers to oligonucleotides used as adapters for nucleic acid library preparation in which one or more of the cytosine (C) residues in the adapter have been replaced with a modified cytosine that is resistant to methylation conversion or alteration by a methylation enrichment method (e.g., a “deamination-resistant modified cytosine” or a “methylation conversion agent resistant nucleotides”).
[55] As used herein, the terms “deamination-resistant modified cytosine” or “methylation conversion agent resistant nucleotides” generally refer to one or more modified cytosine nucleotides in a conversion-resistant adapter that are not chemically or enzymatically altered by treatment with a methylation conversion agent to change the base pairing specificity of the nucleotide base. By way of non-limiting example, propynyl-C, pyrrolo-C, and 5-methylcytosine (5mC) are deamination-resistant modified cytosines which are conversion-resistant nucleotides and can be included in the conversion-resistant adapters used in the library preparation methods disclosed herein in conjunction with sodium bisulfite or enzymatic methylation conversion. Thus, these deamination-resistant modified cytosines within the conversion-resistant adapters are not deaminated when exposed to sodium bisulfite or methylation conversion enzymes.
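The practical consequence for primer binding, as depicted in FIG. 1, can be modeled with a small sketch. It assumes a simplified enzymatic workflow in which a 5mC in the adapter stays C only if it was oxidized before deamination, pyrrolo-C and propynyl-C stay C regardless of oxidation, and unmodified C is always converted. The adapter sequence, the modification labels, and the function names are hypothetical.

```python
# Assumption: these modifications resist deamination without prior oxidation.
OXIDATION_INDEPENDENT = {"pyrrolo-C", "propynyl-C"}


def convert_adapter(seq, modification, unoxidized=frozenset()):
    """Return the adapter template sequence after oxidation and deamination.

    `modification` is the label applied to every C in the adapter.
    `unoxidized` holds 0-based positions of 5mC that escaped oxidation.
    """
    out = []
    for i, base in enumerate(seq):
        if base != "C":
            out.append(base)
        elif modification in OXIDATION_INDEPENDENT:
            out.append("C")  # protected regardless of oxidation
        elif modification == "5mC" and i not in unoxidized:
            out.append("C")  # oxidized 5mC is protected from deamination
        else:
            out.append("T")  # converted to a uracil species, read as T
    return "".join(out)


def mismatches(expected, observed):
    """Count positions where the template differs from what the primer expects."""
    return sum(x != y for x, y in zip(expected, observed))


adapter = "ACACTCTTTCC"  # hypothetical primer-binding region with five C's
complete = convert_adapter(adapter, "5mC")
incomplete = convert_adapter(adapter, "5mC", unoxidized={3, 9})
pyrrolo = convert_adapter(adapter, "pyrrolo-C", unoxidized={3, 9})
```

In this toy model incomplete oxidation of two 5mC residues produces two template/primer mismatches, while the pyrrolo-C adapter is unaffected by the same oxidation failure.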
[56] As used herein, the terms “methylcytosine dioxygenase”, “dioxygenase”, or “oxygenase” generally refer to an enzyme that converts 5mC to 5hmC. Non-limiting examples of methylcytosine dioxygenases include, e.g., ten eleven translocation (TET) enzymes, e.g., TET1, TET2, TET3, Naegleria TET, and genetically engineered versions and/or variants thereof. TET2 is an example of a methylcytosine dioxygenase that oxidizes at least 90%, at least 92%, at least 94%, at least 96%, at least 98%, or at least 99% of all 5mC. In some embodiments, the methylcytosine dioxygenase enzyme (e.g., TET2) requires cofactors for optimal performance of the oxidation reaction (e.g., conversion of 5mC to 5hmC, 5hmC to 5fC, and 5fC to 5caC). In some embodiments, the cofactor is iron (e.g., Fe, Fe2+, or Fe3+), which can be added to the oxidation reaction alone or together with other reagents that can control the pH and the redox state of the oxidation reaction. In some embodiments, the iron is Fe2+. In some embodiments, the Fe2+ is provided as ferrous salt (e.g., ferrous sulfate or ferrous gluconate).
[57] In certain embodiments, the Fe2+ can be stored frozen at up to 10x the reaction concentration (e.g., 400 µM if the final concentration is 40 µM) for various times ranging from 1 hour to 1 year prior to use. In certain embodiments, the concentrated frozen Fe2+ is allowed to thaw, then freshly diluted to a predetermined stock concentration. The freshly diluted Fe2+ can then be added to the reaction site (e.g., well) in an amount to achieve the desired final reaction concentration of Fe2+ together with the oxidation reaction reagents and sample to initiate the oxidation reaction.
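The amount of diluted Fe2+ stock to add follows from the standard dilution relation C1·V1 = C2·V2. The sketch below applies it to the 10x example given above; the 50 µL reaction volume and the function name are assumptions chosen only to make the arithmetic concrete.

```python
def stock_volume_ul(stock_um, final_um, reaction_volume_ul):
    """Volume of stock (µL) needed so the reaction reaches `final_um`.

    Uses C1 * V1 = C2 * V2, where V2 is the total reaction volume.
    """
    if stock_um <= 0 or final_um <= 0 or reaction_volume_ul <= 0:
        raise ValueError("concentrations and volume must be positive")
    if final_um > stock_um:
        raise ValueError("final concentration cannot exceed stock concentration")
    return final_um * reaction_volume_ul / stock_um


# 400 µM stock (10x) and a 40 µM final concentration, as in the example above.
# The 50 µL total reaction volume is an assumed value.
volume = stock_volume_ul(stock_um=400.0, final_um=40.0, reaction_volume_ul=50.0)
```

With a 10x stock, the stock always makes up one tenth of the final reaction volume, here 5 µL of 50 µL.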
[58] In some embodiments, the Fe2+ is freshly prepared at a desired stock concentration. The freshly prepared Fe2+ can then be added to the reaction site (e.g., well) in an amount to achieve the desired final reaction concentration of Fe2+ together with oxidation reaction reagents and sample to initiate the oxidation reaction.
[59] In some embodiments, Fe2+ is plated in the reaction site (e.g., well) at a desired final reaction concentration (“pre-plated Fe2+”) and then frozen. The plated Fe2+ can be stored frozen for various times ranging from 1 hour to 1 year prior to use. Prior to use, the pre-plated Fe2+ is allowed to thaw and the oxidation reaction reagents and sample are then added to the reaction site to initiate the oxidation reaction.
[60] In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (36X) the concentration recited in the standard protocol (SOP). In certain embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (18X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (12X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration of from about (1X) to about (9X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (6X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (1X) to about (3X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (18X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (12X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (3X) to about (9X) the concentration recited in the standard protocol (SOP). In some embodiments, the Fe2+ is added in an amount to have a final reaction concentration from about (6X) to about (9X) the concentration recited in the standard protocol (SOP).
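The fold ranges recited above can be tabulated and checked programmatically. This sketch treats each range as a closed interval of fold values over the SOP concentration, ignoring the tolerance implied by “about”. The 40 µM SOP value is an assumption borrowed from the storage example earlier in this section, and the names are hypothetical.

```python
# Assumed SOP final concentration in µM; substitute the protocol's actual value.
SOP_FE2_UM = 40.0

# Fold ranges recited in the paragraph above, as (low, high) multiples of SOP.
FOLD_RANGES = [
    (1, 36), (1, 18), (1, 12), (1, 9), (1, 6),
    (1, 3), (3, 18), (3, 12), (3, 9), (6, 9),
]


def final_fe2_um(fold, sop_um=SOP_FE2_UM):
    """Absolute final Fe2+ concentration (µM) for a given fold over SOP."""
    return fold * sop_um


def matching_ranges(fold):
    """Return every recited range whose bounds include `fold` (inclusive)."""
    return [(low, high) for low, high in FOLD_RANGES if low <= fold <= high]


six_fold_um = final_fe2_um(6)
ranges_for_six = matching_ranges(6)
```

Under these assumptions a 6X reaction corresponds to 240 µM Fe2+ and falls inside every recited range except 1X to 3X.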
[61] In some embodiments, the pH of the oxidation buffer is assessed, then that pH is used to vary the concentration of Fe2+ used in the reaction. In some embodiments, the oxidation state of the other reaction buffers is assessed to vary the concentration of Fe2+ used in the reaction. In some embodiments, the oxidation state of the iron is assessed to vary the concentration of iron used in the reaction.
[62] In some embodiments, the pH of the oxidation reaction is between about 7.5 and about 8.0. In some embodiments, the pH of the oxidation reaction is between about 7.6 and about 7.9. In some embodiments, the pH of the oxidation reaction is between about 7.75 and about 7.85. In other embodiments, the pH of the oxidation reaction is between about 7.5 and about 7.8.
[63] As used herein, the term “cytidine deaminase” generally refers to an enzyme that deaminates cytosine (C) to form uracil (U). Non-limiting examples of cytidine deaminases include the apolipoprotein B mRNA-editing enzyme, catalytic polypeptide (APOBEC) family of cytidine deaminases, such as APOBEC3A. In any embodiment, a cytidine deaminase described herein may have an amino acid sequence that is at least 90% identical to (e.g., at least 95% identical to) the amino acid sequence of GenBank accession number AKE33285.1, which is the sequence of human APOBEC3A. In some embodiments, a cytidine deaminase described herein converts unmodified cytosine to uracil with an efficiency of at least 95%, 98% or 99%, or at least 99%.
[64] As used herein, the terms “glucosyltransferase” or “GT” generally refer to an enzyme that catalyzes the transfer of a beta-D-glucosyl or alpha-D-glucosyl residue from UDP-glucose to a 5hmC residue to form 5ghmC. APOBEC can convert 5hmC to U at a low rate relative to converting C or 5mC to U. An example of a GT is T4-βGT (βGT). In one example, GT may be used concurrently with a dioxygenase. This combination ensures that deamination of 5hmC is blocked such that less than 5%, less than 3%, or less than 1% of 5hmC is converted to U by the deaminase. In another example, GT may be used together with dioxygenase in the same reaction mix with DNA such that the dioxygenase converts 5mC to 5hmC and 5caC, and the GT converts any residual 5hmC to 5ghmC to ensure only cytosine is deaminated.
[65] As used herein, the term “comparing” generally refers to analyzing two or more sequences relative to one another. In some cases, comparing may be performed by aligning two or more sequences with one another such that correspondingly positioned nucleotides are aligned with one another.
[66] As used herein, the term “reference sequence” generally refers to the sequence of a fragment that is being analyzed. A reference sequence may be obtained from a public database or may be separately sequenced as part of an experiment. In some cases, the reference sequence may be hypothetical such that the reference sequence may be computationally deaminated (e.g., to change Cs into Us or Ts etc.) to allow a sequence comparison to be made.
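By way of non-limiting illustration, the computational deamination of a reference sequence and its comparison with a sequenced fragment can be sketched as follows. The function names and the example sequences are hypothetical and are not part of the disclosure. The sketch assumes a gap-free alignment on a single strand and a conversion chemistry in which unmodified cytosines are read as T while protected cytosines are read as C.

```python
def in_silico_convert(reference: str) -> str:
    """Return the fully converted form of a reference (every C read as T)."""
    return reference.upper().replace("C", "T")


def compare_to_reference(read: str, reference: str) -> list:
    """Classify each reference cytosine by the base observed in an aligned read.

    Assumes read and reference are already aligned, gap-free, of equal
    length, and on the same strand (illustrative simplification).
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be the same length")
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls.append((pos, "protected"))   # consistent with a modified cytosine
        elif read_base == "T":
            calls.append((pos, "converted"))   # consistent with an unmodified cytosine
        else:
            calls.append((pos, "other"))       # mismatch or sequencing error
    return calls


reference = "ACGTCCGA"   # hypothetical reference fragment
read = "ACGTTTGA"        # hypothetical sequenced fragment after conversion

print(in_silico_convert(reference))
print(compare_to_reference(read, reference))
```

For chemistries in which the modified cytosine is the base that is read as T, the two labels would be interpreted the other way around.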
[67] As used herein, the terms “G”, “A”, “T”, “U”, “C”, “5mC”, “5fC”, “5caC”, “5hmC”, and “5ghmC” generally refer to nucleotides that contain guanine (G), adenine (A), thymine (T), uracil (U), cytosine (C), 5-methylcytosine, 5-formylcytosine, 5-carboxylcytosine (5caC), 5-hydroxymethylcytosine, and 5-glucosylhydroxymethylcytosine, respectively. For clarity, each of C, 5fC, 5caC, 5mC, and 5ghmC is a different moiety.
[68] The terms “minimal residual disease” or “MRD” generally refer to the small number of cancer cells in the body after cancer treatment. MRD testing may be performed to determine whether the cancer treatment is working and to guide further treatment plans. Various metrics can be used to assess MRD, including, but not limited to, response to treatment, tumor load, residual tumor post-surgery, relapse, secondary screen, primary screen, and cancer progression.
[69] The term “Next Generation Sequencing” or “NGS”, as used herein, generally applies to sequencing libraries of genomic fragments of a size of less than 1 kb.
[70] As used herein, the terms “detect”, “detecting”, or “detection” of a status or outcome generally includes detecting the presence of an indication (such as cancer), detecting status or outcome, or detecting predisposition to a status or outcome.
[71] As used herein, the terms “diagnose” or “diagnosis” of a status or outcome generally includes predicting or diagnosing the status or outcome, determining predisposition to a status or outcome, monitoring treatment of patient, diagnosing a therapeutic response of a patient, prognosis of status or outcome, progression, and response to particular treatment.
[72] As used herein, the terms “healthy” or “normal” generally refer to a subject not having a disease, or a sample derived therefrom. While health is a dynamic state, the term may refer to the pathological state of a subject that lacks a referenced disease state, for example, cancer. In one example, when referring to a methylation profile that classifies subjects with cancer, the term “healthy” generally refers to an individual lacking cancer, such as CRC. While other diseases or states of health may be present in that subject, the term “healthy” may generally indicate the lack of a stated disease for comparison or classification purposes between subjects having and lacking a disease state, and samples derived therefrom.
[73] As used herein, the term “threshold” generally refers to a value that is selected to discriminate, separate, or distinguish between two populations of subjects. In some embodiments, the threshold discriminates methylation status between a disease (e.g., malignant) state, and a non-disease (e.g., healthy) state. In some embodiments, the threshold discriminates between stages of disease (e.g., stage 1, stage 2, stage 3, or stage 4). Thresholds may be set according to the disease in question, and may be based on earlier analysis, e.g., of a training set or determined computationally on a set of inputs having a known characteristic (e.g., healthy, disease, or stage of disease). Thresholds may also be set for a gene region according to the predictive value of methylation at a particular site. Thresholds may be different for each methylation site, and data from multiple sites may be combined in the end analysis.
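By way of non-limiting illustration, one way a threshold could be determined computationally from a set of inputs having a known characteristic is sketched below. The selection criterion used here (maximizing sensitivity plus specificity minus one, commonly called Youden's J) is an assumption made for illustration and is not recited in the disclosure; the scores are invented example values.

```python
def choose_threshold(healthy, disease):
    """Pick the cut-off that maximizes sensitivity + specificity - 1.

    A sample is called disease-positive when its value is >= the threshold.
    Ties are resolved in favor of the lowest candidate threshold.
    """
    best_threshold, best_j = None, -1.0
    for candidate in sorted(set(healthy) | set(disease)):
        sensitivity = sum(v >= candidate for v in disease) / len(disease)
        specificity = sum(v < candidate for v in healthy) / len(healthy)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


# Hypothetical per-sample methylation scores from a training set.
healthy_scores = [0.05, 0.10, 0.12, 0.20]
disease_scores = [0.18, 0.35, 0.40, 0.55]

threshold, j_statistic = choose_threshold(healthy_scores, disease_scores)
print(threshold, j_statistic)
```

A separate threshold could be computed in the same way for each methylation site before the per-site results are combined.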
[74] As used herein, the term “subject” generally refers to an individual, entity or a medium that has or is suspected of having testable or detectable genetic information or material. A subject can be a person, individual, or patient. The subject can be a vertebrate, such as, for example, a mammal. Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets. The subject may be displaying a symptom(s) indicative of a health or physiological state or condition of the subject, such as a cancer or a stage of a cancer of the subject. As an alternative, the subject can be asymptomatic with respect to such health or physiological state or condition.
[75] As used herein, the term “sample” generally refers to a biological sample obtained from or derived from one or more subjects. Biological samples may be cell-free biological samples or substantially cell-free biological samples, or may be processed or fractionated to produce cell-free biological samples. For example, cell-free biological samples may include cell-free ribonucleic acid (cfRNA), cell-free deoxyribonucleic acid (cfDNA), cell-free protein and/or cell-free polypeptides. A biological sample may be tissue (e.g., tissue obtained by biopsy), blood (e.g., whole blood), plasma, serum, sweat, urine, saliva, or a derivative thereof. Cell-free biological samples may be obtained or derived from subjects using an ethylenediaminetetraacetic acid (EDTA) collection tube, a cell-free RNA collection tube (e.g., Streck), or a cell-free DNA collection tube (e.g., Streck). Cell-free biological samples may be derived from whole blood samples by fractionation. Biological samples or derivatives thereof may comprise cells. For example, a biological sample may be a blood sample or a derivative thereof (e.g., blood collected by a collection tube or blood drops), a tumor sample, a tissue sample, a urine sample, or a cell (e.g., tissue) sample.
DETAILED DESCRIPTION
[76] Provided herein are methods and systems for improved nucleic acid library preparation and sequencing of methylated regions for methylation profiling of nucleic acid, such as cell-free deoxyribonucleic acid (cfDNA). The methods and systems may be directed to double-stranded DNA (dsDNA) library preparation and subsequent methylation sequencing and profiling of nucleic acids from a biological sample which can improve DNA library yield, minimize signal loss, reduce biases, improve coverage, uniformity of coverage, resolution, and accuracy of methylation data to support practical applications such as those described herein.
[77] The resulting sequencing data obtained from methods provided herein may be useful for applications that utilize methylation profiling data for classifying or stratifying a population of individuals. Such classifying or stratifying of a population of individuals may include identifying and/or detecting individuals as having a disease, staging disease progression (including detection of minimal residual disease (MRD)), or determining an individual’s response to a particular treatment for a disease.
[78] As set forth herein, methylation analysis may be coupled with DNA sequencing to determine a likelihood that a sample is normal, tumor-derived, or disease-positive. For example, a relative abundance of methylated or unmethylated DNA fragments that map or align to specific genomic regions may be used to detect or determine disease likelihood.
[79] Such sequencing methods may include DNA library preparation in which enzymatic processes are used to attach sequencing adapters to double-stranded DNA (dsDNA) fragments. In some embodiments set forth herein, the dsDNA library preparation methods comprise attaching conversion-resistant adapters to the dsDNA prior to methyl conversion, amplification, and sequencing. These conversion-resistant adapters may comprise modified cytosines that are deamination-resistant and resistant to methylation conversion or alteration by a methylation enrichment method. Thus, the conversion-resistant adapters described herein can comprise one or more conversion-resistant nucleotides (e.g., deamination-resistant modified cytosines) that are not chemically altered by treatment with a methylation conversion agent to change the base pairing specificity of the nucleotide base. For example, conversion-resistant adapters can comprise one or more deamination-resistant modified cytosines comprising propynyl-C, pyrrolo-C, or 5-methylcytosine (5mC), or a combination thereof, for use in the library preparation methods disclosed herein in conjunction with sodium bisulfite or enzymatic methylation conversion. The conversion-resistant deamination-resistant modified cytosines within the adapters are not deaminated when exposed to sodium bisulfite or a methylation conversion enzyme (e.g., a deamination enzyme). In one embodiment, the conversion-resistant adapters can comprise one or more propynyl-C residues. In another embodiment, the conversion-resistant adapters can comprise one or more pyrrolo-C residues. In still another embodiment, the conversion-resistant adapters can comprise one or more 5mC residues. In still another embodiment, the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and pyrrolo-C residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of pyrrolo-C and 5mC residues.
[80] In certain examples, a prepared cell-free nucleic acid library sequence comprising conversion-resistant adapters can further comprise sequence tags, index barcodes, unique molecular identifiers (UMIs), or combinations thereof that are further ligated to cell-free nucleic acid sample molecules for use in library preparation for NGS approaches including fields such as epigenetics.
[81] The methods and systems set forth herein can further comprise preparing a double-stranded DNA (dsDNA) sequencing library comprising conversion-resistant adapters and subjecting a sample from the dsDNA library to a methylation interrogation treatment method such that unmethylated cytosine bases in the dsDNA are converted to uracil bases.
[82] In certain embodiments, upon attaching the conversion-resistant nucleic acid adapters to the dsDNA, the adapter-ligated dsDNA can be subjected to enzymatic methyl (EM) conversion or bisulfite treatment methods, and amplified (e.g., via polymerase chain reaction (“PCR”)) in order to provide a base-level resolution of DNA methylation via sequencing. In one embodiment, the adapter ligation is performed before methylation base conversion and amplification. In another embodiment, methylation interrogating methods are performed prior to adapter ligation. In certain cases, adapter ligation may be performed before methylation base conversion and amplification rather than after methylation base conversion. Further, enzymatic conversion methods may be used rather than chemical conversion methods (e.g., bisulfite treatment), as chemical treatment may lead to more extensive DNA damage and molecular loss than enzymatic methods.
[83] For certain methods in which dsDNA library preparation is coupled with enzymatic methyl conversion, the conversion-resistant adapter-ligated dsDNA sequences are enzymatically converted, amplified and sequenced after conversion. In such embodiments, the conversion-resistant adapter sequences can comprise one or more modified cytosine residues such that they are resistant to conversion by either chemical or enzymatic methylation conversion methods. These conversion-resistant nucleic acid adapters can comprise deamination-resistant modified cytosines comprising propynyl-C, pyrrolo-C, or 5-methylcytosine (5mC), or a combination thereof. In one embodiment, the conversion-resistant adapters can comprise one or more propynyl-C residues. In another embodiment, the conversion-resistant adapters can comprise one or more pyrrolo-C residues. In still another embodiment, the conversion-resistant adapters can comprise one or more 5mC residues. In still another embodiment, the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and pyrrolo-C residues. In some embodiments, the conversion-resistant adapters can comprise a combination of propynyl-C and 5mC residues. In some embodiments, the conversion-resistant adapters can comprise a combination of pyrrolo-C and 5mC residues.
[84] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present teachings, some exemplary methods and materials are described herein.
[85] The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present claims are not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided can be different from the actual publication dates which can be independently confirmed.
[86] As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which can be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present teachings. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.
DNA Methylation Analysis
[87] Cell-free DNA (cfDNA) sequencing may be a useful tool for cancer detection. DNA methylation analysis may be coupled with sequencing to determine whether a portion of cfDNA is likely to be pre-cancerous or tumor-derived. DNA methylation is a covalent modification of DNA and a stable inherited mark that can play an important role in repressing gene expression and regulating chromatin architecture. In humans, DNA methylation primarily occurs at cytosine residues in CpG dinucleotides. Unlike other dinucleotides, CpGs are not evenly distributed across the genome and can be concentrated in short CpG-rich DNA regions called CpG islands. In general, the majority of the CpG sites in the genome are ~70-75% methylated. However, methylation patterns differ from cell type to cell type, reflecting their role in regulating cell type-specific gene expression. In this manner, a cell’s methylome can program the cell’s terminal differentiation state to be, for instance, a neuron, a muscle cell, an immune cell, etc.
[88] Further, various cell sub-types in a tissue can exhibit different methylation patterns. In cancer cells, CpG methylation can be deregulated, and aberrations in methylation patterns are some of the earliest events that occur in tumorigenesis. Methylation profiles in a given cancer type most closely resemble those of the tissue of origin of the cancer. Thus, aberrant methylation marks on a cfDNA fragment can be used to differentiate a cancer cell from a normal cell, and determine tissue type origin. In general, global CpG methylation levels decrease in cancer cells, but at specific loci, mean methylation levels (or % methylation) can vary at specific CpG sites in cancer cells relative to matched normal cells. Profiling differentially methylated CpGs (DMCs; single sites) or differentially methylated regions (DMRs; more than one site in a localized region) between normal and diseased cells allows identification of biomarkers of the disease. Such an approach has led to development of the SEPT9 gene methylation assay (Epi proColon), which is the first FDA-approved blood-based diagnostic for colorectal cancer (CRC).
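By way of non-limiting illustration, the identification of differentially methylated CpGs from per-site read counts can be sketched as follows. The site names, read counts, and the 0.25 minimum difference are hypothetical values chosen for the example, and a practical analysis would typically also apply a statistical test and a minimum coverage requirement.

```python
def methylation_fraction(methylated: int, unmethylated: int) -> float:
    """Fraction of reads supporting methylation at one CpG site."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no coverage at this site")
    return methylated / total


def differentially_methylated_sites(normal, disease, min_delta=0.25):
    """Return {site: delta} where |disease fraction - normal fraction| >= min_delta.

    Each input maps a site name to (methylated_reads, unmethylated_reads).
    """
    hits = {}
    for site in sorted(normal.keys() & disease.keys()):
        delta = methylation_fraction(*disease[site]) - methylation_fraction(*normal[site])
        if abs(delta) >= min_delta:
            hits[site] = round(delta, 3)
    return hits


# Hypothetical read counts: (methylated, unmethylated) per CpG site.
normal_counts = {"chr1:100": (70, 30), "chr1:150": (10, 90), "chr1:300": (75, 25)}
disease_counts = {"chr1:100": (72, 28), "chr1:150": (60, 40), "chr1:300": (30, 70)}

dmcs = differentially_methylated_sites(normal_counts, disease_counts)
print(dmcs)
```

Adjacent sites returned by such a comparison could then be grouped into a differentially methylated region.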
[89] Bisulfite conversion or bisulfite sequencing may be used for DNA methylation analysis. Bisulfite sequencing is a convenient and effective method of mapping DNA methylation to individual bases. Unfortunately, bisulfite conversion is a harsh and destructive process for cfDNA that leads to degradation of >90% of the sample DNA.
[90] Alternatively, enzymatic methylation (EM) conversion may be used for DNA methylation analysis and sequencing. In one embodiment, methylation conversion is mediated by non-destructive enzymatic reactions involving a ten-eleven translocation (TET) enzyme and a cytosine-deaminating enzyme (e.g., APOBEC) to convert unmethylated (but not methylated) cytosines to uracils. Other embodiments such as Tet-assisted pyridine borane sequencing (TAPS) combine enzymatic reactions such as TET together with chemical treatments (e.g., pyridine borane).
[91] The advent of next generation DNA sequencing offers advances in clinical medicine and basic research. However, while this technology has the capacity to generate hundreds of billions of nucleotides of DNA sequence in a single experiment, the error rate of approximately 1% results in hundreds of millions of sequencing mistakes. Such errors can be tolerated in some applications but become extremely problematic for “deep sequencing” of genetically heterogeneous mixtures, such as tumors or mixed microbial populations. Thus, improved methods for analyzing methylation of cfDNA are needed to preserve the integrity of sample nucleic acid and enable improved accuracy of methylation state analysis at the whole genome or targeted level.
Library Preparation Methods for Methylation Sequencing
[92] Methods are provided herein for the preparation of a sequencing library for methylation sequencing. The methods described herein utilize conversion-resistant adapters to provide a dsDNA library that is acceptable for next generation methylation sequencing applications. The resulting raw sequencing data may be used for methylation state analysis, as well as more conventional cfDNA analysis, such as copy number alterations, germline variant detection, somatic variant detection, nucleosome positioning, transcription factor profiling, chromatin immunoprecipitation, and the like.
[93] In the embodiments described herein for dsDNA library preparation for methylation analysis, one or more of the cytosines present in adapters are replaced with a deamination-resistant modified cytosine selected from propynyl-C, pyrrolo-C, or 5mC to provide conversion-resistant adapters used in dsDNA library preparation. As a result of the use of these conversion-resistant adapters, methylation conversion of the modified cytosine residues in the adapters is prevented. An advantage of this approach is that by utilizing conversion-resistant adapters, the potential for incomplete TET-mediated oxidation of any methylated cytosine residues in the adapters is reduced or eliminated. By removing the potential for incomplete TET-mediated oxidation of any methylated cytosine residues in the adapters, as accomplished by using conversion-resistant adapters described herein, downstream library amplification yields can be increased and a more even product distribution can be achieved. This is a result of decreased aberrant conversion of methylated cytosines occurring in adapter regions that are then used as template strands for subsequent library amplification. In one embodiment, the conversion-resistant adapters can comprise one or more deamination-resistant modified cytosines. In another embodiment, the conversion-resistant adapters can comprise one or more propynyl-C residues. In still another embodiment, the conversion-resistant adapters can comprise one or more pyrrolo-C residues. In other embodiments, the conversion-resistant adapters can comprise one or more 5mC residues. In still another embodiment, the conversion-resistant adapters can comprise a combination of propynyl-C, pyrrolo-C, and 5mC residues.
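By way of non-limiting illustration, the effect of conversion-resistant adapter cytosines on the template used for library amplification can be sketched with a simple string model. The adapter and insert sequences are hypothetical, and the letter "X" is a notation invented for this sketch to stand for any conversion-resistant modified cytosine. The comparison adapter containing only unmodified cytosines is included solely to show the contrast.

```python
def deaminate(molecule: str) -> str:
    """Convert every unprotected cytosine (C) to uracil (U); X is left untouched."""
    return molecule.replace("C", "U")


def read_after_pcr(molecule: str) -> str:
    """Bases as read after amplification: U is copied as T, X is copied as C."""
    return molecule.replace("U", "T").replace("X", "C")


adapter_sequence = "ACACGACGCT"                           # hypothetical adapter
unprotected_adapter = adapter_sequence                     # ordinary cytosines
resistant_adapter = adapter_sequence.replace("C", "X")     # modified cytosines
insert = "TTCGGACA"                                        # hypothetical unmethylated insert

unprotected_read = read_after_pcr(deaminate(unprotected_adapter + insert))
resistant_read = read_after_pcr(deaminate(resistant_adapter + insert))

# A primer designed against the original adapter sequence still matches
# only when the adapter cytosines resisted conversion.
print(unprotected_read.startswith(adapter_sequence))
print(resistant_read.startswith(adapter_sequence))
```

In this model the insert is converted identically in both cases; only the adapter region, which serves as the priming site, differs.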
[94] In one embodiment, the conversion-resistant nucleic acid adapters are ligated to the 5’ and 3’ ends of a population of nucleic acid fragments in a biological sample to produce a sequencing library. In another example, a collection of conversion-tolerant nucleic acid adapters is ligated to the nucleic acid fragments in a sample. In another example, unique dual indexes (UDI) are additional sequences that may be added to the adapters during library preparation. In various examples, the UDI sequences are any length.
Enzymatic conversion for DNA Methylation Sequencing Applications
[95] Bisulfite conversion may be damaging to input DNA and results in overall yield loss, fragmentation and biased sequencing data. As an alternative, enzymatic methyl conversion can be used in methylation sequencing workflows. Examples of enzymatic methyl conversion workflows include enzymatic methyl-seq (EM-seq) and Tet-assisted pyridine borane sequencing (TAPS).
[96] As described herein, the methods for dsDNA library preparation for methylation analysis comprise conversion-resistant adapters wherein one or more of the cytosines present in adapters are replaced with a deamination-resistant modified cytosine selected from propynyl-C, pyrrolo-C, or 5mC to provide conversion-resistant adapters used in dsDNA library preparation. As a result of the use of these conversion-resistant adapters, methylation conversion of the modified cytosine residues in the adapters is prevented. An advantage of this approach is that by utilizing conversion-resistant adapters, the potential for incomplete TET-mediated oxidation of any methylated cytosine residues in the adapters is reduced or eliminated. By removing the potential for incomplete TET-mediated oxidation of any methylated cytosine residues in the adapters, as accomplished by using conversion-resistant adapters described herein, downstream library amplification yields can be increased, and more uniform library amplification yields can be achieved as a result of decreased aberrant conversion of methylated cytosines occurring in adapter regions that are then used as template strands for subsequent library amplification.
[97] EM-seq is a minimally destructive conversion methylation sequencing method for converting cytosines to uracil in nucleic acid. This bisulfite-free method preserves the length of nucleic acid molecules while achieving conversion rates similar to bisulfite sequencing. Further, EM-seq can result in higher sequencing quality scores for cytosine and guanine base pairs and can provide a more even coverage of various genomic features, such as CpG islands. EM-seq comprises two sets of enzymatic reactions. In the initial reaction, a ten-eleven translocation (TET) enzyme (e.g., TET1, TET2, TET3, Naegleria TET, and genetically engineered versions and/or variants thereof) and a β-glucosyltransferase (e.g., T4-BGT) convert 5mC and 5hmC into products that cannot be deaminated by a cytosine-deaminating enzyme (e.g., APOBEC). In the second reaction, a cytosine-deaminating enzyme (e.g., APOBEC) deaminates unmodified cytosines by converting them to uracils.
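By way of non-limiting illustration, the two sets of enzymatic reactions can be modeled on a list of base tokens as sketched below. The token names follow the abbreviations used herein. Treating every 5mC as ending in 5caC and every 5hmC as ending in 5ghmC is a simplifying assumption, as is the assumption that both reactions run to completion.

```python
# Step one (TET plus glucosyltransferase): modified cytosines become
# products that the deaminase cannot act on (simplified end states).
STEP_ONE_PRODUCTS = {"5mC": "5caC", "5hmC": "5ghmC"}

# How each base is read after amplification and sequencing.
READOUT = {"U": "T", "5caC": "C", "5ghmC": "C"}


def oxidize_and_glucosylate(bases):
    """First reaction: protect 5mC and 5hmC; other bases are unchanged."""
    return [STEP_ONE_PRODUCTS.get(base, base) for base in bases]


def deaminate(bases):
    """Second reaction: only unmodified cytosine is converted to uracil."""
    return ["U" if base == "C" else base for base in bases]


def sequenced_as(bases):
    """Collapse the converted molecule to the four-letter read that is observed."""
    return "".join(READOUT.get(base, base) for base in bases)


# Hypothetical fragment carrying one 5mC and one 5hmC.
fragment = ["A", "C", "G", "5mC", "G", "T", "5hmC", "G", "C"]

converted = deaminate(oxidize_and_glucosylate(fragment))
print(converted)
print(sequenced_as(converted))
```

In the resulting read, a retained C marks a position that carried 5mC or 5hmC, and a T at a reference C position marks an unmodified cytosine.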
[98] In another embodiment, Tet-assisted pyridine borane sequencing (TAPS) can be used in enzymatic methylation sequencing workflows.
[99] In TAPS, a ten-eleven translocation enzyme (TET1) is used to oxidize both 5mC and 5hmC to 5caC. Pyridine borane is used to reduce 5caC to dihydrouracil, a uracil derivative that is then converted to thymine after PCR. TAPS can be performed in two other ways: TAPSβ and chemical-assisted pyridine borane sequencing (CAPS). In TAPSβ, β-glucosyltransferase is used to label 5hmC with glucose to protect 5hmC from the oxidation and reduction reactions and allows for specific detection of 5mC. In CAPS, potassium perruthenate acts as the chemical replacement for TET1 and specifically oxidizes 5hmC, thus allowing for direct detection.
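By way of non-limiting illustration, the read-out of the three workflows described above can be compared with a small lookup model. The variant keys and the function name are hypothetical, intermediate chemical species are not modeled, and every reaction is assumed to run to completion.

```python
# Which modified cytosines end up read as T in each workflow.
READ_AS_T = {
    "TAPS": {"5mC", "5hmC"},   # both are oxidized, reduced, and read as T
    "TAPS_BETA": {"5mC"},      # 5hmC is glucose-protected, so only 5mC is read as T
    "CAPS": {"5hmC"},          # chemical oxidation acts on 5hmC only
}


def pyridine_borane_readout(bases, variant="TAPS"):
    """Return the four-letter read expected for a fragment under one workflow."""
    converted = READ_AS_T[variant]
    read = []
    for base in bases:
        if base in converted:
            read.append("T")
        elif base in ("5mC", "5hmC"):
            read.append("C")   # modified cytosine left unconverted
        else:
            read.append(base)  # unmodified bases, including C, are unchanged
    return "".join(read)


# Hypothetical fragment carrying one 5mC and one 5hmC.
fragment = ["A", "C", "G", "5mC", "G", "T", "5hmC", "G", "C"]

for variant in ("TAPS", "TAPS_BETA", "CAPS"):
    print(variant, pyridine_borane_readout(fragment, variant))
```

Unlike deaminase-based conversion, unmodified cytosines remain C in all three read-outs, so most of the sequence complexity of the fragment is retained.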
[100] In one example of enzymatic conversion, the combination of enzymatic conversion of unmodified C to U, and staggering UMI adapters in line with the library insert, can be useful for targeted sequencing of methylation libraries. For low-depth sequencing applications, this combination may permit reduced volume inputs of plasma or mass inputs of cfDNA as compared to bisulfite conversion sequencing because sample cfDNA is not degraded to the same extent.
[101] For high-depth sequencing applications, higher depth sequencing may be obtained as compared to bisulfite conversion sequencing from similar inputs of plasma or cfDNA because cfDNA is not degraded to the same extent.
[102] In one embodiment of the present inventive concepts, a method is provided for performing methylation sequencing of a cell-free DNA (cfDNA) sample from a subject, comprising: a) extracting a cfDNA sample from the subject, wherein said cfDNA comprises double-stranded nucleic acid (dsDNA) molecules; b) ligating conversion-resistant adapters comprising one or more deamination-resistant modified cytosines to said dsDNA molecules, wherein the dsDNA molecules comprise one or more unmethylated cytosine residues; c) enzymatically converting the one or more unmethylated cytosine residues to uracils in the dsDNA molecules ligated to the conversion-resistant adapters; d) amplifying the converted dsDNA molecules comprising the conversion-resistant adapters of c); and e) determining the nucleic acid sequence of the amplified dsDNA molecules comprising the conversion-resistant adapters of d) at a depth of >50x.
[103] In one example, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of between about 50x to about 500x, about 25x to about 1000x, about 50x to about 500x, about 250x to about 750x, about 500x to about 200x, about 750x to about 1500x, or about 100x to about 2000x. In some embodiments, a nucleic acid sequence is sequenced at a depth of >100x or >500x. In some embodiments, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 50x, about 60x, about 70x, about 80x, about 90x, about 100x, about 110x, about 120x, about 130x, about 140x, about 150x, about 160x, about 170x, about 180x, about 190x, about 200x, about 210x, about 220x, about 230x, about 240x, about 250x, about 275x, about 300x, about 325x, about 350x, about 375x, about 400x, about 425x, about 450x, about 475x, about 500x, about 525x, about 550x, about 575x, about 600x, about 625x, about 650x, about 675x, about 700x, about 725x, about 750x, about 775x, about 800x, about 825x, about 850x, about 875x, about 900x, about 925x, about 950x, about 975x, or about 1000x. In some embodiments, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 1100x, about 1200x, about 1300x, about 1400x, about 1500x, about 1600x, about 1700x, about 1800x, about 1900x, or about 2000x.
[104] In one example, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 500x, about 1000x, about 2000x, about 3000x, about 4000x, about 5000x, about 6000x, about 7000x, about 8000x, about 9000x, about 10000x, or greater than 5000x.
[105] In one example, the nucleic acid sequence of the converted nucleic acid molecules is sequenced at a depth of about 300x unique, about 400x unique, about 500x unique, about 600x unique, about 700x unique, about 800x unique, about 900x unique, or about 1000x unique, or greater than 500x unique.
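By way of non-limiting illustration, the relationship between read count, read length, target size, and depth can be sketched as follows. The panel size, read length, read count, and duplicate fraction are hypothetical example values, and the model assumes uniform coverage with every read fully on target, which real libraries only approximate.

```python
import math


def mean_depth(num_reads: int, read_length: int, target_bases: int) -> float:
    """Average depth: total sequenced bases divided by the size of the target."""
    return num_reads * read_length / target_bases


def reads_needed(depth: float, read_length: int, target_bases: int) -> int:
    """Smallest whole number of reads that reaches the requested mean depth."""
    return math.ceil(depth * target_bases / read_length)


def unique_depth(raw_depth: float, duplicate_fraction: float) -> float:
    """Depth remaining after duplicate reads are collapsed."""
    return raw_depth * (1.0 - duplicate_fraction)


target_bases = 1_000_000   # hypothetical targeted panel size
read_length = 150          # hypothetical read length in bases

raw = mean_depth(5_000_000, read_length, target_bases)
print(raw)
print(unique_depth(raw, 0.2))
print(reads_needed(500, read_length, target_bases))
```

Under these assumptions, 5,000,000 reads give a raw depth of 750x, and collapsing 20% duplicates leaves about 600x unique depth.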
Methylation Analysis
[106] In various examples, enzymatic methylation sequencing results generated using the dsDNA library preparation methods described herein are used to analyze the methylation state of nucleic acids in a biological sample. In one example, whole genome enzymatic methyl sequencing ("WG EM-seq") provides high resolution sequencing by characterizing DNA methylation of nearly every cytidine nucleotide in the genome. Other targeted methods, such as targeted enzymatic methyl sequencing ("TEM-seq"), may be useful for methylation analysis.
[107] In other examples, assays that have conventionally been used for bisulfite conversion can be employed for minimally-destructive conversion methods, such as enzymatic conversion, TAPS, and CAPS. In various examples, assays used for methylation analysis may be mass spectrometry, methylation-specific PCR (MSP), reduced representation bisulfite sequencing (RRBS), HELP assay, GLAD-PCR assay, ChIP-on-chip assays, restriction landmark genomic scanning, methylated DNA immunoprecipitation (MeDIP), pyrosequencing of bisulfite treated DNA, molecular break light assay, methyl sensitive Southern Blotting, High Resolution Melt Analysis (HRM or HRMA), ancient DNA methylation reconstruction, or Methylation Sensitive Single Nucleotide Primer Extension Assay (msSNuPE).
[108] The methylation profile of cfDNA can then be identified by applying sequence alignment methods to map methyl-seq reads from whole genome or targeted methyl sequencing to a human reference genome. Non-limiting examples of sequence alignment methods include bwa-meth, bismark, Last, GSNAP, BSMAP, NovoAlign, Bison, Metagenomic Phylogenetic Analysis (for example, MetaPhlAn2), BLAT, Burrows-Wheeler Aligner (BWA), Bowtie, Bowtie2, Bfast, BioScope, CLC bio, Cloudburst, Eland/Eland2, GenomeMapper, GnuMap, Karma, MAQ, MOM, Mosaik, MrFAST/MrsFAST, PASS, PerM, RazerS, RMAP, SSAHA2, Segemehl, SeqMap, SHRiMP, Slider/SliderII, Srprism, Stampy, vmatch, ZOOM, and the SOAP/SOAP2 alignment tool.
COMPUTER SYSTEMS AND MACHINE LEARNING METHODS
A. Sample Features
[109] As it relates to machine learning and pattern recognition, the term "feature", as used herein, may generally refer to an individual measurable property or characteristic of a phenomenon being observed. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.
[110] In one embodiment, the features are inputted into a feature matrix for machine learning analysis.
[111] For a plurality of assays, the system identifies feature sets to input to a machine learning model. The system performs an assay on each molecule class and forms a feature vector from the measured values. The system inputs the feature vector into the machine learning model and obtains an output classification of whether the biological sample has a specified property.
[112] In one embodiment, the machine learning model outputs a classifier that distinguishes between two groups or classes of individuals or features in a population of individuals or features of the population. In one embodiment, the classifier is a trained machine learning classifier.
[113] In one embodiment, the informative loci or features of biomarkers in a cancer tissue are assayed to form a profile. Receiver Operating Characteristic (ROC) curves are useful for plotting the performance of a particular feature (e.g., any of the biomarkers described herein and/or any item of additional biomedical information) in distinguishing between two populations (e.g., individuals responding and not responding to a therapeutic agent). The feature data across the entire population (e.g., the cases and controls) may be sorted in ascending order based on the value of a single feature.
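The ROC construction described above can be sketched as follows. This is a minimal illustration only, assuming a single numeric feature and binary labels (1 for cases, 0 for controls); the function names `roc_points` and `auc` are hypothetical and are not part of the disclosure.

```python
def roc_points(values, labels):
    """Sweep a threshold over one feature and return (FPR, TPR) points.

    values: one feature value per individual.
    labels: 1 for a case, 0 for a control (assumed encoding).
    A sample is called positive when its value is strictly greater
    than the current threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")

    pairs = sorted(zip(values, labels))  # ascending by feature value
    tp, fp = n_pos, n_neg                # threshold below every value
    points = [(1.0, 1.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Raise the threshold to this value: every tied sample turns negative.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp -= 1
            else:
                fp -= 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

A feature that separates the two populations perfectly gives an area of 1.0 under this sketch, while a feature that ranks every control above every case gives 0.0.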
[114] In some embodiments, the condition is advanced adenoma (AA), colorectal cancer (CRC), colorectal carcinoma, or inflammatory bowel disease.
[115] The term "input features" or "features" as used herein generally refers to variables that are used by the model to predict an output classification (label) of a sample, e.g., a condition, sequence content (e.g., mutations), suggested data collection operations, or suggested treatments. Values of the variables can be determined for a sample and used to determine a classification. Examples of input features of genetic data include: aligned variables that relate to alignment of sequence data (e.g., sequence reads) to a genome and non-aligned variables, e.g., that relate to the sequence content of a sequence read, a measurement of protein or autoantibody, or the mean methylation level at a genomic region.
[116] In various examples, genetic features such as V-plot measures, FREE-C, the cfDNA measurement over a transcription start site, and DNA methylation levels over cfDNA fragments are used as input features for machine learning methods and models.
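As one illustration of a methylation-derived input feature, the mean methylation level at a genomic region can be computed from per-cytosine counts. The sketch below is an assumption-laden example, not the disclosed method: it assumes calls arrive as (chromosome, position, methylated reads, total reads) tuples, treats the region as half-open, and pools reads across sites (a read-weighted mean). The name `mean_methylation` is hypothetical.

```python
def mean_methylation(calls, region):
    """Read-weighted mean methylation level within one genomic region.

    calls:  iterable of (chrom, pos, methylated_reads, total_reads) tuples
            (an assumed input layout).
    region: (chrom, start, end), half-open interval [start, end).
    Returns a fraction in [0, 1], or None if the region has no coverage.
    """
    chrom, start, end = region
    methylated = 0
    total = 0
    for c, pos, m, n in calls:
        if c == chrom and start <= pos < end:
            methylated += m
            total += n
    return methylated / total if total else None
```

An unweighted mean of per-site fractions is an equally plausible choice; the pooled form used here simply gives more weight to sites with deeper coverage.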
[117] In one example, the sequencing information includes information regarding a plurality of genetic features such as, but not limited to, transcription start sites, transcription factor binding sites, chromatin open and closed states, nucleosomal positioning or occupancy, and the like.
B. Data Analysis
[118] In some embodiments, the present disclosure provides a system, method, or kit having data analysis realized in software applications, computing hardware, or both. In various embodiments, the analysis application or system includes at least a data receiving module, a data pre-processing module, a data analysis module (which can operate on one or more types of genomic data), a data interpretation module, or a data visualization module. In one embodiment, the data receiving module can comprise computer systems that connect laboratory hardware or instrumentation with computer systems that process laboratory data. In one embodiment, the data pre-processing module can comprise hardware systems or computer software that performs operations on the data in preparation for analysis. Examples of operations that can be applied to the data in the pre-processing module include affine transformations, denoising operations, data cleaning, reformatting, or subsampling. A data analysis module, which can be specialized for analyzing genomic data from one or more genomic materials, can, for example, take assembled genomic sequences and perform probabilistic and statistical analysis to identify abnormal patterns related to a disease, pathology, state, risk, condition, or phenotype. A data interpretation module can use analysis methods, for example, drawn from statistics, mathematics, or biology, to support understanding of the relation between the identified abnormal patterns and health conditions, functional states, prognoses, or risks. A data visualization module can use methods of mathematical modeling, computer graphics, or rendering to create visual representations of data that can facilitate the understanding or interpretation of results.
[119] In various embodiments, machine learning methods are applied to distinguish samples in a population of samples. In one embodiment, machine learning methods are applied to distinguish samples between healthy and advanced adenoma samples.
[120] In one embodiment, the one or more machine learning operations used to train the methylation-based prediction engine include one or more of: a generalized linear model, a generalized additive model, a non-parametric regression operation, a random forest classifier, a spatial regression operation, a Bayesian regression model, a time series analysis, a Bayesian network, a Gaussian network, a decision tree learning operation, an artificial neural network, a recurrent neural network, a reinforcement learning operation, linear/non-linear regression operations, a support vector machine, a clustering operation, and a genetic algorithm operation.
[121] In various embodiments, computer processing methods are selected from logistic regression, multiple linear regression (MLR), dimension reduction, partial least squares (PLS) regression, principal component regression, autoencoders, variational autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, decision tree, classification and regression trees (CART), tree-based methods, random forest, gradient boost tree, matrix factorization, multidimensional scaling (MDS), dimensionality reduction methods, t-distributed stochastic neighbor embedding (t-SNE), multilayer perceptron (MLP), network clustering, neuro-fuzzy, and artificial neural networks.
[122] In some embodiments, the methods disclosed herein can include computational analysis on nucleic acid sequencing data of samples from an individual or from a plurality of individuals. An analysis can identify a variant inferred from sequence data to identify sequence variants based on probabilistic modeling, statistical modeling, mechanistic modeling, network modeling, or statistical inferences. Non-limiting examples of analysis methods include principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, regression, support vector machines, tree-based methods, networks, matrix factorization, and clustering. Non-limiting examples of variants include a germline variation or a somatic mutation. In some embodiments, a variant can refer to an already-known variant. The already-known variant can be scientifically confirmed or reported in literature. In some embodiments, a variant can refer to a putative variant associated with a biological change. A biological change can be known or unknown. In some embodiments, a putative variant can be reported in literature, but not yet biologically confirmed.
[123] Alternatively, a putative variant is not reported in literature, but can be inferred based on a computational analysis disclosed herein. In some embodiments, germline variants can refer to nucleic acids that induce natural or normal variations.
[124] Natural or normal variations can include, for example, skin color, hair color, and normal weight. In some embodiments, somatic mutations can refer to nucleic acids that induce acquired or abnormal variations. Acquired or abnormal variations can include, for example, cancer, obesity, conditions, symptoms, diseases, and disorders. In some embodiments, the analysis can include distinguishing between germline variants. Germline variants can include, for example, private variants and somatic mutations. In some embodiments, the identified variants can be used by clinicians or other health professionals to improve health care methodologies, accuracy of diagnoses, and cost reduction.
[125] Also provided herein are improved methods and computing systems or software media that can distinguish among sequence errors in nucleic acid introduced through amplification and/or sequencing techniques, somatic mutations, and germline variants. Methods provided can include simultaneously calling and scoring variants from aligned sequencing data of all samples obtained from a patient.
[126] Samples obtained from subjects other than the patient can also be used. Other samples can also be collected from subjects previously analyzed by a sequencing assay or a targeted sequencing assay (e.g., a targeted resequencing assay). Methods, computing systems, or software media disclosed herein can improve identification and accuracy of variations or mutations (e.g., germline or somatic, including copy number variations, single nucleotide variations, indels, and gene fusions), and lower limits of detection by reducing the number of false positive and false negative identifications.
C. Classifier Generation
[127] In one aspect, the present systems and methods provide a classifier generated based on feature information derived from methylation sequence analysis from biological samples of cfDNA prepared by the ssDNA library preparation methods described herein. The classifier forms part of a predictive engine for distinguishing groups in a population based on methylation sequence features identified in biological samples such as cfDNA.
[128] In one embodiment, a classifier is created by normalizing the methylation information by formatting similar portions of the methylation information into a unified format and a unified scale; storing the normalized methylation information in a columnar database; training a methylation prediction engine by applying one or more machine learning operations to the stored normalized methylation information, the methylation prediction engine mapping, for a particular population, a combination of one or more features; applying the methylation prediction engine to the accessed field information to identify a methylation associated with a group; and classifying the individual into a group.
[129] Specificity may be defined as the probability of a negative test among those who are free from the disease. Specificity is equal to the number of disease-free persons who tested negative divided by the total number of disease-free individuals.
[130] In various embodiments, the model, classifier, or predictive test has a specificity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
[131] Sensitivity may be defined as the probability of a positive test among those who have the disease. Sensitivity is equal to the number of diseased individuals who tested positive divided by the total number of diseased individuals.
[132] In various embodiments, the model, classifier, or predictive test has a sensitivity of at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
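The definitions of sensitivity and specificity given above translate directly into counts from a confusion matrix. The following is a minimal sketch, assuming binary labels where 1 denotes disease and 0 denotes disease-free; the function name is hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    y_true: 1 if the individual has the disease, 0 if disease-free.
    y_pred: 1 if the test is positive, 0 if negative.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one disease-free individual")
    sensitivity = tp / (tp + fn)  # positives among the diseased
    specificity = tn / (tn + fp)  # negatives among the disease-free
    return sensitivity, specificity
```

For example, with four diseased individuals of whom three test positive, and six disease-free individuals of whom five test negative, the sketch returns a sensitivity of 0.75 and a specificity of 5/6.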
[133] In one embodiment, the group is selected from healthy (asymptomatic), cancer, gut-associated diseases, immune-mediated inflammatory diseases, neurological diseases, kidney diseases, prenatal diseases, and metabolic diseases.
D. Digital Processing Device
[134] In some embodiments, the subject matter described herein can include a digital processing device or use of the same. In some embodiments, the digital processing device can include one or more hardware central processing units (CPU), graphics processing units (GPU), or tensor processing units (TPU) that carry out the device's functions. In some embodiments, the digital processing device can include an operating system configured to perform executable instructions. In some embodiments, the digital processing device can optionally be connected to a computer network. In some embodiments, the digital processing device can be optionally connected to the Internet such that it accesses the World Wide Web. In some embodiments, the digital processing device can be optionally connected to a cloud computing infrastructure. In some embodiments, the digital processing device can be optionally connected to an intranet. In some embodiments, the digital processing device can be optionally connected to a data storage device.
[135] Non-limiting examples of suitable digital processing devices include server computers, desktop computers, laptop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top computers, handheld computers, Internet appliances, mobile smartphones, and tablet computers. Suitable tablet computers can include, for example, those with booklet, slate, and convertible configurations.
[136] In some embodiments, the digital processing device can include an operating system configured to perform executable instructions. For example, the operating system can include software, including programs and data, which manages the device's hardware and provides services for execution of applications. Non-limiting examples of operating systems include FreeBSD, OpenBSD, NetBSD®, Linux, Apple® Mac OS X Server®, Oracle® Solaris®, Windows Server®, and Novell® NetWare®. Non-limiting examples of suitable personal computer operating systems include Microsoft® Windows®, Apple® Mac OS X®, UNIX®, and UNIX-like operating systems such as GNU/Linux®. In some embodiments, the operating system can be provided by cloud computing, and cloud computing resources can be provided by one or more service providers.
[137] In some embodiments, the device can include a storage and/or memory device. The storage and/or memory device can be one or more physical apparatuses used to store data or programs on a temporary or permanent basis. In some embodiments, the device can be volatile memory and require power to maintain stored information. In some embodiments, the volatile memory can include dynamic random-access memory (DRAM). In some embodiments, the device can be non-volatile memory and retain stored information when the digital processing device is not powered. In some embodiments, the non-volatile memory can include flash memory. In some embodiments, the non-volatile memory can include ferroelectric random access memory (FRAM). In some embodiments, the non-volatile memory can include phase-change random access memory (PRAM). In some embodiments, the device can be a storage device including, for example, CD-ROMs, DVDs, flash memory devices, magnetic disk drives, magnetic tape drives, optical disk drives, and cloud computing-based storage. In some embodiments, the storage and/or memory device can be a combination of devices such as those disclosed herein.
[138] In some embodiments, the digital processing device can include a display to send visual information to a user. In some embodiments, the display can be a cathode ray tube (CRT). In some embodiments, the display can be a liquid crystal display (LCD). In some embodiments, the display can be a thin film transistor liquid crystal display (TFT-LCD). In some embodiments, the display can be an organic light emitting diode (OLED) display. In some embodiments, an OLED display can be a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some embodiments, the display can be a plasma display. In some embodiments, the display can be a video projector. In some embodiments, the display can be a combination of devices such as those disclosed herein.
[139] In some embodiments, the digital processing device can include an input device to receive information from a user. In some embodiments, the input device can be a keyboard. In some embodiments, the input device can be a pointing device including, for example, a mouse, trackball, track pad, joystick, game controller, or stylus. In some embodiments, the input device can be a touch screen or a multi-touch screen. In some embodiments, the input device can be a microphone to capture voice or other sound input. In some embodiments, the input device can be a video camera to capture motion or visual input. In some embodiments, the input device can be a combination of devices such as those disclosed herein.
Non-transitory computer-readable storage medium
[140] In some embodiments, the subject matter disclosed herein can include one or more non-transitory computer-readable storage media encoded with a program including instructions executable by the operating system of an optionally networked digital processing device. In some embodiments, a computer-readable storage medium can be a tangible component of a digital processing device. In some embodiments, a computer-readable storage medium can be optionally removable from a digital processing device. In some embodiments, a computer-readable storage medium can include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, magnetic disk drives, magnetic tape drives, optical disk drives, cloud computing systems and services, and the like. In some embodiments, the program and instructions can be permanently, substantially permanently, semi-permanently, or non-transitorily encoded on the media.
Computer Systems
[141] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 7 shows a computer system 701 that is programmed or otherwise configured to store, process, identify, or interpret patient data, biological data, biological sequences, or reference sequences. The computer system 701 can process various aspects of patient data, biological data, biological sequences, or reference sequences of the present disclosure. The computer system 701 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
[142] The computer system 701 includes a central processing unit (CPU, also "processor" and "computer processor" herein) 705, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 701 also includes memory or memory location 710 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 715 (e.g., hard disk), communication interface 720 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 725, such as cache, other memory, data storage and/or electronic display adapters. The memory 710, storage unit 715, interface 720, and peripheral devices 725 are in communication with the CPU 705 through a communication bus (solid lines), such as a motherboard. The storage unit 715 can be a data storage unit (or data repository) for storing data. The computer system 701 can be operatively coupled to a computer network ("network") 730 with the aid of the communication interface 720. The network 730 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 730 in some embodiments is a telecommunication and/or data network. The network 730 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 730, in some embodiments with the aid of the computer system 701, can implement a peer-to-peer network, which may enable devices coupled to the computer system 701 to behave as a client or a server.
[143] The CPU 705 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 710. The instructions can be directed to the CPU 705, which can subsequently program or otherwise configure the CPU 705 to implement methods of the present disclosure. Examples of operations performed by the CPU 705 can include fetch, decode, execute, and writeback.
[144] The CPU 705 can be part of a circuit, such as an integrated circuit. One or more other components of the system 701 can be included in the circuit. In some embodiments, the circuit is an application specific integrated circuit (ASIC).
[145] The storage unit 715 can store files, such as drivers, libraries, and saved programs. The storage unit 715 can store user data, e.g., user preferences and user programs. The computer system 701 in some embodiments can include one or more additional data storage units that are external to the computer system 701, such as located on a remote server that is in communication with the computer system 701 through an intranet or the Internet.
[146] The computer system 701 can communicate with one or more remote computer systems through the network 730. For instance, the computer system 701 can communicate with a remote computer system of a user. Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 701 via the network 730.
[147] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 701, such as, for example, on the memory 710 or electronic storage unit 715. The machine executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 705. In some embodiments, the code can be retrieved from the storage unit 715 and stored on the memory 710 for ready access by the processor 705. In some embodiments, the electronic storage unit 715 can be precluded, and machine-executable instructions are stored on memory 710.
[148] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be interpreted or compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a precompiled, interpreted, or as-compiled fashion.
[149] Aspects of the systems and methods provided herein, such as the computer system 701, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” and may be in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. "Storage" type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible "storage" media, terms such as computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
[150] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as the main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[151] The computer system 701 can include or be in communication with an electronic display 735 that comprises a user interface (UI) 740 for providing, for example, a nucleic acid sequence, an enriched nucleic acid sample, an expression profile, and an analysis of an expression profile. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
[152] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 705. The algorithm can, for example, probe a plurality of regulatory elements, sequence a nucleic acid sample, enrich a nucleic acid sample, determine an expression profile of a nucleic acid sample, analyze an expression profile of a nucleic acid sample, and archive or disseminate results of analysis of an expression profile.
[153] In some embodiments, the subject matter disclosed herein can include at least one computer program or use of the same. A computer program can be a sequence of instructions, executable in the digital processing device's CPU, GPU, or TPU, written to perform a specified task. Computer-readable instructions can be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and the like, that perform particular tasks or implement particular abstract data types. In light of the disclosure provided herein, those having ordinary skill in the art will recognize that a computer program can be written in various versions of various languages.
[154] The functionality of the computer-readable instructions can be combined or distributed as desired in various environments. In some embodiments, a computer program can include one sequence of instructions. In some embodiments, a computer program can include a plurality of sequences of instructions. In some embodiments, a computer program can be provided from one location. In some embodiments, a computer program can be provided from a plurality of locations. In some embodiments, a computer program can include one or more software modules. In some embodiments, a computer program can include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins or add-ons, or combinations thereof.
[155] In some embodiments, computer processing can be a method of statistics, mathematics, biology, or any combination thereof. In some embodiments, the computer processing method includes a dimension reduction method including, for example, logistic regression, dimension reduction, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machine, tree-based methods, random forest, gradient boost tree, matrix factorization, network clustering, and neural network.
[156] In some embodiments, the computer processing method is a supervised machine learning method including, for example, a regression, support vector machine, tree-based method, and network.
[157] In some embodiments, the computer processing method is an unsupervised machine learning method including, for example, clustering, network, principal component analysis, and matrix factorization.
Databases
[158] In some embodiments, the subject matter disclosed herein can include one or more databases, or use of the same to store patient data, biological data, biological sequences, or reference sequences. Reference sequences can be derived from a database. In view of the disclosure provided herein, those having ordinary skill in the art will recognize that many databases can be suitable for storage and retrieval of the sequence information. In some embodiments, suitable databases can include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. In some embodiments, a database can be internet-based. In some embodiments, a database can be web-based. In some embodiments, a database can be cloud computing-based. In some embodiments, a database can be based on one or more local computer storage devices.
CANCER DETECTION AND DIAGNOSIS
[159] The trained machine learning methods, models, and discriminate classifiers described herein are useful for various medical applications including cancer detection, diagnosis, and treatment responsiveness. As models are trained with individual metadata and analyte-derived features, the applications may be tailored to stratify individuals in a population and guide treatment decisions accordingly.
A. Detection/Diagnosis
[160] Methods and systems provided herein may perform predictive analytics using artificial intelligence-based approaches to analyze acquired data from a subject (patient) to generate an output of the detection and/or diagnosis of a subject having a cancer (e.g., CRC). For example, the application may apply a prediction algorithm to the acquired data to generate the detection of cancer, thereby providing a diagnosis that the subject has the cancer. The prediction algorithm may comprise an artificial intelligence-based predictor, such as a machine learning-based predictor, configured to process the acquired data to generate the diagnosis of the subject having the cancer.
[161] The machine learning predictor may be trained using datasets, e.g., datasets generated by performing multi-analyte assays of biological samples of individuals, from one or more sets of cohorts of patients having cancer as inputs and known diagnosis (e.g., staging and/or tumor fraction) outcomes of the subjects as outputs to the machine learning predictor.
[162] Training datasets (e.g., datasets generated by performing multi-analyte assays of biological samples of individuals) may be generated from, for example, one or more sets of subjects having common characteristics (features) and outcomes (labels). Training datasets may comprise a set of features and labels corresponding to the features relating to diagnosis. Features may comprise characteristics such as, for example, certain ranges or categories of cfDNA assay measurements, such as counts of cfDNA fragments in biological samples obtained from healthy and diseased subjects that overlap or fall within each of a set of bins (genomic windows) of a reference genome. For example, a set of features collected from a given subject at a given time point may collectively serve as a diagnostic signature, which may be indicative of an identified cancer of the subject at the given time point. Characteristics may also include labels indicating the subject's diagnostic outcome, such as for one or more cancers.
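The binned fragment-count feature described above can be sketched as follows. This is a minimal illustration only: it assumes fragments have already been aligned and are available as (chromosome, start, end) tuples, and the fixed bin width, the midpoint assignment rule, and the function name `bin_fragment_counts` are assumptions made for the example rather than details taken from the disclosure.

```python
from collections import Counter

def bin_fragment_counts(fragments, bin_size=1_000_000):
    """Count aligned fragments per fixed-width genomic bin.

    fragments: iterable of (chrom, start, end), 0-based half-open coordinates.
    Each fragment is assigned to the bin containing its midpoint (an assumption;
    overlap-based assignment would be an equally valid choice).
    """
    counts = Counter()
    for chrom, start, end in fragments:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts

# Hypothetical fragments, for illustration only
fragments = [
    ("chr1", 100, 267),
    ("chr1", 999_950, 1_000_120),
    ("chr2", 5_000_000, 5_000_166),
]
features = bin_fragment_counts(fragments)
```

The resulting mapping from (chromosome, bin index) to count can be flattened into a feature vector, with one entry per genomic window.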
[163] Labels may comprise outcomes such as, for example, known diagnosis outcomes (e.g., staging and/or tumor fraction) of the subject. Outcomes may include a characteristic associated with the cancers in the subject. For example, characteristics may be indicative of the subject having one or more cancers.
[164] Training sets (e.g., training datasets) may be selected by random sampling of a set of data corresponding to one or more sets of subjects (e.g., retrospective and/or prospective cohorts of patients having or not having one or more cancers). Alternatively, training sets (e.g., training datasets) may be selected by proportionate sampling of a set of data corresponding to one or more sets of subjects (e.g., retrospective and/or prospective cohorts of patients having or not having one or more cancers). Training sets may be balanced across sets of data corresponding to one or more sets of subjects (e.g., patients from different clinical sites or trials). The machine learning predictor may be trained until certain pre-determined conditions for accuracy or performance are satisfied, such as having minimum desired values corresponding to diagnostic accuracy measures. For example, the diagnostic accuracy measure may correspond to prediction of a diagnosis, staging, or tumor fraction of one or more cancers in the subject.
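As a hedged sketch of proportionate sampling, the following Python draws the same fraction of records from each stratum so that the training set preserves the class proportions of the source data. The record layout, the `label` field, and the function name `proportionate_sample` are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict

def proportionate_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical cohort: 20 subjects with cancer, 80 without
records = [{"id": i, "label": "cancer" if i < 20 else "healthy"} for i in range(100)]
training = proportionate_sample(records, key=lambda r: r["label"], fraction=0.5)
```

Simple random sampling would instead call `rng.sample` once on the whole record list, which does not guarantee that a small stratum is represented in proportion.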
[165] Examples of detection and diagnostic accuracy measures may include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and area under the curve (AUC) of a ROC curve corresponding to the diagnostic accuracy of detecting or predicting the cancer (e.g., colorectal cancer).
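The threshold-based accuracy measures listed above follow directly from the four cells of a confusion matrix. The sketch below uses the standard definitions; the label encoding (1 for cancer, 0 for no cancer) and the example vectors are assumptions made for illustration.

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, NPV, and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }

# Hypothetical labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = diagnostic_metrics(y_true, y_pred)
```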
[166] In another aspect, the present disclosure provides a method for detecting or identifying a cancer in a subject, comprising: (a) providing a biological sample comprising ssDNA molecules of a cell-free DNA sample derived from said subject; (b) methylation sequencing said ssDNA molecules from said subject to generate a plurality of sequencing reads; (c) aligning said sequencing reads to a reference genome; (d) generating a quantitative measure of said sequencing reads at each of a first plurality of genomic regions of said reference genome to generate a first feature set, wherein said first plurality of genomic regions of said reference genome comprises at least about 10 distinct regions, each of said at least about 10 distinct regions; and (e) applying a trained algorithm to said first feature set to generate a likelihood of said subject having said cancer.
[167] For example, such a pre-determined condition may be that the sensitivity of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
[168] As another example, such a pre-determined condition may be that the specificity of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
[169] As another example, such a pre-determined condition may be that the positive predictive value (PPV) of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
[170] As another example, such a pre-determined condition may be that the negative predictive value (NPV) of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
[171] As another example, such a pre-determined condition may be that the AUC of a ROC curve of predicting the cancer (e.g., colorectal cancer, breast cancer, pancreatic cancer, or liver cancer) comprises a value of at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
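A pre-determined stopping condition of the kind described in the preceding paragraphs can be expressed as a set of minimum values checked against measured performance. The sketch below computes the AUC with the rank-based (Mann-Whitney) formulation and then tests it against a floor; the function names, the example scores, and the chosen minimums are assumptions for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative case."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def meets_conditions(measures, minimums):
    """True when every measured value reaches its pre-determined minimum."""
    return all(measures[name] >= floor for name, floor in minimums.items())

# Hypothetical labels and classifier scores
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)
passes_085 = meets_conditions({"auc": auc}, {"auc": 0.85})
passes_090 = meets_conditions({"auc": auc}, {"auc": 0.90})
```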
[172] In some examples of any of the foregoing aspects, a method further comprises monitoring a progression of a disease in the subject, wherein the monitoring is based at least in part on the genetic sequence feature. In some examples, the disease is a cancer.
[173] In some examples of any of the foregoing aspects, a method further comprises determining the tissue-of-origin of a cancer in the subject, wherein the determining is based at least in part on the genetic sequence feature.
[174] In some examples of any of the foregoing aspects, a method further comprises estimating a tumor burden in the subject, wherein the estimating is based at least in part on the genetic sequence feature.
Treatment Responsiveness
[175] The predictive classifiers, systems and methods described herein are useful for classifying populations of individuals for a number of clinical applications (e.g., based on performing multi-analyte assays of biological samples of individuals). Examples of such clinical applications include detecting early-stage cancer, diagnosing cancer, classifying cancer to a particular stage of disease, or determining responsiveness or resistance to a therapeutic agent for treating cancer.
[176] The methods and systems described herein are applicable to various cancer types, similar to grade and stage, and as such are not limited to a single cancer disease type. Therefore, combinations of analytes and assays may be used in the present systems and methods to predict responsiveness to cancer therapeutics across different cancer types in different tissues and to classify individuals based on treatment responsiveness. In one example, the classifiers described herein stratify a group of individuals into treatment responders and non-responders.
[177] The present disclosure also provides a method for determining a drug target of a condition or disease of interest (e.g., genes that are relevant/important for a particular class), comprising assessing a sample obtained from an individual for the level of gene expression for at least one gene; using a neighborhood analysis routine to determine genes that are relevant for classification of the sample, thereby ascertaining one or more drug targets relevant to the classification.
[178] The present disclosure also provides a method for determining the efficacy of a drug designed to treat a disease class, comprising obtaining a sample from an individual having the disease class; subjecting the sample to the drug; assessing the drug-exposed sample for the level of gene expression for at least one gene; and using a computer model built with a weighted voting scheme to classify the drug-exposed sample into a class of the disease as a function of relative gene expression level of the sample with respect to that of the model.
[179] The present disclosure also provides a method for determining the efficacy of a drug designed to treat a disease class, wherein an individual has been subjected to the drug, comprising obtaining a sample from the individual subjected to the drug; assessing the sample for the level of gene expression for at least one gene; and using a model built with a weighted voting scheme to classify the sample into a class of the disease including evaluating the gene expression level of the sample as compared to gene expression level of the model.
[180] Yet another application is a method of determining whether an individual belongs to a phenotypic class (e.g., intelligence, response to a treatment, length of life, likelihood of viral infection or obesity) that comprises obtaining a sample from the individual; assessing the sample for the level of gene expression for at least one gene; and using a model built with a weighted voting scheme, classifying the sample into a class of the disease including evaluating the gene expression level of the sample as compared to gene expression level of the model.
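The disclosure does not specify how the weighted voting scheme is computed. As one possible reading, the sketch below uses a commonly described signal-to-noise formulation in which each gene votes for a class in proportion to how far the sample's expression lies from the midpoint between the two class means. The weight formula, the class labels "A" and "B", and the expression values are all assumptions for illustration.

```python
from statistics import mean, stdev

def train_weighted_voting(class_a, class_b):
    """Build per-gene (weight, boundary) pairs from two groups of training samples.

    class_a, class_b: lists of samples; each sample is a list of expression
    values, one per gene, in the same gene order. Needs >= 2 samples per class.
    """
    model = []
    for gene in range(len(class_a[0])):
        a = [sample[gene] for sample in class_a]
        b = [sample[gene] for sample in class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))  # signal-to-noise
        boundary = (mean(a) + mean(b)) / 2                    # midpoint of class means
        model.append((weight, boundary))
    return model

def classify(model, sample):
    """Sum the weighted votes; a positive total favors class A."""
    total = sum(w * (x - b) for (w, b), x in zip(model, sample))
    return "A" if total > 0 else "B"

# Hypothetical two-gene expression data
class_a = [[5.0, 1.0], [6.0, 2.0], [7.0, 1.5]]
class_b = [[1.0, 4.0], [2.0, 5.0], [1.5, 6.0]]
model = train_weighted_voting(class_a, class_b)
call_1 = classify(model, [6.5, 1.0])
call_2 = classify(model, [1.0, 5.5])
```

A gene with zero variance in both classes would make the weight undefined, so a practical implementation would need to filter or regularize such genes.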
[181] Biomarkers may be useful for predicting prognosis of patients with colon cancer. The ability to classify patients as high-risk (poor prognosis) or low-risk (favorable prognosis) may enable selection of appropriate therapies for these patients. For example, high-risk patients are likely to benefit from aggressive therapy, whereas therapy may have no significant advantage for low-risk patients.
[182] Predictive biomarkers can guide treatment decisions by identifying subsets of patients who may be "exceptional responders" to specific cancer therapies, or individuals who may benefit from alternative treatment modalities.
[183] In one aspect, the systems and methods described herein that relate to classifying a population based on treatment responsiveness refer to cancers that are treated with chemotherapeutic agents of the classes DNA damaging agents, DNA repair target therapies, inhibitors of DNA damage signaling, inhibitors of DNA damage induced cell cycle arrest, and inhibition of processes indirectly leading to DNA damage, but not limited to these classes. Each of these chemotherapeutic agents may be considered a "DNA-damage therapeutic agent".
[184] The patient's analyte data are classified into high-risk and low-risk patient groups, such as patients with a high risk or low risk of clinical relapse, and the results may be used to determine a course of treatment. For example, a patient determined to be a high-risk patient may be treated with adjuvant chemotherapy after surgery. For a patient deemed to be a low-risk patient, adjuvant chemotherapy may be withheld after surgery. Accordingly, the present disclosure provides, in certain aspects, a method for preparing a gene expression profile of a colon cancer tumor that is indicative of risk of recurrence.
[185] In various examples, the classifiers described herein stratify a population of individuals between responders and non-responders to treatment.
[186] In various examples, the treatment is selected from alkylating agents, plant alkaloids, antitumor antibiotics, antimetabolites, topoisomerase inhibitors, retinoids, checkpoint inhibitor therapy, and VEGF inhibitors.
[187] Examples of treatments for which a population may be stratified into responders and non-responders include but are not limited to: chemotherapeutic agents including sorafenib, regorafenib, imatinib, eribulin, gemcitabine, capecitabine, pazopanib, lapatinib, dabrafenib, sunitinib, crizotinib, everolimus, torisirolimus, sirolimus, axitinib, gefitinib, anastrozole, bicalutamide, fulvestrant, raltitrexed, pemetrexed, goserelin acetate, erlotinib, vemurafenib, vismodegib, tamoxifen citrate, paclitaxel, docetaxel, cabazitaxel, oxaliplatin, ziv-aflibercept, bevacizumab, trastuzumab, pertuzumab, panitumumab, taxane, bleomycin, melphalan, plumbagin, camptosar, mitomycin-C, mitoxantrone, poly(styrene-maleic acid)-conjugated neocarzinostatin (SMANCS), doxorubicin, pegylated doxorubicin, FOLFORI, 5-fluorouracil, temozolomide, pasireotide, tegafur, gimeracil, oteracil, itraconazole, bortezomib, lenalidomide, irinotecan, epirubicin, romidepsin, resminostat, tasquinimod, refametinib, lapatinib, Tyverb, Arenegyr, NGR-TNF, pasireotide, Signifor, ticilimumab, tremelimumab, lansoprazole, PrevOnco, ABT-869, linifanib, vorolanib, tivantinib, Tarceva, erlotinib, Stivarga, regorafenib, fluoro-sorafenib, brivanib, liposomal doxorubicin, lenvatinib, ramucirumab, peretinoin, muparfostat, Teysuno, tegafur, gimeracil, oteracil, and orantinib; and antibody therapies, including but not limited to, alemtuzumab, atezolizumab, ipilimumab, nivolumab, ofatumumab, pembrolizumab, or rituximab.
[188] In other examples, a population may be stratified into responders and non-responders for checkpoint inhibitor therapies such as compounds that bind to PD-1 or CTLA4.
[189] In other examples, a population may be stratified into responders and non-responders for anti-VEGF therapies that bind to VEGF pathway targets.
INDICATIONS
[190] In some examples, a biological condition can include a disease. In some examples, a biological condition can be a stage of a disease. In some examples, a biological condition can be a gradual change of a biological state. In some examples, a biological condition can be a treatment effect. In some examples, a biological condition can be a drug effect. In some examples, a biological condition can be a surgical effect. In some examples, a biological condition can be a biological state after a lifestyle modification. Non-limiting examples of lifestyle modifications include a diet change, a smoking change, and a sleeping pattern change. In some examples, a biological condition is unknown. The analysis described herein can include machine learning to infer an unknown biological condition or to interpret the unknown biological condition.
[191] In one example, the present systems and methods are particularly useful for applications related to colon cancer, that is, cancer that forms in the tissues of the colon (the longest part of the large intestine). Most colon cancers are adenocarcinomas (cancers that begin in cells that line internal organs and have gland-like properties). Cancer progression is characterized by stages, or the extent of cancer in the body. Staging is usually based on the size of the tumor, whether lymph nodes comprise cancer, and whether the cancer has spread from the original site to other parts of the body. Stages of colon cancer include stage I, stage II, stage III, and stage IV. Unless otherwise specified, the term "colon cancer" refers to colon cancer at Stage 0, Stage I, Stage II (including Stage IIA or IIB), Stage III (including Stage IIIA, IIIB, or IIIC), or Stage IV. In some examples herein, the colon cancer is from any stage. In one example, the colon cancer is a stage I colorectal cancer. In one example, the colon cancer is a stage II colorectal cancer. In one example, the colon cancer is a stage III colorectal cancer. In one example, the colon cancer is a stage IV colorectal cancer.
[192] Conditions that can be inferred by the disclosed methods include, for example, cancer, gut-associated diseases, immune-mediated inflammatory diseases, neurological diseases, kidney diseases, prenatal diseases, and metabolic diseases.
[193] In some examples, a method of the present disclosure can be used to diagnose a cancer. Non-limiting examples of cancers include adenoma (adenomatous polyps), sessile serrated adenoma (SSA), advanced adenoma, colorectal dysplasia, colorectal adenoma, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal carcinoid tumors, gastrointestinal stromal tumors (GISTs), lymphomas, and sarcomas.
[194] Non-limiting examples of cancers that can be inferred by the disclosed methods and systems include acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), adrenocortical carcinoma, Kaposi Sarcoma, anal cancer, basal cell carcinoma, bile duct cancer, bladder cancer, bone cancer, osteosarcoma, malignant fibrous histiocytoma, brain stem glioma, brain cancer, craniopharyngioma, ependymoblastoma, ependymoma, medulloblastoma, medulloepithelioma, pineal parenchymal tumor, breast cancer, bronchial tumor, Burkitt lymphoma, Non-Hodgkin lymphoma, carcinoid tumor, cervical cancer, chordoma, chronic lymphocytic leukemia (CLL), chronic myelogenous leukemia (CML), colon cancer, colorectal cancer, cutaneous T-cell lymphoma, ductal carcinoma in situ, endometrial cancer, esophageal cancer, Ewing Sarcoma, eye cancer, intraocular melanoma, retinoblastoma, fibrous histiocytoma, gallbladder cancer, gastric cancer, glioma, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin lymphoma, hypopharyngeal cancer, kidney cancer, laryngeal cancer, lip cancer, oral cavity cancer, lung cancer, non-small cell carcinoma, small cell carcinoma, melanoma, mouth cancer, myelodysplastic syndromes, multiple myeloma, medulloblastoma, nasal cavity cancer, paranasal sinus cancer, neuroblastoma, nasopharyngeal cancer, oral cancer, oropharyngeal cancer, osteosarcoma, ovarian cancer, pancreatic cancer, papillomatosis, paraganglioma, parathyroid cancer, penile cancer, pharyngeal cancer, pituitary tumor, plasma cell neoplasm, prostate cancer, rectal cancer, renal cell cancer, rhabdomyosarcoma, salivary gland cancer, Sezary syndrome, skin cancer, small intestine cancer, soft tissue sarcoma, squamous cell carcinoma, testicular cancer, throat cancer, thymoma, thyroid cancer, urethral cancer, uterine cancer, uterine sarcoma, vaginal cancer, vulvar cancer, Waldenstrom macroglobulinemia, and Wilms Tumor.
[195] Non-limiting examples of gut-associated diseases that can be inferred by the disclosed methods and systems include Crohn's disease, colitis, ulcerative colitis (UC), inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), and celiac disease. In some examples, the disease is inflammatory bowel disease, colitis, ulcerative colitis, Crohn's disease, microscopic colitis, collagenous colitis, lymphocytic colitis, diversion colitis, Behcet's disease, or indeterminate colitis.
[196] Non-limiting examples of immune-mediated inflammatory diseases that can be inferred by the disclosed methods and systems include psoriasis, sarcoidosis, rheumatoid arthritis, asthma, rhinitis (hay fever), food allergy, eczema, lupus, multiple sclerosis, fibromyalgia, type 1 diabetes, and Lyme disease. Non-limiting examples of neurological diseases that can be inferred by the disclosed methods and systems include Parkinson's disease, Huntington's disease, multiple sclerosis, Alzheimer's disease, stroke, epilepsy, neurodegeneration, and neuropathy.
[197] Non-limiting examples of kidney diseases that can be inferred by the disclosed methods and systems include interstitial nephritis, acute kidney failure, and nephropathy. Non-limiting examples of prenatal diseases that can be inferred by the disclosed methods and systems include Down syndrome, aneuploidy, spina bifida, trisomy, Edwards syndrome, teratomas, sacrococcygeal teratoma (SCT), ventriculomegaly, renal agenesis, cystic fibrosis, and hydrops fetalis. Non-limiting examples of metabolic diseases that can be inferred by the disclosed methods and systems include cystinosis, Fabry disease, Gaucher disease, Lesch-Nyhan syndrome, Niemann-Pick disease, phenylketonuria, Pompe disease, and Tay-Sachs disease.
[198] The specific details of particular examples may be combined in any suitable manner without departing from the spirit and scope of disclosed examples of the inventive concepts. However, other examples of the inventive concepts may be directed to specific examples relating to each individual aspect, or specific combinations of these individual aspects. All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes.
EXAMPLES
EXAMPLE 1: DNA Extraction and Library Preparation for Methylation Conversion with Conversion-Resistant Sequencing Adapters.
1. Background
[199] Effective oxidation-dependent protection of methylated cytosine nucleotides against deamination in enzymatic methyl-sequencing (EM-seq) is a key aspect of an enzyme methylation workflow. For example, an incomplete and/or inefficient TET2-mediated oxidation reaction can result in the conversion of a methylated cytosine to a thymine or a uracil species upon APOBEC deamination due to the methylated cytosine not being properly protected via TET-mediated oxidation.
[200] EM-seq workflows may include one or more methylated cytosine nucleotides in adapter molecules to facilitate downstream library amplification and sequencing after deamination. However, if aberrant conversion of methylated cytosines occurs in adapter regions due to incomplete TET-mediated oxidation (and the resultant lack of protected methylated cytosine residues), these adapter regions, which act as template strands for library amplification, can comprise mismatches in the primer binding sites thereby leading to low library yields and undesirable PCR products (Figure 1).
[201] Poor protection of methylated cytosines in adapters can be mitigated by implementing alternative strategies to overcome inefficient TET-oxidation. This can be achieved by altering the adapter design using one or a combination of the following strategies: (i) adding deamination-resistant chemical modifications to cytosine nucleotides (Figures 2A-2C), (ii) annealing small blocking oligos to the ssDNA overhangs on adapters to facilitate a more robust oxidation reaction (Figure 2D), (iii) reducing the number of cytosines in the DNA sequence (Figure 2E), or (iv) leaving cytosines unmethylated to allow for full conversion (Figure 2E). To determine the ability of these strategies to overcome the challenge of inefficient TET-mediated oxidation, the utility of adapters comprising modified cytosines was examined and alternative approaches were implemented to facilitate high PCR yields after deamination independent of TET2 activity (Figure 3). These modified cytosines include propynyl-C, pyrrolo-C, and 5mC.
1. Methods
[202] The conversion-resistant adapters described herein may be used during library preparation, which is a process that takes a DNA molecule and adds the components needed for sequencing. More specifically, contrived cfDNA comprising ~35% methylated DNA was used as target insert molecules for conversion-resistant adapter ligation. cfDNA was prepared for ligation by first performing end repair and A-tailing (ERAT), which digests 3’ overhangs, fills in 5’ overhangs, and adds a single adenosine (A) overhang to the 3’ end. The resulting product was then ligated to Y-adapters (conversion-resistant) appropriate for the indicated method in order to facilitate library PCR and sequencing.
[203] Y-adapters, also known as stubby adapters, can comprise a partial dsDNA duplex comprising a T overhang at one end of the molecule and ssDNA overhangs on the other end of the molecule, resulting in a Y shape. Some of the adapter sequence serves as a template for downstream PCR amplification. Depending on the condition tested, cytosines were either left unmodified or replaced with deamination-resistant modified cytosines (e.g., cytosines bearing a methyl, pyrrolo, or propynyl group).
[204] After the conversion-resistant adapters were ligated onto the cfDNA inserts, the molecules underwent EM-conversion, which comprises two operations: oxidation and deamination. During oxidation, methylated cytosines (5mC) may be oxidized by the TET2 (ten-eleven translocation-2) enzyme. This converts methylated cytosines to further states of oxidation (to hydroxymethylcytosine [5hmC], then further to formylcytosine [5fC] and carboxylcytosine [5caC]). After oxidation, deamination may then be carried out using the enzyme APOBEC, which can convert cytosine, 5mC, and 5hmC to uracil, thymine, and 5-hydroxymethyluracil (uracil species), respectively. The cytosines that were fully oxidized (for example, to 5fC and 5caC) will not be converted. If the adapters comprise cytosines with modifications such as pyrrolo-C or propynyl-C, those modifications are resistant to deamination and will also not be converted. One difference is that these modifications do not require protection during oxidation.
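The oxidation and deamination outcomes described above can be summarized as a small lookup. The sketch below is a deliberate simplification: it treats TET2 oxidation as all-or-nothing (controlled by a hypothetical `tet_efficient` flag), and it reports the base that would be read after PCR, where uracil, thymine, and 5-hydroxymethyluracil are all read as T. The state names and function names are assumptions for illustration.

```python
# Cytosine states that survive APOBEC deamination and are read as C
PROTECTED = {"5fC", "5caC", "pyrrolo-C", "propynyl-C"}

# Deamination products of the states that APOBEC can convert
DEAMINATED_TO = {"C": "U", "5mC": "T", "5hmC": "5hmU"}

def oxidize(state, tet_efficient):
    """Simplified TET2 step: efficient oxidation carries 5mC/5hmC through to 5caC."""
    if tet_efficient and state in {"5mC", "5hmC"}:
        return "5caC"
    return state  # unmodified C and synthetic modifications are unchanged

def sequenced_base(state, tet_efficient):
    """Base read at this position after oxidation, deamination, and PCR."""
    state = oxidize(state, tet_efficient)
    if state in PROTECTED:
        return "C"
    if state in DEAMINATED_TO:
        return "T"  # U, T, and 5hmU all pair with A during amplification
    raise ValueError(f"unknown cytosine state: {state}")

outcomes = {
    (state, tet): sequenced_base(state, tet)
    for state in ("C", "5mC", "pyrrolo-C", "propynyl-C")
    for tet in (True, False)
}
```

In this model only 5mC changes its read-out when oxidation fails, which is the dependency that the pyrrolo-C and propynyl-C adapters are described as avoiding.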
[205] After EM-conversion, the adapter-ligated cfDNA molecules were amplified by PCR using primers matching the expected final sequence of adapters after EM-conversion. Some adapter sequences required for PCR amplification rely on cognate cytosines to be resistant to enzymatic methyl conversion (EM-conversion), while in other modalities primers rely on matching the fully converted sequence. Amplification primers also harbor sequences to distinguish between different library prep reactions (index sequences) as well as those required for NGS sequencing.
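To illustrate why unintended conversion in the adapter matters for amplification, the sketch below converts a template in silico and counts the positions at which it departs from the sequence a primer was designed against. The adapter sequence is hypothetical, both sequences are written on the same strand for simplicity (no reverse complement is taken), and the function names are assumptions.

```python
def convert_template(seq, protected_positions):
    """Sequence read after conversion: every unprotected C is read as T."""
    return "".join(
        "T" if base == "C" and i not in protected_positions else base
        for i, base in enumerate(seq)
    )

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

adapter = "ACGTCCGATC"  # hypothetical primer-binding region, not a real adapter
all_cytosines = {i for i, base in enumerate(adapter) if base == "C"}

expected = convert_template(adapter, all_cytosines)  # every C protected
partial = convert_template(adapter, {1, 4})          # two of four C protected
failed = convert_template(adapter, set())            # no protection at all

mm_partial = mismatches(expected, partial)
mm_failed = mismatches(expected, failed)
```

A primer designed against the fully converted sequence would instead use `failed` as its expected template, which corresponds to the conversion-dependent modality mentioned above.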
2. Results
[206] To interrogate the utility of conversion-resistant adapter design strategies, an experiment was performed to assay the ability of a given approach to overcome incomplete TET-mediated oxidation (Figure 3). The adapters (including conversion-resistant adapters disclosed herein) were ligated with 10 nanograms (ng) (Figure 3 top panels) or 3.5 ng (Figure 3 bottom panels) of contrived cfDNA. The ligation product was then treated with TET2 under optimal reducing conditions using DTT (Figure 3 SOP; left panels), or suboptimal conditions by leaving DTT out of the reaction (Figure 3 DTT; right panels). Unprotected cytosines were deaminated via APOBEC treatment, then libraries were PCR-amplified using an appropriate primer set followed by quantification using a fluorometric assay.
[207] All protection strategies performed similarly when using optimal conditions, e.g., high input mass and reducing oxidation reaction conditions (Figure 3; top left). However, when the oxidation reaction was handicapped by using either low input mass (e.g., 3.5 ng) or the removal of DTT, the conversion-resistant adapters harboring deamination-resistant modified cytosines facilitated increased yields after PCR amplification compared to methylated cytosine adapters. These data indicate that utilizing conversion-resistant adapters comprising deamination-resistant modified cytosines (in particular, pyrrolo-C and propynyl-C) facilitates robust protection against APOBEC-mediated conversion independent of efficient TET2 activity, which results in increased library yields.
[208] It has been shown that the TET2 enzyme favors dsDNA templates. To evaluate if this may be used to improve library yield, small blocking oligos were annealed to the ssDNA overhang of methylated Y adapters to generate a dsDNA template for TET2 during oxidation. However, this did not improve library yields after deamination (Methylated-wBlockers) (Figure 3). Furthermore, adapters comprising a minimal number of unmethylated cytosines (Lo-Cy) (amplified with conversion-dependent PCR primers) were also evaluated but did not result in a marked improvement in library yields under any condition (Figure 3).
[209] Figure 4 provides a graph illustrating a comparison of dsDNA library concentrations (library yields) between differing oxidation reactions (TET2-mediated reactions) comprising: (i) 1X (final reaction concentration) pre-plated thawed Fe2+ (SOP) or (ii) 6X (final reaction concentration) freshly-diluted Fe2+ with 10 ng of input DNA.
[210] Figure 5 provides graphs illustrating a comparison of CpH protection from EM-seq reactions under differing oxidation conditions (e.g., final reaction Fe2+ concentrations): (i) 1X concentrated pre-plated thawed Fe2+ (SOP); (ii) 1X concentrated freshly-diluted Fe2+; (iii) 3X concentrated freshly-diluted Fe2+; (iv) 6X concentrated freshly-diluted Fe2+; (v) 9X concentrated freshly-diluted Fe2+; (vi) 12X concentrated freshly-diluted Fe2+; (vii) 18X concentrated freshly-diluted Fe2+; (viii) 36X concentrated freshly-diluted Fe2+; and (ix) 216X concentrated freshly-diluted Fe2+.
[211] Figure 6 provides graphs illustrating a comparison of CpH protection rates from EM-seq reactions under differing reaction conditions (e.g., final Fe2+ concentrations and master mix pH levels) with different reagent lots: (i) 1X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Left graph); (ii) 3X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Middle graph); (iii) 6X concentrated freshly-diluted Fe2+ at an average pH of 7.5-8.0 (Right graph).
EXAMPLE 2: DNA Extraction and Library Preparation for Methylation Conversion with Conversion-Resistant Sequencing Adapters.
Methods
1. cfDNA extraction
[212] Up to 4 mL of plasma from a single BCT is used as input for cfDNA extraction. cfDNA extraction is performed by incubating the plasma in the presence of carboxyl-coated magnetic beads. Following the incubation, the beads are washed and cfDNA is eluted. The eluted cfDNA is transferred to a plate and frozen at -20°C.
2. cfDNA library preparation and enzymatic conversion
[213] The cfDNA is thawed and library preparation is performed using the cfDNA from operation 1 (cfDNA extraction). End preparation buffer and end preparation enzyme mix are added to each sample, and the samples are incubated at 20°C for 30 minutes and 65°C for 30 minutes. Following this, the end-prepped cfDNA is incubated with ligation buffer, ligase, and adapters for 15 min at 20°C. The adapters are Y-shaped adapters synthesized with 5-methylcytosines, pyrrolo-C and/or propynyl-C in place of cytosines to enable protection of the adapter sequences during conversion operations. Following end repair, A-tailing, and ligation, a SPRI bead cleanup is performed. Enzymatic conversion is performed on the eluted DNA by treating the eluted DNA for 1 hour at between 37°C and 43°C with TET2 enzyme and β-glucosyltransferase. Additionally, ferrous iron is added to the reaction at twice the recommended concentration to stabilize TET2 activity. The TET2 reaction is stopped via addition of an oxidation stop buffer and a bead cleanup is performed. At this point, eluted DNA is frozen at -20°C.
[214] The oxidation-treated cfDNA library is thawed, and the samples are heat denatured and treated for 3 hours at 37°C with APOBEC to deaminate unprotected cytosines, thereby converting them to uracils. This is followed by a proteinase K treatment to denature the APOBEC and a 10-minute treatment at 65°C to inactivate the proteinase. The samples are amplified via a universal PCR reaction with indexing primers. After a second proteinase K treatment, a bead purification cleanup is performed. The library DNA is eluted into elution buffer and is stored at -20°C.
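The effect of adapter protection on the final reads can be sketched in silico. The following is a minimal model of the read-level outcome only, not of the enzymology. It assumes a hypothetical notation in which an uppercase `C` is an unprotected cytosine and a lowercase `c` is a protected cytosine (for example, a methylated cytosine in the sample, or a 5-methylcytosine, pyrrolo-C, or propynyl-C in the adapter). The adapter and insert sequences are invented for the example.

```python
def convert(seq: str) -> str:
    """Return the sequence as it would be read after conversion and PCR.

    Uppercase 'C' (unprotected) is deaminated to uracil and read as 'T'.
    Lowercase 'c' (protected, hypothetical notation) is read as 'C'.
    All other bases are unchanged.
    """
    out = []
    for base in seq:
        if base == "C":
            out.append("T")
        elif base == "c":
            out.append("C")
        else:
            out.append(base)
    return "".join(out)


# Hypothetical sequences, chosen only to illustrate the idea.
ADAPTER_UNMODIFIED = "ACGTCCGATC"
ADAPTER_PROTECTED = ADAPTER_UNMODIFIED.replace("C", "c")
INSERT = "GAcGTCCAcGT"  # two methylated CpG cytosines, two unmethylated cytosines

unprotected_read = convert(ADAPTER_UNMODIFIED + INSERT)
protected_read = convert(ADAPTER_PROTECTED + INSERT)

print(unprotected_read)  # adapter portion no longer matches the original adapter
print(protected_read)    # adapter portion is preserved
```

In this toy model, the protected adapter keeps its original sequence after conversion, so primers designed against the unconverted adapter still match, while the insert retains cytosines only at the protected positions.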
3. Library target capture and sequencing
[215] After the library plates comprising the converted cfDNA library samples from operation 2 (cfDNA library preparation and enzymatic conversion) are thawed, the cfDNA libraries are used as input for target capture. The samples are pooled and concentrated. These library pools are subjected to a bead cleanup, then are eluted in an elution buffer comprising blocker solution and blockers. Target enrichment is then performed with biotinylated probes designed to target both fully converted and fully unconverted methyl-cytosines (Twist Fast Hybridization Target Capture kit) following the manufacturer’s instructions: the biotinylated panel and hybridization buffer are added to the library pools, which are then incubated at 60°C for 4 hours. The probe-library pool mixtures are bound to streptavidin beads, then washed. After washing, the captured library pools and streptavidin beads are resuspended in elution buffer. Following hybridization, the captured DNA fragments are amplified by universal PCR.
[216] Following this PCR reaction, a cleanup is performed on the amplified DNA, which is then quantified. The target capture libraries are then loaded on an Illumina NovaSeq sequencer and sequenced using 2x150 runs.
4. Targeted methylation classification
[217] Raw data files are used for alignment and methylation calling to permit targeted methylation analysis for pre-identified regions of the genome.
[218] FASTQ files are mapped to a reference genome, and methylation scores are calculated for disease classification. Featurized data comprising a set of CpG sites associated with healthy, disease, disease state, and treatment responsiveness is processed using machine learning models to identify classifiers that stratify individuals in a population based upon hypermethylation model scores.
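As a rough illustration of how per-site methylation values might be derived from aligned, converted reads, the sketch below counts unconverted (`C`) and converted (`T`) base calls at each CpG site and reports the methylated fraction. The function names, the input format, the site labels, and the simple threshold-based score are all assumptions made for this example. They are not the disclosed scoring model or classifier.

```python
from collections import defaultdict


def cpg_methylation(calls):
    """Compute the methylated fraction per CpG site.

    calls: iterable of (site, base) pairs, where base is 'C' (read as
    methylated, unconverted) or 'T' (read as unmethylated, converted).
    Other bases are ignored as uninformative.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [methylated, total]
    for site, base in calls:
        if base not in ("C", "T"):
            continue
        counts[site][1] += 1
        if base == "C":
            counts[site][0] += 1
    return {site: m / n for site, (m, n) in counts.items()}


def hypermethylation_score(fractions, threshold=0.8):
    """Hypothetical score: share of covered sites at or above the threshold."""
    if not fractions:
        return 0.0
    return sum(f >= threshold for f in fractions.values()) / len(fractions)


# Invented base calls at three CpG sites.
calls = [
    ("chr1:1000", "C"), ("chr1:1000", "C"), ("chr1:1000", "T"), ("chr1:1000", "C"),
    ("chr1:1050", "T"), ("chr1:1050", "T"),
    ("chr2:500", "C"), ("chr2:500", "C"),
]
fractions = cpg_methylation(calls)
score = hypermethylation_score(fractions, threshold=0.75)
print(fractions, score)
```

Per-site fractions of this kind are one possible form of featurized input. In practice, a trained model would replace the fixed threshold used here.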
[219] While certain examples of methods and systems have been shown and disclosed herein, one of skill in the art will realize that these are provided by way of example only and not intended to be limiting within the specification. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the scope disclosed herein. Furthermore, it shall be understood that all aspects of the disclosed methods and systems are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables and the description is intended to include such alternatives, modifications, variations or equivalents.
Claims
1. A method of preparing a sequencing library for methylation sequencing of one or more nucleic acid molecules of a biological sample or a derivative thereof, comprising:
(a) obtaining a nucleic acid composition, wherein the nucleic acid composition comprises a plurality of double-stranded nucleic acid molecules obtained or derived from the biological sample;
(b) ligating a conversion-resistant nucleic acid adapter to a double-stranded nucleic acid molecule of the plurality of double-stranded nucleic acid molecules to generate a conversion-resistant adapter-ligated nucleic acid molecule, wherein the nucleic acid adapter comprises modified cytosines that are resistant to base conversion by a methylation conversion method; and
(c) subjecting the conversion-resistant adapter-ligated nucleic acid molecule to conditions sufficient to convert unmethylated cytosines to uracils using the methylation conversion method, thereby generating a converted conversion-resistant adapter-ligated nucleic acid molecule.
2. The method of claim 1, wherein the ligating in (b) further comprises treating with a deoxyribonucleic acid (DNA) ligase.
3. The method of claim 1 or 2, wherein the conversion-resistant nucleic acid adapter comprises one or more modified cytosines that are deamination-resistant.
4. The method of claim 1 or 2, wherein the conversion-resistant nucleic acid adapter comprises one or more modified cytosine bases selected from a group consisting of propynyl-C, pyrrolo-C, and 5-methylcytosine (5mC).
5. The method of claim 3, wherein the one or more modified cytosines that are deamination-resistant are propynyl-C residues.
6. The method of claim 3, wherein the one or more modified cytosines that are deamination-resistant are pyrrolo-C residues.
7. The method of any one of claims 1 to 6, wherein the conversion-resistant nucleic acid adapter comprises one or more 5mC residues.
8. The method of any one of claims 1 to 6, wherein the conversion-resistant nucleic acid adapter does not comprise methylated cytosine bases or unmethylated cytosine bases.
9. The method of any one of claims 1 to 8, further comprising amplifying the converted conversion-resistant adapter-ligated nucleic acid molecule.
10. The method of claim 9, wherein the amplifying comprises polymerase chain reaction (PCR).
11. The method of any one of claims 1 to 10, further comprising determining a nucleic acid sequence of the amplified adapter-ligated nucleic acid molecule or derivative thereof.
12. The method of any one of claims 1 to 11, further comprising sequencing the amplified adapter-ligated nucleic acid molecule or derivative thereof to generate sequencing data.
13. The method of claim 12, further comprising analyzing the sequencing data to generate a methylation profile of the nucleic acid molecule of the biological sample or derivative thereof.
14. The method of claim 13, wherein the analyzing further comprises comparing the sequencing data to a reference sequence.
15. The method of any one of claims 1 to 14, wherein the nucleic acid molecule comprises deoxyribonucleic acid (DNA).
16. The method of claim 15, wherein the DNA comprises cell-free DNA.
17. The method of any one of claims 1 to 16, wherein the biological sample is a cell-free biological sample.
18. The method of claim 17, wherein the cell-free biological sample is a plasma sample.
19. The method of any one of claims 1 to 18, wherein the methylation conversion method comprises a minimally-destructive conversion method treatment with one or more enzymes.
20. The method of claim 19, wherein the minimally-destructive conversion method comprises treatment with a ten eleven translocation (TET) enzyme.
21. The method of claim 20, wherein the treatment with TET enzyme comprises providing reaction conditions comprising 1X-18X freshly-diluted Fe2+.
22. The method of claim 19, wherein the minimally-destructive conversion method does not comprise treatment with bisulfite.
23. A method for performing methylation sequencing of a cell-free deoxyribonucleic acid (cfDNA) sample from a subject, comprising:
(a) extracting the cfDNA sample from the subject, wherein said cfDNA sample comprises double-stranded nucleic acid (dsDNA) molecules;
(b) ligating conversion-resistant adapters comprising one or more deamination-resistant modified cytosines to the dsDNA molecules, wherein the dsDNA molecules comprise one or more unmethylated cytosine residues;
(c) enzymatically converting the one or more unmethylated cytosine residues to uracils in the dsDNA molecules ligated to the conversion-resistant adapters;
(d) amplifying the converted dsDNA molecules comprising the conversion-resistant adapters of (c) to produce amplified dsDNA molecules; and
(e) determining a nucleic acid sequence of the amplified dsDNA molecules comprising the conversion-resistant adapters of (d).
24. The method of claim 23, further comprising:
(f) determining a methylation profile of the cfDNA sample from the subject based at least in part on (e);
(g) classifying by a trained machine learning algorithm the methylation profile of the cfDNA sample as indicative of a presence of a cancer in the subject; and
(h) outputting a report that identifies the cfDNA sample as negative for the cancer if the trained machine learning algorithm classifies the cfDNA sample as negative for the cancer at a specified confidence level.
25. The method of claim 24, wherein the determining the methylation profile comprises performing a hypermethylation analysis.
26. The method of claim 24 or 25, wherein the cancer comprises two or more of colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
27. The method of claim 24 or 25, wherein the cancer comprises colorectal cancer, breast cancer, pancreatic cancer, liver cancer, or lung cancer.
28. The method of claim 27, wherein the cancer comprises the colorectal cancer.
29. The method of claim 27, wherein the cancer comprises the lung cancer.
30. The method of claim 27, wherein the cancer comprises the liver cancer.
31. The method of claim 27, wherein the cancer comprises the pancreatic cancer.
32. The method of claim 23, further comprising:
(f) determining a baseline methylation profile of the cfDNA sample of the subject at a baseline methylation state based at least in part on (e);
(g) determining a test methylation profile of a biological sample of the subject at one or more time points following the baseline methylation state of (f); and
(h) determining a change in the test methylation profile as compared to the baseline methylation profile, wherein the change is indicative of a change in a minimal residual disease status of the subject.
33. The method of claim 32, wherein the minimal residual disease status is selected from the group consisting of response to a treatment, tumor load, residual tumor post-surgery, relapse, secondary screen, primary screen, and cancer progression.
34. The method of claim 32, wherein determining the baseline methylation profile comprises a hypermethylation analysis.
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202463644266P | 2024-05-08 | 2024-05-08 | |
| US63/644,266 | 2024-05-08 | | |
| US202463697746P | 2024-09-23 | 2024-09-23 | |
| US63/697,746 | 2024-09-23 | | |
| US202463703374P | 2024-10-04 | 2024-10-04 | |
| US63/703,374 | 2024-10-04 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025235365A1 (en) | 2025-11-13 |
Family ID: 97675594
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2025/027719 (Pending; WO2025235365A1 (en)) | 2024-05-08 | 2025-05-05 | Methods and systems for improved methylation sequencing |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025235365A1 (en) |