WO2025160074A1 - Disease classification with group testing - Google Patents

Disease classification with group testing

Info

Publication number
WO2025160074A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
sample
sequencing
fragments
pooled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/US2025/012431
Other languages
French (fr)
Inventor
Joseph MARCUS
Oliver Claude VENN
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grail Inc
Original Assignee
Grail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Grail Inc

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • cfDNA profiling has emerged as a promising tool for early cancer detection, tumor type classification, and treatment response monitoring. Tumor-specific genomic alterations in cfDNA obtained from cancer patients can be detected and used to determine the presence and type of a tumor.
  • DNA methylation plays an important role in regulating gene expression. Aberrant DNA methylation has been implicated in many disease processes, including cancer. DNA methylation profiling using methylation sequencing is increasingly recognized as a valuable diagnostic tool for detection, diagnosis, and/or monitoring of cancer. For example, specific patterns of differentially methylated regions and/or allele specific methylation patterns may be useful as molecular markers for non-invasive diagnostics using cfDNA.
  • Sequencing of DNA fragments in a cfDNA sample can be used to identify features that can be used for disease classification. For example, in cancer assessment, cell-free DNA based features, such as presence or absence of a somatic variant, methylation status, or other genetic aberrations, from a blood sample can provide insight into whether a subject may have cancer, and further insight on what type of cancer the subject may have. In the real world, however, the vast majority of individuals have no cancer. For a non-cancer individual, the classification can be easy as the classification score may be far from a predetermined threshold. Nevertheless, testing of an individual requires a complicated and long lab process that can be time-consuming and expensive.
  • the instant disclosure provides a technology that can significantly reduce the cost of screening of a group of patients.
  • the patients can be divided into groups, and samples of each group can be pooled to form a pooled sample.
  • a lower-depth, thus cheaper, sequencing can be performed for each pooled sample, which is then subjected to cancer classification, optionally with adjusted classification parameters. If the classification returns a negative result, all samples in the pooled group can be considered to have no cancer.
  • Otherwise, each sample in the pooled group can be further tested with the conventional classification procedure.
  • the cfDNA fragments in each sample are labeled with a sample-specific barcode (SB) so that even in the pooled sample, it is easy to tell which sample each cfDNA is from, thereby facilitating subsequent sample-specific analysis.
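The cost argument for pooling can be made concrete with a short calculation. The following is a minimal sketch and not the method of the disclosure: it assumes an error-free classifier, independent samples, and a single cancer prevalence, and it counts sequencing runs rather than sequencing depth. The function name and its parameters are hypothetical.

```python
def expected_runs_per_subject(prevalence: float, pool_size: int) -> float:
    """Expected number of sequencing runs per subject under two-stage pooling.

    Illustrative assumptions: error-free classifier, independent samples.
    """
    if pool_size < 2:
        return 1.0  # no pooling: every sample is sequenced once
    p_pool_negative = (1.0 - prevalence) ** pool_size
    # One pooled run shared by the whole pool, plus one individual run per
    # subject whenever the pool tests positive.
    return 1.0 / pool_size + (1.0 - p_pool_negative)
```

Under these assumptions, the lower the prevalence, the larger the pool size that minimizes the expected number of runs per subject.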
  • SB sample-specific barcode
  • a method for identifying a sample as from a subject having cancer, comprising: (A) pooling nucleic acid (NA) fragments of a number (N) of samples to generate a pooled sample, wherein each sample is from a subject; sequencing the NA fragments in the pooled sample at a pooled sequencing depth Dp; aligning the sequenced NA fragments to a reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to a cancer classification model to obtain a pooled cancer probability score; comparing the pooled cancer probability score to a pooled cutoff value Cp; and when the pooled cancer probability score is greater than the Cp, subjecting each sample to the steps in (B); and (B) sequencing the NA fragments in the sample at an individual sequencing depth Di; aligning the sequenced NA fragments to the reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to the cancer classification model
  • the Cp is calculated with a cancer incidence rate associated with the demographic of the subjects.
  • the demographic comprises one or more selected from the group consisting of geographic location, gender, age, race, medical history, and employment history.
  • the Cp is calculated with samples of individuals in the demographic.
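The two-stage decision described above can be sketched as control flow. This is a hedged illustration only: the scoring callables stand in for the sequencing, alignment, and classification steps, and the individual cutoff is a hypothetical parameter that the text here does not name.

```python
from typing import Callable, Dict, Sequence


def two_stage_screen(
    sample_ids: Sequence[str],
    score_pool: Callable[[Sequence[str]], float],
    score_sample: Callable[[str], float],
    pooled_cutoff: float,      # Cp in the text
    individual_cutoff: float,  # hypothetical; not named in the text
) -> Dict[str, bool]:
    """Return a cancer call per sample: pooled test first, individual tests if needed."""
    if score_pool(sample_ids) <= pooled_cutoff:
        # Pool is negative: every member is treated as non-cancer.
        return {sid: False for sid in sample_ids}
    # Pool is positive: test each member individually (stage B).
    return {sid: score_sample(sid) > individual_cutoff for sid in sample_ids}
```

Passing the scorers as callables keeps the sketch independent of any particular classifier or sequencing pipeline.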
  • each NA fragment in the pooled sample is ligated to a sample barcode (SB) that is unique to the sample from which the NA fragment is obtained.
  • SB sample barcode
  • the method further comprises, in step (A), identifying NA fragments contributing significantly to the pooled cancer probability score, and one or more sample barcodes ligated to the identified NA fragments.
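One way to picture the barcode lookup is to tally sample barcodes over the fragments that contribute most to the pooled score. The sketch below assumes a per-fragment contribution value has already been computed by some means; the text does not say how, so both the contribution values and the threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def barcodes_of_top_fragments(
    fragments: Iterable[Tuple[str, float]], min_contribution: float
) -> Counter:
    """Count, per sample barcode, the fragments whose contribution meets the threshold.

    Each fragment is (sample_barcode, contribution); both are illustrative inputs.
    """
    return Counter(
        barcode
        for barcode, contribution in fragments
        if contribution >= min_contribution
    )
```

Samples whose barcodes dominate the tally would be natural candidates for individual follow-up testing.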
  • the method further comprises, in each of steps (A) and (B), modifying unmethylated cytosine, prior to sequencing. In some embodiments, the method further comprises, in each of steps (A) and (B), identifying the methylation status of one or more of the NA fragments. In some embodiments, the methylation status is conversion of a cytosine to a 5-methylcytosine (5-mC) or to a 5-hydroxymethylcytosine (5-hmC).
  • the modifying unmethylated cytosine is done with bisulfite or enzymatic treatment.
  • the sequencing in each of steps (A) and (B) is deep sequencing.
  • the step (B) further comprises identifying a tissue origin of the cancer.
  • each sample comprises blood, plasma, serum, semen, milk, urine, saliva or cerebral spinal fluid, acquired from a human subject.
  • the NA fragments are cell-free DNA fragments.
  • FIG. 1 is an exemplary flowchart describing an overall workflow of cancer classification of a sample, according to one or more embodiments.
  • FIG. 2 is an exemplary flowchart describing a process of sequencing a fragment of cell-free (cf) DNA to obtain a methylation state vector, according to one or more embodiments.
  • FIG. 3 illustrates an exemplary flowchart of devices for sequencing and analyzing nucleic acid samples according to one or more embodiments.
  • FIG. 4 is an exemplary flowchart describing a process of sample treatment with a sample barcode, according to one or more embodiments.
  • FIG. 5 illustrates an example generation of feature vectors used for training the cancer classifier, according to one or more embodiments.
  • FIG. 6 is an exemplary flowchart describing a process of group testing followed by individual testing.
  • cell free nucleic acid refers to nucleic acid fragments that circulate in an individual’s body (e.g., blood) and originate from one or more healthy cells and/or from one or more unhealthy cells (e.g., cancer cells).
  • cell free DNA refers to deoxyribonucleic acid fragments that circulate in an individual’s body (e.g., blood). Additionally, cfNAs or cfDNA in an individual’s body may come from other non-human sources.
  • genomic nucleic acid refers to nucleic acid molecules or deoxyribonucleic acid molecules obtained from one or more cells.
  • gDNA can be extracted from healthy cells (e.g., non-tumor cells) or from tumor cells (e.g., a biopsy sample).
  • gDNA can be extracted from a cell derived from a blood cell lineage, such as a white blood cell.
  • circulating tumor DNA refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, and which may be released into a bodily fluid of an individual (e.g., blood, sweat, urine, or saliva) as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
  • DNA fragment may generally refer to any deoxyribonucleic acid fragments, e.g., cfDNA, gDNA, ctDNA, etc.
  • NA fragment may generally refer to any nucleic acid molecule, including DNA molecules and ribonucleic acid (RNA) molecules.
  • amplicon may generally refer to nucleic acid molecules resulting from an amplification process, i.e., including molecules originating from a sample taken from an individual and/or synthetically generated molecules as copies of original molecules.
  • sample barcode may generally refer to a nucleotide sequence that is assigned to a sample and ligated onto sequence reads, for the purpose of accurate assignment of sequence reads as belonging to the sample.
  • molecule identifier may generally refer to a nucleotide sequence that is ligated onto original NA molecules originating from a sample, for the purpose of identifying distinct original NA molecules.
  • unique molecule identifier or “UMI” generally refers to a molecule identifier that is substantially unique compared to other UMIs.
  • Anomalous fragment refers to a fragment that has anomalous methylation of CpG sites.
  • Anomalous methylation of a fragment may be determined using probabilistic models to identify unexpectedness of observing a fragment’s methylation pattern in a control group.
  • UFXM unusual fragment with extreme methylation
  • a hypomethylated fragment and a hypermethylated fragment refer to a fragment with at least some number of CpG sites (e.g., 5) that have over some threshold percentage (e.g., 90%) of unmethylation or methylation, respectively.
  • anomaly score refers to a score for a CpG site based on the number of anomalous fragments (or, in some embodiments, UFXMs) from a sample that overlap that CpG site.
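The thresholds in this definition can be illustrated with a small labeling function. This is a sketch under stated assumptions: CpG states are encoded as 'M' (methylated) and 'U' (unmethylated), a hypermethylated fragment is taken to be mostly methylated and a hypomethylated fragment mostly unmethylated, and the defaults reuse the example values of 5 CpG sites and 90%. The encoding and the function name are hypothetical.

```python
from typing import Optional, Sequence


def extreme_methylation_label(
    states: Sequence[str], min_cpgs: int = 5, min_fraction: float = 0.9
) -> Optional[str]:
    """Label a fragment from its CpG states ('M' methylated, 'U' unmethylated)."""
    called = [s for s in states if s in ("M", "U")]
    if len(called) < min_cpgs:
        return None  # too few CpG sites to qualify
    methylated_fraction = called.count("M") / len(called)
    if methylated_fraction > min_fraction:
        return "hypermethylated"
    if 1.0 - methylated_fraction > min_fraction:
        return "hypomethylated"
    return None
```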
  • the anomaly score is used in context of featurization of a sample for classification.
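A per-site anomaly score of this kind reduces to counting overlaps. The sketch below assumes each anomalous fragment is already expressed as the index of its first CpG site plus the number of CpG sites it spans; that representation and the function name are illustrative choices, not taken from the disclosure.

```python
from collections import Counter
from typing import Iterable, Tuple


def anomaly_scores(anomalous_fragments: Iterable[Tuple[int, int]]) -> Counter:
    """Map CpG index -> number of anomalous fragments covering that CpG site.

    Each fragment is (first_cpg_index, number_of_cpg_sites).
    """
    scores: Counter = Counter()
    for first_cpg, n_cpgs in anomalous_fragments:
        for cpg in range(first_cpg, first_cpg + n_cpgs):
            scores[cpg] += 1
    return scores
```

The resulting counts could then serve as entries of a feature vector for a sample, one entry per CpG site of interest.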
  • the term “about” or “approximately” can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which can depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. “About” can mean a range of ±20%, ±10%, ±5%, or ±1% of a given value. The term “about” or “approximately” can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value.
  • biological sample refers to any sample taken from a subject, which can reflect a biological state associated with the subject, and that includes cell-free DNA.
  • biological samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, feces, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject.
  • a biological sample can include any tissue or material derived from a living or dead subject.
  • a biological sample can be a cell-free sample.
  • a biological sample can comprise a nucleic acid (e.g., DNA or RNA) or a fragment thereof.
  • nucleic acid can refer to deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or any hybrid or fragment thereof.
  • the nucleic acid in the sample can be a cell-free nucleic acid.
  • a sample can be a liquid sample or a solid sample (e.g., a cell or tissue sample).
  • a biological sample can be a bodily fluid, such as blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc.
  • a biological sample can be a stool sample.
  • the majority of DNA in a biological sample that has been enriched for cell-free DNA can be cell-free (e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free).
  • a biological sample can be treated to physically disrupt tissue or cell structure (e.g., centrifugation and/or cell lysis), thus releasing intracellular components into a solution which can further contain enzymes, buffers, salts, detergents, and the like which can be used to prepare the sample for analysis.
  • As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a subject that does not have a particular condition, or is otherwise healthy.
  • a method as disclosed herein can be performed on a subject having a tumor, where the reference sample is a sample taken from a healthy tissue of the subject.
  • a reference sample can be obtained from the subject, or from a database.
  • the reference can be, e.g., a reference genome that is used to map nucleic acid fragment sequences obtained from sequencing a sample from the subject.
  • a reference genome can refer to a haploid or diploid genome to which nucleic acid fragment sequences from the biological sample and a constitutional sample can be aligned and compared.
  • An example of a constitutional sample can be DNA of white blood cells obtained from the subject.
  • For a haploid genome, there can be only one nucleotide at each locus.
  • heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
  • cancer or “tumor” refers to an abnormal mass of tissue in which the growth of the mass surpasses and is not coordinated with the growth of normal tissue.
  • the phrase “healthy” refers to a subject possessing good health.
  • a healthy subject can demonstrate an absence of any malignant or non-malignant disease.
  • a “healthy individual” can have other diseases or conditions, unrelated to the condition being assayed, and so would not normally be considered “healthy.”
  • methylation refers to a modification of deoxyribonucleic acid (DNA) where a hydrogen atom on the pyrimidine ring of a cytosine base is converted to a methyl group, forming 5-methylcytosine.
  • methylation tends to occur at dinucleotides of cytosine and guanine referred to herein as “CpG sites.”
  • methylation may occur at a cytosine not part of a CpG site or at another nucleotide that is not cytosine; however, these are rarer occurrences.
  • Anomalous cfDNA methylation can be identified as hypermethylation or hypomethylation, both of which may be indicative of cancer status.
  • DNA methylation anomalies (compared to healthy controls) can cause different effects, which may contribute to cancer.
  • the principles described herein are equally applicable for the detection of methylation in a CpG context and non-CpG context, including non-cytosine methylation.
  • the methylation state vectors may contain elements that are generally vectors of sites where methylation has or has not occurred (even if those sites are not CpG sites specifically).
  • methylation fragment or “nucleic acid methylation fragment” refers to a sequence of methylation states for each CpG site in a plurality of CpG sites, determined by a methylation sequencing of nucleic acids (e.g., a nucleic acid molecule and/or a nucleic acid fragment).
  • For a methylation fragment, a location and methylation state for each CpG site in the nucleic acid fragment are determined based on the alignment of the sequence reads (e.g., obtained from sequencing of the nucleic acids) to a reference genome.
  • a nucleic acid methylation fragment comprises a methylation state of each CpG site in a plurality of CpG sites (e.g., a methylation state vector), which specifies the location of the nucleic acid fragment in a reference genome (e.g., as specified by the position of the first CpG site in the nucleic acid fragment using a CpG index, or another similar metric) and the number of CpG sites in the nucleic acid fragment. Alignment of a sequence read to a reference genome, based on a methylation sequencing of a nucleic acid molecule, can be performed using a CpG index.
  • CpG index refers to a list of each CpG site in the plurality of CpG sites (e.g., CpG 1, CpG 2, CpG 3, etc.) in a reference genome, such as a human reference genome, which can be in electronic format.
  • the CpG index further comprises a corresponding genomic location, in the corresponding reference genome, for each respective CpG site in the CpG index.
  • Each CpG site in each respective nucleic acid methylation fragment is thus indexed to a specific location in the respective reference genome, which can be determined using the CpG index.
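The relationship between a methylation fragment and the CpG index can be sketched as a small data structure. This is an illustrative assumption about layout, not the disclosed format: the fragment stores the index of its first CpG site and one state per site, and the CpG index is modeled as a dictionary from CpG number to a (chromosome, position) pair. All names and coordinates used with it are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MethylationFragment:
    first_cpg: int           # position of the first CpG site in the CpG index
    states: Tuple[str, ...]  # one state per CpG site, e.g. 'M' or 'U'

    def genomic_sites(
        self, cpg_index: Dict[int, Tuple[str, int]]
    ) -> List[Tuple[str, int, str]]:
        """Resolve each CpG site to (chromosome, position, state) via the CpG index."""
        return [
            (*cpg_index[self.first_cpg + offset], state)
            for offset, state in enumerate(self.states)
        ]
```

Storing only the first CpG number and the state vector keeps a fragment compact, while the shared index supplies genomic coordinates on demand.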
  • TP true positive
  • “True positive” can refer to a subject that has a tumor, a cancer, a pre-cancerous condition (e.g., a pre-cancerous lesion), a localized or a metastasized cancer, or a non-malignant disease. “True positive” can refer to a subject having a condition and is identified as having the condition by an assay or method of the present disclosure. As used herein, the term “true negative” (TN) refers to a subject that does not have a condition or does not have a detectable condition. True negative can refer to a subject that does not have a disease or a detectable disease, such as a tumor, a cancer, a pre-cancerous condition (e.g..
  • True negative can refer to a subject that does not have a condition or does not have a detectable condition, or is identified as not having the condition by an assay or method of the present disclosure.
  • reference genome refers to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus that may be used to reference identified sequences from a subject. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC).
  • NCBI National Center for Biotechnology Information
  • UCSC University of California, Santa Cruz
  • a “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
  • a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals.
  • the reference genome can be viewed as a representative example of a species’ set of genes.
  • a reference genome comprises sequences assigned to chromosomes.
  • Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38).
  • sequence reads refers to nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”), and sometimes are generated from both ends of nucleic acids (e.g., paired-end reads, double-end reads). In some embodiments, sequence reads (e.g., single-end or paired-end reads) can be generated from one or both strands of a targeted nucleic acid fragment. The length of the sequence read is often associated with the particular sequencing technology.
  • High-throughput methods provide sequence reads that can vary in size from tens to hundreds of base pairs (bp).
  • the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp).
  • the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more.
  • Nanopore sequencing can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs.
  • Illumina parallel sequencing can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp.
  • a sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides).
  • a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment.
  • a sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
  • PCR polymerase chain reaction
  • sequencing refers generally to any and all biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins.
  • sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as a DNA fragment.
  • sequencing depth is interchangeably used with the term “coverage” and refers to the number of times a locus is covered by a consensus sequence read corresponding to a unique nucleic acid target molecule aligned to the locus; e.g., the sequencing depth is equal to the number of unique nucleic acid target molecules covering the locus.
  • the locus can be as small as a nucleotide, or as large as a chromosome arm, or as large as an entire genome.
  • Sequencing depth can be expressed as “Yx”, e.g., 50x, 100x, etc., where “Y” refers to the number of times a locus is covered with a sequence corresponding to a nucleic acid target; e.g., the number of times independent sequence information is obtained covering the particular locus.
  • the sequencing depth corresponds to the number of genomes that have been sequenced.
  • Sequencing depth can also be applied to multiple loci, or the whole genome, in which case Y can refer to the mean or average number of times a locus or a haploid genome, or a whole genome, respectively, is sequenced. When a mean depth is quoted, the actual depth for different loci included in the dataset can span over a range of values.
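Mean depth over a region can be computed directly from the intervals of the unique fragments aligned to it. The sketch below assumes half-open genomic intervals on a single chromosome and fragments that have already been collapsed to unique molecules; the function and parameter names are hypothetical.

```python
from typing import Iterable, Tuple


def mean_depth(
    fragments: Iterable[Tuple[int, int]], region_start: int, region_end: int
) -> float:
    """Mean per-base depth over [region_start, region_end) from unique fragment intervals."""
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must be non-empty")
    covered_bases = 0
    for start, end in fragments:
        # Number of region bases this fragment covers (zero if it lies outside).
        overlap = min(end, region_end) - max(start, region_start)
        covered_bases += max(overlap, 0)
    return covered_bases / length
```

Because this is an average, individual loci inside the region can still sit well above or below the returned value.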
  • Ultra-deep sequencing can refer to at least 100x in sequencing depth at a locus.
  • bag refers to a manner of grouping sequence reads together. For example, in demultiplexing, bags may be used to separate sequence reads as belonging to particular samples. As another example, in de-duping, bags may be used to identify sequence reads pertaining to amplicons of the same original DNA fragment in a sample.
  • sensitivity or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.
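  • specificity or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives.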
  • Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
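Both rates are simple ratios of confusion-matrix counts. In the sketch below, sensitivity follows the definition given above, and specificity uses the standard definition (true negatives divided by the sum of true negatives and false positives); the function names are illustrative.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)
```

In a screening population where most subjects have no cancer, even a small drop in specificity produces many false positives in absolute terms, which is why cutoffs are chosen with care.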
  • the term “subject” refers to any living or non-living organism, including but not limited to a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human animal, a plant, a bacterium, a fungus or a protist.
  • Any human or nonhuman animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale, and shark.
  • a subject is a male or female of any stage (e.g., a man, a woman or a child).
  • a subject from whom a sample is taken, or who is treated by any of the methods or compositions described herein, can be of any age and can be an adult, infant or child.
  • tissue can correspond to a group of cells that group together as a functional unit. More than one type of cell can be found in a single tissue. Different types of tissue may consist of different types of cells (e.g., hepatocytes, alveolar cells or blood cells), but also can correspond to tissue from different organisms (mother vs. fetus) or to healthy cells vs. tumor cells.
  • tissue can generally refer to any group of cells found in the human body (e.g., heart tissue, lung tissue, kidney tissue, nasopharyngeal tissue, oropharyngeal tissue).
  • tissue or “tissue type” can be used to refer to a tissue from which a cell-free nucleic acid originates.
  • viral nucleic acid fragments can be derived from blood tissue.
  • viral nucleic acid fragments can be derived from tumor tissue.
  • genomic refers to a characteristic of the genome of an organism. Examples of genomic characteristics include, but are not limited to, those relating to the primary nucleic acid sequence of all or a portion of the genome (e.g., nucleotide polymorphism, indel, sequence rearrangement, mutational frequency, etc.), copy number (e.g., copy number, allele frequency fractions, single chromosome or entire genome ploidy, etc.), epigenetic status (e.g., covalent nucleic acid modifications such as methylation, histone modifications, nucleosome positioning, etc.), and the expression profile of the organism’s genome (e.g., gene expression levels, isotype expression levels, gene expression ratios, etc.).
  • the present disclosure provides devices and methods for disease diagnosis and classification with improved efficiency.
  • the method can be viewed as including two stages, a group testing/pre-screening stage and an individual testing/confirmation stage. Both stages, in some embodiments, entail disease classification with sequencing data, which is described in more detail below.
  • FIG. 1 is an exemplary flowchart describing an overall workflow 100 of disease classification (e.g., cancer classification) of a sample, according to one or more embodiments.
  • a biological sample is collected, for instance, by a healthcare provider.
  • the sample may be a biological sample from an individual patient, or a pooled sample with multiple individual samples, without limitation.
  • Example biological samples include, but are not limited to, tissue biopsy, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, feces, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject.
  • the sample includes genetic material belonging to the individual, which may be extracted and sequenced for cancer classification.
  • the sample is provided to a sequencing device.
  • the healthcare provider may collect other information relating to the individual, e.g., biological sex, age, ethnicity, smoking status, any prior diagnoses, etc.
  • a sequencing device performs sample sequencing 120.
  • a lab clinician may perform one or more processing steps on the sample in preparation for sequencing. Once prepared, the clinician loads the sample in the sequencing device.
  • the sequencing device generally extracts and isolates fragments of nucleic acid that are sequenced to determine a sequence of nucleobases corresponding to the fragments.
  • Sample sequencing includes sample treatment in preparation for sequencing of the fragments in the sample. Sample treatment may include one or more ligation steps, and amplification of the nucleic acid material.
  • the sample treatment includes ligation of a sample barcode.
  • the sample barcode is a polynucleotide sequence that is substantially unique to each sample.
  • the sample barcode is ligated onto each fragment in a sample prior to indexing and sequencing.
  • Each polynucleotide sequence in a sample can also be ligated to a unique molecule identifier (UMI).
  • UMI unique molecule identifier
  • the unique molecule identifiers are also polynucleotide sequences that are ligated onto each fragment originating in the sample, e.g., prior to amplification.
  • the unique molecule identifiers may be utilized in de-duping sequence reads to identify unique fragments originating in the sample.
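  • As a rough illustration of the de-duplication described above, the sketch below groups reads by sample barcode, UMI, and alignment start, and keeps one read per group. The record fields (`barcode`, `umi`, `start`, `seq`) and the choice of grouping key are assumptions made for illustration; the source does not specify how de-duplication is implemented.

```python
from collections import OrderedDict
from typing import NamedTuple


class Read(NamedTuple):
    barcode: str  # sample barcode (hypothetical field name)
    umi: str      # unique molecule identifier (hypothetical field name)
    start: int    # alignment start position (hypothetical field name)
    seq: str      # read sequence


def dedupe_reads(reads):
    """Keep the first read seen for each (barcode, UMI, start) key."""
    unique = OrderedDict()
    for read in reads:
        key = (read.barcode, read.umi, read.start)
        unique.setdefault(key, read)
    return list(unique.values())


reads = [
    Read("S1", "ACGT", 100, "TTGA"),
    Read("S1", "ACGT", 100, "TTGA"),  # amplification duplicate of the first read
    Read("S1", "GGCA", 100, "TTGA"),  # same position, different UMI
]
unique_reads = dedupe_reads(reads)
```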
  • Sequencing may be whole-genome sequencing or targeted sequencing with a target panel.
  • bisulfite sequencing can determine methylation status through bisulfite conversion of unmethylated cytosines at CpG sites.
  • Sample sequencing 120 yields sequences for a plurality of nucleic acid fragments in the sample.
  • the sequences may include methylation state vectors, wherein each methylation state vector describes the methylation statuses for CpG sites on a fragment.
  • cfDNA fragments from an individual can be first treated, for example by converting unmethylated cytosines to uracils, prior to sequencing.
  • the sequence reads can then be compared to a reference genome to identify the methylation states at specific CpG sites within the DNA fragments. Each CpG site may be methylated or unmethylated.
  • Identification of anomalously methylated fragments, in comparison to healthy individuals, may provide insight into a subject’s cancer status.
  • Methylation can typically occur in deoxyribonucleic acid (DNA) when a hydrogen atom on the pyrimidine ring of a cytosine base is converted to a methyl group, forming 5-methylcytosine.
  • methylation can occur at dinucleotides of cytosine and guanine referred to herein as “CpG sites”.
  • methylation may occur at a cytosine not part of a CpG site or at another nucleotide that is not cytosine; however, these are rarer occurrences.
  • Anomalous DNA methylation can be identified as hypermethylation or hypomethylation, both of which may be indicative of cancer status. Hypermethylation and hypomethylation can be characterized for a DNA fragment, if the DNA fragment comprises more than a threshold number of CpG sites with more than a threshold percentage of those C
  • the principles described herein can be equally applicable for the detection of methylation in a non-CpG context, including non-cytosine methylation.
  • the wet laboratory assay used to detect methylation may vary from those described herein.
  • the methylation state vectors discussed herein may contain elements that are generally sites where methylation has or has not occurred (even if those sites are not CpG sites specifically). With that substitution, the remainder of the processes described herein can be the same, and consequently the inventive concepts described herein can be applicable to those other forms of methylation.
  • An analytics system performs pre-analysis processing 130.
  • Pre-analysis processing 130 may include, but is not limited to, demultiplexing, de-duplication of sequence reads, determining metrics relating to coverage, identification of contamination events, determining whether the sample is contaminated, remedial measures to contamination events, calling sequencing error, performing remedial measures, etc.
  • the analytics system collects a set of sequence reads pertaining to the sample usable for the analyses 140.
  • the analytics system performs one or more analyses 140.
  • the analyses are statistical analyses or application of one or more trained models to predict at least a cancer status of the individual from whom the sample is derived. Different genetic features may be evaluated and considered, such as methylation of CpG sites, single nucleotide polymorphisms (SNPs), insertions or deletions (indels), other types of genetic mutation, etc.
  • analyses 140 may include anomalous methylation identification 142, feature extraction 144, and applying a cancer classifier 146 to determine a cancer prediction.
  • the analytics system may utilize one or more age covariate prediction models to generate one or more age covariate residuals as features to cancer classification.
  • the cancer classifier 146 inputs the extracted features to determine a cancer prediction.
  • the cancer prediction may be a label or a value.
  • the label may indicate a particular cancer state, e.g., binary labels can indicate presence or absence of cancer, multiclass labels can indicate one or more cancer types from a plurality of cancer types that are screened for.
  • the value may indicate a likelihood of a particular cancer state, e.g., a likelihood of cancer, and/or a likelihood of a particular cancer type.
  • the analytics system returns the prediction 150.
  • the prediction, for an individual sample obtained from a single patient, may be whether the patient has cancer.
  • the prediction may be whether at least one of the patients has cancer, suggesting that further testing of individual samples is needed.
  • the DNA fragments can be treated prior to the sequencing to convert unmethylated cytosines to uracils.
  • the method uses a bisulfite treatment of the DNA which converts the unmethylated cytosines to uracils without converting the methylated cytosines.
  • a commercial kit such as the EZ DNA Methylation™-Gold, EZ DNA Methylation™-Direct, or EZ DNA Methylation™-Lightning kit (available from Zymo Research Corp (Irvine, CA)) is used for the bisulfite conversion.
  • the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic reaction.
  • the conversion can use a commercially available kit for conversion of unmethylated cytosines to uracils, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).
  • a sequencing library can be prepared.
  • unique molecular identifiers (UMIs) can be added to the nucleic acid molecules (e.g., DNA molecules) through adapter ligation.
  • the UMIs can be short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments (e.g., DNA molecules fragmented by physical shearing, enzymatic digestion, and/or chemical fragmentation) during adapter ligation.
  • UMIs can be degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment.
  • the UMIs can be replicated along with the attached DNA fragment. This can provide a way to identify sequence reads that came from the same original fragment in downstream analysis.
  • the sequencing library may be enriched for DNA fragments, or genomic regions, that are informative for cancer status using a plurality of hybridization probes.
  • the hybridization probes are short oligonucleotides capable of hybridizing to particularly specified DNA fragments, or targeted regions, and enriching for those fragments or regions for subsequent sequencing and analysis.
  • Hybridization probes may be used to perform a targeted, high-depth analysis of a set of specified CpG sites of interest to the researcher.
  • Hybridization probes can be tiled across one or more target sequences at a coverage of 1X, 2X, 3X, 4X, 5X, 6X, 7X, 8X, 9X, 10X, or more than 10X.
  • hybridization probes tiled at a coverage of 2X comprise overlapping probes such that each portion of the target sequence is hybridized to 2 independent probes.
  • Hybridization probes can be tiled across one or more target sequences at a coverage of less than 1X.
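  • To make the notion of tiling coverage concrete, the sketch below places fixed-length probes at a regular step so that every base of a target is overlapped by the requested number of probes. It assumes the probe length is divisible by the coverage and that probes may overhang the flanks of the target; the function names and parameters are hypothetical and are not taken from the source.

```python
def tile_probes(target_len, probe_len, coverage):
    """Return (start, end) probe intervals, end exclusive, tiled at `coverage`X.

    Assumption: probe_len is divisible by coverage, and probes are allowed to
    overhang the target so that edge bases receive full coverage.
    """
    if probe_len % coverage != 0:
        raise ValueError("probe_len must be divisible by coverage in this sketch")
    step = probe_len // coverage
    first_start = -(probe_len - step)  # negative start = overhang on the left flank
    return [(s, s + probe_len) for s in range(first_start, target_len, step)]


def coverage_per_base(target_len, probes):
    """Count how many probes overlap each base of the target."""
    return [
        sum(start <= pos < end for start, end in probes)
        for pos in range(target_len)
    ]


probes = tile_probes(target_len=300, probe_len=120, coverage=2)
depth = coverage_per_base(300, probes)
```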
  • the hybridization probes are designed to enrich for DNA molecules that have been treated (e.g., using bisulfite) for conversion of unmethylated cytosines to uracils.
  • hybridization probes (also referred to herein as “probes”) can be used to target and pull down nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer class or tissue of origin).
  • the probes may be designed to anneal (or hybridize) to a target (complementary) strand of DNA.
  • the target strand may be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand.
  • the probes may range in length from tens to hundreds or thousands of base pairs.
  • the probes can be designed based on a methylation site panel.
  • the probes can be designed based on a panel of targeted genes to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases.
  • the probes may cover overlapping portions of a target region.
  • One or more alternative sequencing methods can be used for obtaining sequence reads from nucleic acids in a biological sample.
  • the one or more sequencing methods can comprise any form of sequencing that can be used to obtain a number of sequence reads measured from nucleic acids (e.g., cell-free nucleic acids), including, but not limited to, high-throughput sequencing systems such as the Roche 454 platform, the Applied Biosystems SOLID platform, the Helicos True Single Molecule DNA sequencing technology, the sequencing-by-hybridization platform from Affymetrix Inc., the single-molecule, real-time (SMRT) technology of Pacific Biosciences, the sequencing-by-synthesis platforms from 454 Life Sciences, Illumina/Solexa and Helicos Biosciences, and the sequencing-by-ligation platform from Applied Biosystems.
  • the ION TORRENT technology from Life technologies and Nanopore sequencing can also be used to obtain sequence reads from the nucleic acids (e.g., cell-free nucleic acids) in the biological sample.
  • Sequencing-by-synthesis and reversible terminator-based sequencing (e.g., Illumina’s Genome Analyzer, Genome Analyzer II, HISEQ 2000, or HISEQ 4500 (Illumina, San Diego, Calif.)) may also be used.
  • Millions of cell-free nucleic acid (e.g., DNA) fragments can be sequenced in parallel.
  • a flow cell contains an optically transparent slide with eight individual lanes on the surfaces of which are bound oligonucleotide anchors (e.g., adaptor primers).
  • a cell-free nucleic acid sample can include a signal or tag that facilitates detection.
  • the acquisition of sequence reads from the cell-free nucleic acid obtained from the biological sample can include obtaining quantification information of the signal or tag via a variety of techniques such as, for example, flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, gene-chip analysis, microarray, mass spectrometry, cytofluorimetric analysis, fluorescence microscopy, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, sequencing, and combinations thereof.
  • the one or more sequencing methods can include a whole-genome sequencing assay.
  • a whole-genome sequencing assay can comprise a physical assay that generates sequence reads for a whole genome or a substantial portion of the whole genome which can be used to determine large variations such as copy number variations or copy number aberrations.
  • Such a physical assay may employ whole-genome sequencing techniques or whole-exome sequencing techniques.
  • a whole-genome sequencing assay can have an average sequencing depth of at least 1x, 2x, 3x, 4x, 5x, 6x, 7x, 8x, 9x, 10x, at least 20x, at least 30x, or at least 40x across the genome of the test subject. In some embodiments, the sequencing depth is about 30,000x.
  • the one or more sequencing methods can comprise a targeted panel sequencing assay.
  • a targeted panel sequencing assay can have an average sequencing depth of at least 50,000x, at least 55,000x, at least 60,000x, or at least 70,000x sequencing depth for the targeted panel of genes.
  • the targeted panel of genes can comprise between 450 and 500 genes.
  • the targeted panel of genes can comprise a range of 500±5 genes, a range of 500±10 genes, or a range of 500±25 genes.
  • the one or more sequencing methods can include paired-end sequencing.
  • the one or more sequencing methods can generate a plurality of sequence reads.
  • the plurality of sequence reads can have an average length ranging between 10 and 700, between 50 and 400, or between 100 and 300.
  • the one or more sequencing methods can comprise a methylation sequencing assay.
  • the methylation sequencing can be i) whole-genome methylation sequencing or ii) targeted DNA methylation sequencing using a plurality of nucleic acid probes.
  • the methylation sequencing is whole-genome bisulfite sequencing (e.g., WGBS).
  • the methylation sequencing can be a targeted DNA methylation sequencing using a plurality of nucleic acid probes targeting the most informative regions of the methylome, a unique methylation database and prior prototype whole-genome and targeted sequencing assays.
  • the methylation sequencing can detect one or more 5-methylcytosine (5mC) and/or 5-hydroxymethylcytosine (5hmC) in respective nucleic acid methylation fragments.
  • the methylation sequencing can comprise conversion of one or more unmethylated cytosines or one or more methylated cytosines, in respective nucleic acid methylation fragments, to a corresponding one or more uracils.
  • the one or more uracils can be detected during the methylation sequencing as one or more corresponding thymines.
  • the conversion of one or more unmethylated cytosines or one or more methylated cytosines can comprise a chemical conversion, an enzymatic conversion, or combinations thereof.
  • bisulfite conversion involves converting cytosine to uracil while leaving methylated cytosines (e.g., 5-methylcytosine or 5-mC) intact.
  • about 95% of cytosines may not be methylated in the DNA, and the resulting DNA fragments may include many uracils which are represented by thymines.
  • Enzymatic conversion processes may be used to treat the nucleic acids prior to sequencing, which can be performed in various ways.
  • a bisulfite-free conversion comprises a bisulfite-free and base-resolution sequencing method, TET-assisted pyridine borane sequencing (TAPS), for non-destructive and direct detection of 5-methylcytosine and 5-hydroxymethylcytosine without affecting unmodified cytosines.
  • the methylation state of a CpG site in the corresponding plurality of CpG sites in the respective nucleic acid methylation fragment can be methylated when the CpG site is determined by the methylation sequencing to be methylated, and unmethylated when the CpG site is determined by the methylation sequencing to not be methylated.
  • a methylation sequencing assay (e.g., WGBS and/or targeted methylation sequencing) can have an average sequencing depth including but not limited to up to about 1,000x, 2,000x, 3,000x, 5,000x, 10,000x, 15,000x, 20,000x, or 30,000x.
  • the methylation sequencing can have a sequencing depth that is greater than 30,000x, e.g., at least 40,000x or 50,000x.
  • a whole-genome bisulfite sequencing method can have an average sequencing depth of between 20x and 50x, and a targeted methylation sequencing method has an average effective depth of between 100x and 1000x, where effective depth can be the equivalent whole-genome bisulfite sequencing coverage for obtaining the same number of sequence reads obtained by targeted methylation sequencing.
  • the methylation sequencing of nucleic acids and the resulting one or more methylation state vectors can be used to obtain a plurality of nucleic acid methylation fragments.
  • Each corresponding plurality of nucleic acid methylation fragments (e.g., for each respective genotypic dataset) can comprise more than 100 nucleic acid methylation fragments.
  • An average number of nucleic acid methylation fragments across each corresponding plurality of nucleic acid methylation fragments can comprise 1000 or more nucleic acid methylation fragments, 5000 or more nucleic acid methylation fragments, 10,000 or more nucleic acid methylation fragments, 20,000 or more nucleic acid methylation fragments, or 30,000 or more nucleic acid methylation fragments.
  • An average number of nucleic acid methylation fragments across each corresponding plurality of nucleic acid methylation fragments can be between 10,000 nucleic acid methylation fragments and 50,000 nucleic acid methylation fragments.
  • the corresponding plurality of nucleic acid methylation fragments can comprise one thousand or more, ten thousand or more, 100 thousand or more, one million or more, ten million or more, 100 million or more, 500 million or more, one billion or more, two billion or more, three billion or more, four billion or more, five billion or more, six billion or more, seven billion or more, eight billion or more, nine billion or more, or 10 billion or more nucleic acid methylation fragments.
  • An average length of a corresponding plurality of nucleic acid methylation fragments can be between 140 and 480 nucleotides.
  • the sequence reads may be in a computer-readable, digital format for processing and interpretation by computer software.
  • the sequence reads may be aligned to a reference genome to determine alignment position information.
  • the alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read.
  • Alignment position information may also include sequence read length, which can be determined from the beginning position and end position.
  • a region in the reference genome may be associated with a gene or a segment of a gene.
  • a sequence read can be comprised of a read pair denoted as $R_1$ and $R_2$.
  • the first read $R_1$ may be sequenced from a first end of a nucleic acid fragment whereas the second read $R_2$ may be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read $R_1$ and second read $R_2$ may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair $R_1$ and $R_2$ may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., $R_1$) and an end position in the reference genome that corresponds to an end of a second read (e.g., $R_2$).
  • the beginning position and end position in the reference genome can represent the likely location within the reference genome to which the nucleic acid fragment corresponds.
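  • A minimal sketch of how a fragment's likely location could be derived from a read pair is shown below. It assumes each read's alignment is given as an inclusive, 1-based `(begin, end)` interval on the same reference sequence; this representation and the function name are illustrative assumptions, not details from the source.

```python
def fragment_span(r1, r2):
    """Return (begin, end, length) of the fragment implied by a read pair.

    r1 and r2 are (begin, end) alignment intervals, inclusive and 1-based
    (an assumed convention for this sketch).
    """
    begin = min(r1[0], r2[0])
    end = max(r1[1], r2[1])
    return begin, end, end - begin + 1


# Two 100-base reads sequenced from opposite ends of one fragment.
span = fragment_span((1000, 1099), (1068, 1167))
```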
  • An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis such as methylation state determination.
  • the analytics system determines a location and methylation state for each CpG site based on alignment to a reference genome.
  • the analytics system generates a methylation state vector for each fragment specifying a location of the fragment in the reference genome (e.g., as specified by the position of the first CpG site in each fragment, or another similar metric), a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment whether methylated (e.g., denoted as M), unmethylated (e.g., denoted as U), or indeterminate (e.g., denoted as I).
  • Observed states can be states of methylated and unmethylated; whereas, an unobserved state is indeterminate.
  • Indeterminate methylation states may originate from sequencing errors and/or disagreements between methylation states of a DNA fragment's complementary strands.
  • the methylation state vectors may be stored in temporary or persistent computer memory for later use and processing.
  • the analytics system may remove duplicate reads or duplicate methylation state vectors from a single sample.
  • the analytics system may determine that a certain fragment with one or more CpG sites has an indeterminate methylation status over a threshold number or percentage, and may exclude such fragments or selectively include such fragments but build a model accounting for such indeterminate methylation statuses.
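  • The exclusion of fragments with too many indeterminate sites could look like the sketch below. Methylation state vectors are represented as strings over `M`, `U`, and `I`, and the 25% cutoff is an arbitrary illustrative value; the source leaves the threshold unspecified.

```python
def indeterminate_fraction(states):
    """Fraction of CpG sites in a state string whose state is 'I'."""
    if not states:
        return 0.0
    return states.count("I") / len(states)


def filter_fragments(vectors, max_fraction=0.25):
    """Keep vectors whose indeterminate fraction is at most max_fraction.

    max_fraction is a hypothetical threshold chosen for this example.
    """
    return [v for v in vectors if indeterminate_fraction(v) <= max_fraction]


vectors = ["MMUM", "MIIU", "UUUI"]
kept = filter_fragments(vectors)
```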
  • FIG. 2 is an exemplary illustration of methylation sequencing a cfDNA molecule to obtain a methylation state vector, according to one or more embodiments.
  • the analytics system receives a cfDNA molecule 242 that, in this example, contains three CpG sites. As shown, the first and third CpG sites of the cfDNA molecule 242 are methylated 244. During the treatment step 250, the cfDNA molecule 242 is converted to generate a converted cfDNA molecule 252. During the treatment 250, the second CpG site which was unmethylated has its cytosine converted to uracil. However, the first and third CpG sites were not converted.
  • a sequencing library is prepared and the molecule sequenced 260 to generate a sequence read 262.
  • the analytics system aligns the sequence read 262 to a reference genome 264.
  • the reference genome 264 provides the context as to what position in a human genome the fragment cfDNA originates from.
  • the analytics system aligns 270 the sequence read 262 such that the three CpG sites correlate to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for convenience of description).
  • the analytics system can thus generate information both on methylation status of all CpG sites on the cfDNA molecule 242 and the position in the human genome that the CpG sites map to.
  • the CpG sites on sequence read 262 which are methylated are read as cytosines.
  • the cytosines appear in the sequence read 262 only in the first and third CpG site which allows one to infer that the first and third CpG sites in the original cfDNA molecule are methylated.
  • the second CpG site can be read as a thymine (U is converted to T during the sequencing process), and thus, one can infer that the second CpG site is unmethylated in the original cfDNA molecule.
  • With these two pieces of information, the methylation status and location, the analytics system generates 270 a methylation state vector 272 for the fragment cfDNA 242.
  • the resulting methylation state vector 272 is $\langle M_{23}, U_{24}, M_{25} \rangle$, wherein M corresponds to a methylated CpG site, U corresponds to an unmethylated CpG site, and the subscript number corresponds to a position of each CpG site in the reference genome.
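  • The FIG. 2 walk-through can be mimicked with a toy example. The sketch below assumes a converted read that aligns to the forward strand of a short reference with no offset, indels, or mismatches, and a lookup table mapping reference positions to CpG identifiers; the sequences, the table, and the function name are all invented for illustration.

```python
def methylation_state_vector(read, reference, cpg_ids):
    """Call a state at every reference CpG covered by an aligned, converted read.

    A 'C' at the cytosine position means the site resisted conversion
    (methylated); a 'T' means it was converted (unmethylated); anything else
    is treated as indeterminate.
    """
    vector = []
    for pos in range(len(reference) - 1):
        if reference[pos:pos + 2] != "CG":
            continue
        base = read[pos]
        if base == "C":
            state = "M"
        elif base == "T":
            state = "U"
        else:
            state = "I"
        vector.append((state, cpg_ids[pos]))
    return vector


reference = "ACGTTCGAACGT"       # CpG cytosines at offsets 1, 5, and 9
read = "ACGTTTGAACGT"            # middle CpG cytosine was converted and reads as T
cpg_ids = {1: 23, 5: 24, 9: 25}  # arbitrary identifiers, as in FIG. 2
vector = methylation_state_vector(read, reference, cpg_ids)
```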
  • the analytics system determines anomalous fragments for a sample using the sample’s methylation state vectors. For each fragment in a sample, the analytics system determines whether the fragment is an anomalous fragment using the methylation state vector corresponding to the fragment. In one embodiment, the analytics system calculates a p-value score for each methylation state vector describing a probability of observing that methylation state vector or other methylation state vectors even less probable in the healthy control group.
  • the analytics system may determine fragments with a methylation state vector having below a threshold p-value score as anomalous fragments.
  • the analytics system further labels fragments with at least some number of CpG sites that have over some threshold percentage of methylation or unmethylation as hypermethylated and hypomethylated fragments, respectively.
  • a hypermethylated fragment or a hypomethylated fragment may also be referred to as an unusual fragment with extreme methylation (UFXM).
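  • A sketch of the hypermethylated / hypomethylated labeling follows. The minimum of 5 observed CpG sites and the 80% cutoff are placeholder values standing in for the unspecified "some number" and "some threshold percentage"; the function name is likewise hypothetical.

```python
def label_extreme(states, min_sites=5, min_fraction=0.8):
    """Label a state string as hyper- or hypomethylated, or return None.

    min_sites and min_fraction are illustrative thresholds, not values
    given in the source. Indeterminate sites ('I') are ignored.
    """
    observed = [s for s in states if s in ("M", "U")]
    if len(observed) < min_sites:
        return None
    methylated_fraction = observed.count("M") / len(observed)
    if methylated_fraction >= min_fraction:
        return "hypermethylated"
    if 1 - methylated_fraction >= min_fraction:
        return "hypomethylated"
    return None


labels = [label_extreme(s) for s in ("MMMMMU", "UUUUUU", "MMMUUU", "MMUU")]
```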
  • the analytics system may implement various other probabilistic models for determining anomalous fragments. Examples of other probabilistic models include a mixture model, a deep probabilistic model, etc.
  • the analytics system may use any combination of the processes described below for identifying anomalous fragments. With the identified anomalous fragments, the analytics system may filter the set of methylation state vectors for a sample for use in other processes, e.g., for use in training and deploying a cancer classifier.
  • the analytics system calculates a p-value score for each methylation state vector compared to methylation state vectors from fragments in a healthy control group.
  • the p-value score describes a probability of observing the methylation status matching that methylation state vector or other methylation state vectors even less probable in the healthy control group.
  • the analytics system uses a healthy control group with a majority of fragments that are normally methylated. When conducting this probabilistic analysis for determining anomalous fragments, the determination holds weight in comparison with the group of control subjects that make up the healthy control group. To ensure robustness in the healthy control group, the analytics system may select some threshold number of healthy individuals to source samples including DNA fragments.
  • the analytics system receives a plurality of DNA fragments (e.g., cfDNA) from a plurality of healthy individuals.
  • a methylation state vector is identified for each fragment.
  • the analytics system subdivides the methylation state vector into strings of CpG sites.
  • the analytics system subdivides the methylation state vector such that the resulting strings are all less than or equal to a given length.
  • a methylation state vector of length 11 being subdivided into strings of length less than or equal to 3 would result in 9 strings of length 3, 10 strings of length 2, and 11 strings of length 1.
  • a methylation state vector of length 7 being subdivided into strings of length less than or equal to 4 would result in 4 strings of length 4, 5 strings of length 3, 6 strings of length 2, and 7 strings of length 1.
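  • The string counts in the two examples above can be checked with a short sketch that enumerates every contiguous substring up to a maximum length. The string representation of a methylation state vector and the function name are assumptions made for illustration.

```python
from collections import Counter


def subdivide(states, max_len):
    """Return all contiguous substrings of `states` with length 1..max_len."""
    return [
        states[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(states) - k + 1)
    ]


# Count substrings by length for the two examples in the text.
counts_11_3 = Counter(len(s) for s in subdivide("M" * 11, 3))
counts_7_4 = Counter(len(s) for s in subdivide("M" * 7, 4))
```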
  • the methylation state vector may be converted into a single string containing all of the CpG sites of the vector.
  • the analytics system tallies the strings by counting, for each possible CpG site and possibility of methylation states in the vector, the number of strings present in the control group having the specified CpG site as the first CpG site in the string and having that possibility of methylation states. For example, at a given CpG site and considering string lengths of 3, there are $2^3$ or 8 possible string configurations. At that given CpG site, for each of the 8 possible string configurations, the analytics system tallies how many occurrences of each methylation state vector possibility come up in the control group.
  • this may involve tallying the following quantities: $\langle M_x, M_{x+1}, M_{x+2} \rangle$, $\langle M_x, M_{x+1}, U_{x+2} \rangle$, . . ., $\langle U_x, U_{x+1}, U_{x+2} \rangle$ for each starting CpG site $x$ in the reference genome.
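  • The tallying step might be sketched as a counter keyed by starting CpG site and state string. The example assumes CpG sites carry consecutive integer indices, that each control fragment is given as `(first_site, states)`, and that strings containing an indeterminate state are skipped; none of these representation choices come from the source.

```python
from collections import Counter


def tally_strings(fragments, max_len=3):
    """Count (starting CpG site, state string) occurrences across fragments."""
    counts = Counter()
    for first_site, states in fragments:
        for k in range(1, max_len + 1):
            for i in range(len(states) - k + 1):
                sub = states[i:i + k]
                if "I" in sub:
                    continue  # assumption: only fully observed strings are tallied
                counts[(first_site + i, sub)] += 1
    return counts


# Toy healthy-control fragments: (index of first CpG site, state string).
control = [(23, "MUM"), (23, "MUU"), (24, "UM")]
counts = tally_strings(control)
```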
  • the analytics system creates the data structure storing the tallied counts for each starting CpG site and string possibility.
  • To identify anomalously methylated fragments from an individual, the analytics system generates methylation state vectors from cfDNA fragments of the subject.
  • the analytics system enumerates all possibilities of methylation state vectors having the same starting CpG site and same length (i.e., set of CpG sites) in the methylation state vector.
  • Because each methylation state is generally either methylated or unmethylated, there are effectively two possible states at each CpG site, and thus the count of distinct possibilities of methylation state vectors depends on a power of 2, such that a methylation state vector of length $n$ would be associated with $2^n$ possibilities of methylation state vectors.
  • the analytics system may enumerate possibilities of methylation state vectors considering only CpG sites that have observed states.
  • the analytics system calculates the probability of observing each possibility of methylation state vector for the identified starting CpG site and methylation state vector length by accessing the healthy control group data structure.
  • calculating the probability of observing a given possibility uses a Markov chain probability to model the joint probability calculation.
  • calculation methods other than Markov chain probabilities are used to determine the probability of observing each possibility of methylation state vector.
  • the analytics system calculates a p-value score for the methylation state vector using the calculated probabilities for each possibility. In one embodiment, this includes identifying the calculated probability corresponding to the possibility that matches the methylation state vector in question. Specifically, this is the possibility having the same set of CpG sites, or similarly the same starting CpG site and length as the methylation state vector. The analytics system sums the calculated probabilities of any possibilities having probabilities less than or equal to the identified probability to generate the p-value score.
  • This p-value represents the probability of observing the methylation state vector of the fragment or other methylation state vectors even less probable in the healthy control group.
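  • The p-value calculation can be sketched with a toy first-order Markov chain. For brevity the sketch uses a single set of made-up initial and transition probabilities for every CpG site, whereas the text derives site-specific probabilities from the healthy control data structure; the numbers and function names are illustrative assumptions only.

```python
from itertools import product


def markov_probability(states, initial, transition):
    """Joint probability of a state sequence under a first-order Markov chain."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob


def p_value(observed, initial, transition):
    """Sum the probabilities of all possibilities no more probable than `observed`."""
    p_obs = markov_probability(observed, initial, transition)
    total = 0.0
    for candidate in product("MU", repeat=len(observed)):
        p = markov_probability(candidate, initial, transition)
        if p <= p_obs + 1e-12:  # small tolerance for floating-point ties
            total += p
    return total


# Made-up probabilities standing in for values derived from healthy controls.
initial = {"M": 0.9, "U": 0.1}
transition = {"M": {"M": 0.9, "U": 0.1}, "U": {"M": 0.2, "U": 0.8}}

p_common = p_value("MMM", initial, transition)  # most probable pattern
p_rare = p_value("UMU", initial, transition)    # least probable pattern
```

  • In this toy model the most probable vector receives a p-value of 1, since every possibility is at most as probable as it is, while the least probable vector receives a p-value equal to its own probability.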
  • a low p-value score thereby, generally corresponds to a methylation state vector which is rare in a healthy individual, and which causes the fragment to be labeled anomalously methylated, relative to the healthy control group.
  • a high p-value score generally relates to a methylation state vector that is expected to be present, in a relative sense, in a healthy individual. If the healthy control group is a non-cancerous group, for example, a low p-value indicates that the fragment is anomalously methylated relative to the non-cancer group, and therefore possibly indicative of the presence of cancer in the test subject.
  • the analytics system calculates p-value scores for each of a plurality of methylation state vectors, each representing a cfDNA fragment in the test sample.
  • the analytics system may filter the set of methylation state vectors based on their p-value scores. In one embodiment, filtering is performed by comparing the p-value scores against a threshold and keeping only those fragments below the threshold. This threshold p-value score could be on the order of 0.1, 0.01, 0.001, 0.0001, or similar.
  • the analytics system yields a median (range) of 2,800 (1,500-12,000) fragments with anomalous methylation patterns for participants without cancer in training, and a median (range) of 3,000 (1,200-220,000) fragments with anomalous methylation patterns for participants with cancer in training.
  • the analytics system uses a sliding window to determine possibilities of methylation state vectors and calculate p-values. Rather than enumerating possibilities and calculating p-values for entire methylation state vectors, the analytics system enumerates possibilities and calculates p-values for only a window of sequential CpG sites, where the window is shorter in length (of CpG sites) than at least some fragments (otherwise, the window would serve no purpose).
  • the window length may be static, user determined, dynamic, or otherwise selected.
  • In calculating p-values for a methylation state vector larger than the window, the window identifies the sequential set of CpG sites from the vector within the window starting from the first CpG site in the vector.
  • the analytics system calculates a p-value score for the window including the first CpG site.
  • the analytics system then “slides” the window to the second CpG site in the vector, and calculates another p-value score for the second window.
  • For a window of length $l$ and a methylation state vector of length $m$, each methylation state vector will generate $m - l + 1$ p-value scores.
  • the analytics system aggregates the p-value scores for the methylation state vectors to generate an overall p-value score.
  • Using the sliding window helps to reduce the number of enumerated possibilities of methylation state vectors and their corresponding probability calculations that would otherwise need to be performed. To give a realistic example, it is possible for fragments to have upwards of 54 CpG sites.
  • the analytics system can instead use a window of size 5 (for example) which results in 50 p-value calculations for each of the 50 windows of the methylation state vector for that fragment.
  • Each of the 50 calculations enumerates 2^5 (32) possibilities of methylation state vectors, which in total results in 50×2^5 (1.6×10^3) probability calculations. This results in a vast reduction of calculations to be performed, with no meaningful hit to the accurate identification of anomalous fragments.
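The arithmetic in the sliding-window example can be checked with a short sketch. The function names are illustrative, and the handling of a vector shorter than the window (scored as a single whole-vector window) is an assumption rather than something the source specifies.

```python
def window_starts(m, l):
    """Start indices of every length-l window over a vector of m CpG sites."""
    if l > m:
        return [0]  # assumption: a vector shorter than the window is scored whole
    return list(range(m - l + 1))


def enumeration_cost(m, l):
    """Number of probability evaluations with and without the sliding window."""
    windows = len(window_starts(m, l))
    return {
        "windows": windows,
        "windowed": windows * 2 ** min(l, m),
        "full": 2 ** m,
    }


# The example from the text: a fragment with 54 CpG sites and a window of size 5.
cost = enumeration_cost(54, 5)
```

For 54 CpG sites and a window of 5, the sketch gives 50 windows and 1,600 enumerated possibilities, against 2^54 for the unwindowed vector.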
  • the analytics system may calculate a p-value score summing out CpG sites with indeterminate states in a fragment’s methylation state vector.
  • the analytics system identifies all possibilities that have consensus with all methylation states of the methylation state vector, excluding the indeterminate states.
  • the analytics system may assign the probability to the methylation state vector as a sum of the probabilities of the identified possibilities.
  • the analytics system calculates a probability of a methylation state vector of <M1, I2, U3> as a sum of the probabilities for the possibilities of methylation state vectors of <M1, M2, U3> and <M1, U2, U3>, since methylation states for CpG sites 1 and 3 are observed and in consensus with the fragment’s methylation states at CpG sites 1 and 3.
  • This method of summing out CpG sites with indeterminate states uses calculations of probabilities of possibilities up to 2^i, wherein i denotes the number of indeterminate states in the methylation state vector.
  • a dynamic programming algorithm may be implemented to calculate the probability of a methylation state vector with one or more indeterminate states.
  • the dynamic programming algorithm operates in linear computational time.
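The two approaches to indeterminate states can be compared in a small sketch. It assumes, purely for illustration, a first-order Markov chain over methylated (M) and unmethylated (U) states with made-up parameters `INIT` and `TRANS`; the source does not state which probability model or which dynamic programming recurrence is used here.

```python
from itertools import product

# Hypothetical first-order Markov chain parameters (not from the source).
INIT = {"M": 0.7, "U": 0.3}
TRANS = {"M": {"M": 0.8, "U": 0.2}, "U": {"M": 0.4, "U": 0.6}}


def chain_prob(states):
    """Probability of a fully determined vector under the assumed chain."""
    p = INIT[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= TRANS[prev][cur]
    return p


def prob_brute_force(vector):
    """Sum over all 2**i completions of the i indeterminate ('I') sites."""
    slots = [i for i, s in enumerate(vector) if s == "I"]
    total = 0.0
    for fill in product("MU", repeat=len(slots)):
        candidate = list(vector)
        for i, s in zip(slots, fill):
            candidate[i] = s
        total += chain_prob(candidate)
    return total


def prob_dynamic(vector):
    """Forward pass with one step per CpG site, so linear in vector length."""
    def allowed(s):
        return "MU" if s == "I" else s

    forward = {s: INIT[s] for s in allowed(vector[0])}
    for site in vector[1:]:
        forward = {
            cur: sum(p * TRANS[prev][cur] for prev, p in forward.items())
            for cur in allowed(site)
        }
    return sum(forward.values())


vector = ["M", "I", "U"]  # the <M1, I2, U3> example
```

Under these assumed parameters both functions return the sum of the probabilities of <M1, M2, U3> and <M1, U2, U3>; the forward pass avoids enumerating the 2^i completions.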
  • the computational burden of calculating probabilities and/or p-value scores may be further reduced by caching at least some calculations.
  • the analytics system may cache, in transitory or persistent memory, calculations of probabilities for possibilities of methylation state vectors (or windows thereof). If other fragments have the same CpG sites, caching the possibility probabilities allows for efficient calculation of p-value scores without needing to re-calculate the underlying possibility probabilities.
  • the analytics system may calculate p-value scores for each of the possibilities of methylation state vectors associated with a set of CpG sites from a vector (or window thereof).
  • the analytics system may cache the p-value scores for use in determining the p-value scores of other fragments including the same CpG sites.
  • the p-value scores of possibilities of methylation state vectors having the same CpG sites may be used to determine the p-value score of a different one of the possibilities from the same set of CpG sites.
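One way to picture the caching step is memoization keyed on the CpG position and the window's states. The sketch below uses Python's `functools.lru_cache` for in-memory caching; the `window_probability` function and its per-site probability of 0.8 are placeholders, not the source's model.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the underlying calculation actually runs


@lru_cache(maxsize=None)
def window_probability(start_site, states):
    """Placeholder probability for a window of states starting at a CpG index."""
    CALLS["count"] += 1
    # Hypothetical model: each site is methylated with probability 0.8.
    p = 1.0
    for s in states:
        p *= 0.8 if s == "M" else 0.2
    return p


first = window_probability(100, ("M", "M", "U"))
second = window_probability(100, ("M", "M", "U"))  # served from the cache
```

The states are passed as a tuple because cache keys must be hashable. A second fragment covering the same CpG sites with the same states reuses the stored value.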
  • FIG. 3 is a flowchart of devices for sequencing nucleic acid samples according to one embodiment.
  • This illustrative flowchart includes devices such as a sequencer 320 and an analytics system 300.
  • the sequencer 320 and the analytics system 300 may work in tandem to perform one or more steps in the sequencing and analytics processes.
  • the sequencer 320 receives an enriched nucleic acid sample 310.
  • the sequencer 320 can include a graphical user interface 325 that enables user interactions with particular tasks (e.g., initiate sequencing or terminate sequencing) as well as one or more loading stations 330 for loading a sequencing cartridge including the enriched fragment samples and/or for loading necessary buffers for performing the sequencing assays. Therefore, once a user of the sequencer 320 has provided the necessary reagents and sequencing cartridge to the loading station 330 of the sequencer 320, the user can initiate sequencing by interacting with the graphical user interface 325 of the sequencer 320. Once initiated, the sequencer 320 performs the sequencing and outputs the sequence reads of the enriched fragments from the nucleic acid sample 310.
  • the sequencer 320 is communicatively coupled with the analytics system 300.
  • the analytics system 300 includes some number of computing devices used for processing the sequence reads for various applications such as assessing methylation status at one or more CpG sites, variant calling or quality control.
  • the sequencer 320 may provide the sequence reads in a BAM file format to the analytics system 300.
  • the analytics system 300 can be communicatively coupled to the sequencer 320 through a wireless, wired, or a combination of wireless and wired communication technologies.
  • the analytics system 300 is configured with a processor and non-transitory computer-readable storage medium storing computer instructions that, when executed by the processor, cause the processor to process the sequence reads or to perform one or more steps of any of the methods or processes disclosed herein.
  • a sequence read is comprised of a read pair denoted as R_1 and R_2.
  • the first read R_1 may be sequenced from a first end of a double-stranded DNA (dsDNA) molecule whereas the second read R_2 may be sequenced from the second end of the double-stranded DNA (dsDNA). Therefore, nucleotide base pairs of the first read R_1 and second read R_2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome.
  • Alignment position information derived from the read pair R_1 and R_2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R_1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R_2).
  • the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds.
  • An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis.
  • FIG. 4 illustrates a process in which unique molecule identifiers (UMIs) are ligated to each fragment in a sample, and then a common sample barcode (SB) is added to all fragments in the sample, which is unique among all samples being processed in a pool, prior to sequencing.
  • UMI 1 415 is ligated onto fragment 1 410
  • UMI 2 425 is ligated onto fragment 2 420.
  • UMI 1 415 and UMI 2 425 are substantially distinct.
  • the diagonal hatched boxes represent nucleotides on the non-ligated end of the UMIs.
  • the diagonal hatched boxes may be primers for binding enzymes to the fragments.
  • the UMIs are ligated onto the 3’ end of the original fragments.
  • Amplification occurs to amplify the original fragments ligated with UMIs.
  • the resulting amplicons may comprise synthetically generated fragments and/or the original fragments.
  • fragment 1 410 is copied resulting in three amplicons, fragment 1A 412, fragment 1B 412, and fragment 1C 412.
  • Each of the amplicons of fragment 1 410 includes UMI 1 415 that was ligated onto the original fragment 1 410.
  • Fragment 2 420 is amplified resulting in two amplicons, fragment 2A 422 and fragment 2B 422, each with UMI 2 425.
  • a sample barcode (SB) is appended to the amplicons.
  • all amplicons receive sample barcode (SB) 405 at the 5’ end, which is opposite the UMIs.
  • the SB 405 may also comprise one or more nucleotides at the non-ligating end to protect the sample barcode.
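The role of the two labels can be sketched in code: the sample barcode assigns a read to its sample, and the UMI collapses amplicons back to the original fragment. The read layout, the `demultiplex` helper, and the label strings are assumptions for illustration; a production pipeline would typically also use alignment positions, since UMIs are only substantially distinct.

```python
from collections import defaultdict


def demultiplex(reads):
    """Group reads by sample barcode, then collapse amplicons sharing a UMI."""
    by_sample = defaultdict(set)
    for read in reads:
        by_sample[read["sb"]].add(read["umi"])
    # Each distinct UMI within a sample stands for one original fragment.
    return {sb: len(umis) for sb, umis in by_sample.items()}


# Reads mirroring the FIG. 4 example: three amplicons of fragment 1, two of fragment 2.
reads = [
    {"sb": "SB405", "umi": "UMI1"},  # fragment 1A
    {"sb": "SB405", "umi": "UMI1"},  # fragment 1B
    {"sb": "SB405", "umi": "UMI1"},  # fragment 1C
    {"sb": "SB405", "umi": "UMI2"},  # fragment 2A
    {"sb": "SB405", "umi": "UMI2"},  # fragment 2B
]
original_fragments = demultiplex(reads)
```

Five amplicon reads collapse to two original fragments for the one sample in this toy input.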
  • two or more samples may be pooled to form a pooled sample for group testing.
  • Disease classifiers can be trained and/or tuned to receive a feature vector for a pooled sample and determine whether the pooled sample contains an individual sample from a test subject that has cancer or, more specifically, a particular cancer type.
  • the cancer classifier comprises a plurality of classification parameters and a function representing a relation between the feature vector as input and the cancer prediction as output determined by the function operating on the input feature vector with the classification parameters.
  • the feature vectors input into the cancer classifier are based on a set of anomalous fragments determined from the pooled sample. The anomalous fragments may be determined via a process as described in Section A2 above.
  • a cancer classifier can be trained by first obtaining a plurality of training samples each having a set of anomalous fragments and a label of a cancer type.
  • the plurality of training samples includes any combination of samples from healthy individuals with a general label of “noncancer,” samples from subjects with a general label of “cancer” or a specific label (e.g., “breast cancer,” “lung cancer,” etc.).
  • the training samples from subjects for one cancer type may be termed a cohort for that cancer type or a cancer type cohort.
  • the analytics system determines, for each training sample, a feature vector based on the set of anomalous fragments of the training sample.
  • the analytics system calculates an anomaly score for each CpG site in an initial set of CpG sites.
  • the initial set of CpG sites may be all CpG sites in the human genome or some portion thereof - which may be on the order of 10^4, 10^5, 10^6, 10^7, 10^8, etc.
  • the analytics system defines the anomaly score for the feature vector with a binary scoring based on whether there is an anomalous fragment in the set of anomalous fragments that encompasses the CpG site.
  • the analytics system defines the anomaly score based on a count of anomalous fragments overlapping the CpG site.
  • the analytics system may use a trinary scoring assigning a first score for lack of presence of anomalous fragments, a second score for presence of a few anomalous fragments, and a third score for presence of more than a few anomalous fragments. For example, the analytics system counts 5 anomalous fragments in a sample that overlap the CpG site and calculates an anomaly score based on the count of 5.
  • the analytics system determines the feature vector as a vector of elements including, for each element, one of the anomaly scores associated with one of the CpG sites in an initial set.
  • the analytics system normalizes the anomaly scores of the feature vector based on a coverage of the sample.
  • coverage refers to a median or average sequencing depth over all CpG sites covered by the initial set of CpG sites used in the classifier, or based on the set of anomalous fragments for a given training sample.
  • FIG. 5 illustrates a matrix of training feature vectors 522.
  • the analytics system has identified CpG sites [K] 526 for consideration in generating feature vectors for the cancer classifier.
  • the analytics system selects training samples [N] 524.
  • the analytics system determines a first anomaly score 528 for a first arbitrary CpG site [k1] to be used in the feature vector for a training sample [n1].
  • the analytics system checks each anomalous fragment in the set of anomalous fragments. If the analytics system identifies at least one anomalous fragment that includes the first CpG site, then the analytics system determines the first anomaly score 528 for the first CpG site as 1, as illustrated in FIG. 5.
  • the analytics system similarly checks the set of anomalous fragments for at least one that includes the second CpG site [k2] . If the analytics system does not find any such anomalous fragment that includes the second CpG site, the analytics system determines a second anomaly score 529 for the second CpG site [k2] to be 0, as illustrated in FIG. 5.
  • the analytics system determines the feature vector for the first training sample [n1] including the anomaly scores with the feature vector including the first anomaly score 528 of 1 for the first CpG site [k1] and the second anomaly score 529 of 0 for the second CpG site [k2] and subsequent anomaly scores, thus forming a feature vector [1, 0, ..
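The binary and count-based scoring schemes can be sketched as below. Fragments are represented as hypothetical (start, end) coordinate pairs on a single chromosome and CpG sites as integer positions; these representations, the helper names, and the coordinates are assumptions for illustration.

```python
def binary_feature_vector(cpg_sites, anomalous_fragments):
    """Score 1 if any anomalous fragment spans the CpG site, else 0."""
    return [
        int(any(start <= site <= end for start, end in anomalous_fragments))
        for site in cpg_sites
    ]


def count_feature_vector(cpg_sites, anomalous_fragments, coverage=None):
    """Count overlapping anomalous fragments, optionally divided by coverage."""
    counts = [
        sum(start <= site <= end for start, end in anomalous_fragments)
        for site in cpg_sites
    ]
    return counts if coverage is None else [c / coverage for c in counts]


cpg_sites = [1050, 2400, 3100]  # hypothetical genomic positions
fragments = [(1000, 1150), (1040, 1200), (3000, 3160)]
binary = binary_feature_vector(cpg_sites, fragments)
counts = count_feature_vector(cpg_sites, fragments)
```

Dividing by a coverage value is only one plausible reading of the normalization step; the source does not give the exact formula.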
  • the analytics system may further limit the CpG sites considered for use in the cancer classifier.
  • the analytics system computes, for each CpG site in the initial set of CpG sites, an information gain based on the feature vectors of the training samples.
  • Each training sample has a feature vector that may contain an anomaly score for all CpG sites in the initial set of CpG sites, which could include up to all CpG sites in the human genome.
  • some CpG sites in the initial set of CpG sites may not be as informative as others in distinguishing between cancer types, or may be duplicative with other CpG sites.
  • the analytics system computes an information gain for each cancer type and for each CpG site in the initial set to determine whether to include that CpG site in the classifier.
  • the information gain is computed for training samples with a given cancer type compared to all other samples.
  • two random variables ‘anomalous fragment’ (‘AF’) and ‘cancer type’ (‘CT’) are used.
  • AF is a binary variable indicating whether there is an anomalous fragment overlapping a given CpG site in a given sample, as determined for the anomaly score / feature vector above.
  • CT is a random variable indicating whether the cancer is of a particular type.
  • the analytics system computes the mutual information with respect to CT given AF.
  • the analytics system computes pairwise mutual information gain against each other cancer type and sums the mutual information gain across all the other cancer types.
  • the analytics system uses this information to rank CpG sites based on how cancer specific they are. This procedure is repeated for all cancer types under consideration. If a particular region is commonly anomalously methylated in training samples of a given cancer but not in training samples of other cancer types or in healthy training samples, then CpG sites overlapped by those anomalous fragments will tend to have high information gains for the given cancer type.
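A minimal sketch of the mutual information between the binary variables AF and CT at a single CpG site follows. It uses the standard empirical definition in bits; the sample data are invented, and the pairwise summation across cancer types mentioned above is not reproduced here.

```python
from math import log2


def mutual_information(af, ct):
    """Empirical mutual information (bits) between two equal-length binary lists."""
    n = len(af)
    mi = 0.0
    for a in (0, 1):
        for c in (0, 1):
            joint = sum(1 for x, y in zip(af, ct) if x == a and y == c) / n
            if joint == 0:
                continue  # zero-probability cells contribute nothing
            pa = sum(1 for x in af if x == a) / n
            pc = sum(1 for y in ct if y == c) / n
            mi += joint * log2(joint / (pa * pc))
    return mi


# AF per sample at one CpG site; CT is 1 for the cancer type of interest.
ct = [1, 1, 1, 1, 0, 0, 0, 0]
informative = mutual_information([1, 1, 1, 1, 0, 0, 0, 0], ct)
uninformative = mutual_information([1, 0, 1, 0, 1, 0, 1, 0], ct)
```

A site whose anomalous fragments track the cancer type perfectly scores 1 bit, while a site that is anomalous equally often in both groups scores 0, which is the ordering the ranking step relies on.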
  • the ranked CpG sites for each cancer type are greedily added (selected) to a selected set of CpG sites based on their rank for use in the cancer classifier.
  • the analytics system may consider other selection criteria for selecting informative CpG sites to be used in the cancer classifier.
  • One selection criterion may be that the selected CpG sites are above a threshold separation from other selected CpG sites.
  • the selected CpG sites are to be over a threshold number of base pairs away from any other selected CpG site (e.g., 100 base pairs), such that CpG sites that are within the threshold separation are not both selected for consideration in the cancer classifier.
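The greedy selection with a separation constraint can be sketched as follows. The sketch assumes all positions lie on one chromosome and that `ranked_sites` is already ordered from highest to lowest information gain; the function name and the coordinates are illustrative.

```python
def select_sites(ranked_sites, max_sites, min_separation=100):
    """Greedily keep top-ranked CpG sites more than min_separation bp apart."""
    selected = []
    for position in ranked_sites:  # highest information gain first
        if all(abs(position - kept) > min_separation for kept in selected):
            selected.append(position)
        if len(selected) == max_sites:
            break
    return selected


ranked = [5000, 5040, 9000, 5100, 9200]  # hypothetical positions in rank order
chosen = select_sites(ranked, max_sites=3)
```

Here 5040 and 5100 are skipped because they are not more than 100 base pairs from the already selected site at 5000.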
  • the analytics system may modify the feature vectors of the training samples as needed. For example, the analytics system may truncate feature vectors to remove anomaly scores corresponding to CpG sites not in the selected set of CpG sites.
  • the analytics system may train the cancer classifier in any of a number of ways.
  • the analytics system trains a binary cancer classifier to distinguish between cancer and non-cancer based on the feature vectors of the training samples.
  • the analytics system uses training samples that include both noncancer samples from healthy individuals and cancer samples from subjects. Each training sample has one of the two labels “cancer” or “non-cancer.”
  • the classifier outputs a cancer prediction indicating the likelihood of the presence or absence of cancer.
  • the analytics system trains a multiclass cancer classifier to distinguish between many cancer types (also referred to as tissue of origin (TOO) labels).
  • Cancer types include one or more cancers and may include a non-cancer type (may also include any additional other diseases or genetic disorders, etc.).
  • the analytics system uses the cancer type cohorts and may or may not include a non-cancer type cohort.
  • the cancer classifier is trained to determine a cancer prediction (or, more specifically, a TOO prediction) that comprises a prediction value for each of the cancer types being classified for.
  • the prediction values may correspond to a likelihood that a given training sample (and during inference, a test sample) has each of the cancer types.
  • the prediction values are scored between 0 and 100, wherein the cumulation of the prediction values equals 100.
  • the cancer classifier returns a cancer prediction including a prediction value for breast cancer, lung cancer, and non-cancer.
  • the classifier can return a cancer prediction that a test sample is 65% likelihood of breast cancer, 25% likelihood of lung cancer, and 10% likelihood of non-cancer.
  • the analytics system may further evaluate the prediction values to generate a prediction of a presence of one or more cancers in the sample, which may also be referred to as a TOO prediction indicating one or more TOO labels, e.g., a first TOO label with the highest prediction value, a second TOO label with the second highest prediction value, etc.
  • the system may determine that the sample has breast cancer given that breast cancer has the highest likelihood.
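The step from prediction values to ranked TOO labels can be sketched as below. The softmax normalization is an assumption chosen because it yields values that total 100; the source does not say how the values are normalized. The 65/25/10 figures are the example from the text.

```python
from math import exp


def prediction_values(scores):
    """Softmax over raw class scores, scaled so the values total 100."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    weights = {label: exp(s - peak) for label, s in scores.items()}
    total = sum(weights.values())
    return {label: 100 * w / total for label, w in weights.items()}


def rank_labels(values):
    """Labels ordered from highest to lowest prediction value."""
    return sorted(values, key=values.get, reverse=True)


values = {"breast cancer": 65.0, "lung cancer": 25.0, "non-cancer": 10.0}
ranking = rank_labels(values)
```

The first entry of `ranking` plays the role of the first TOO label and the second entry the second TOO label.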
  • the analytics system trains the cancer classifier by inputting sets of training samples with their feature vectors into the cancer classifier and adjusting classification parameters so that a function of the classifier accurately relates the training feature vectors to their corresponding label.
  • the analytics system may group the training samples into sets of one or more training samples for iterative batch training of the cancer classifier. After inputting all sets of training samples including their training feature vectors and adjusting the classification parameters, the cancer classifier is sufficiently trained to label test samples according to their feature vector within some margin of error.
  • the analytics system may train the cancer classifier according to any one of a number of methods.
  • the binary cancer classifier may be an L2-regularized logistic regression classifier that is trained using a log-loss function.
  • the multi-cancer classifier may be a multinomial logistic regression.
  • either type of cancer classifier may be trained using other techniques. These techniques are numerous including potential use of kernel methods, random forest classifier, a mixture model, an autoencoder model, machine learning algorithms such as multilayer neural networks, etc.
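To make the binary classifier concrete, the following is a small sketch of L2-regularized logistic regression trained by batch gradient descent on the log-loss. The hyperparameters (`l2`, `lr`, `epochs`) and the toy feature vectors are assumptions; a practical system would more likely rely on an established library than on this hand-written loop.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_logistic(X, y, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient descent on mean log-loss plus an L2 penalty on the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # The L2 term shrinks each weight; the bias is left unregularized.
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the "cancer" label for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy binary feature vectors: the first CpG site is anomalous only in cancer samples.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

After training on this toy data, the weight on the first CpG site is positive and samples with an anomalous fragment at that site receive a probability above 0.5.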
  • the cancer classifiers are trained with individual samples.
  • further tuning can be conducted.
  • the analytics system determines a test feature vector for use by the cancer classifier.
  • the analytics system calculates an anomaly score for each CpG site in a plurality of CpG sites in use by the cancer classifier.
  • the cancer classifier receives as input feature vectors inclusive of anomaly scores for 1,000 selected CpG sites.
  • the analytics system thus determines a test feature vector inclusive of anomaly scores for the 1,000 selected CpG sites based on the set of anomalous fragments.
  • the analytics system calculates the anomaly scores in the same manner as for the training samples.
  • the analytics system defines the anomaly score as a binary score based on whether there is a hypermethylated or hypomethylated fragment in the set of anomalous fragments that encompasses the CpG site.
  • the analytics system then inputs the test feature vector into the cancer classifier.
  • the function of the cancer classifier then generates a cancer prediction probability based on the classification parameters trained in the process and the test feature vector.
  • the cancer prediction has prediction probability values for each of the many cancer types. Therefore, the analytics system may determine that the test sample is most likely to be of one of the cancer types.
  • the cancer prediction may be 60% likelihood of non-cancer and 40% likelihood of cancer.
  • the analytics system can use 50% as a cutoff, and thus determine that the test sample is likely not to have cancer.
  • the cutoff value may need to be adjusted. The adjustment can be helpful for one or more of the following reasons. First, when a plurality of individual samples are pooled, the proportion of a DNA fragment that contributes to one or more features of the classification can be diluted. The dilution can affect the prediction outcome.
  • the sequencing depth for a pooled sample may be reduced as compared to that for an individual sample.
  • Such a reduction of sequencing depth may be relative to a particular sample, i.e., reduced for an individual sample given the dilution.
  • the reduction of sequencing depth in some embodiments, can be absolute, i.e., less sequencing for a given amount of nucleotide fragments.
  • the sequencing depth at the group stage is lower than that at the subsequent individual sample testing stage (as discussed below).
  • the sequencing depth is calculated for each genomic locus in the entire sample (i.e., the pooled sample in the group stage, or individual sample in the individual stage), regardless of which individual sample a particular nucleotide fragment comes from. Such a sequencing depth, therefore, can be conveniently referred to as “pooled sequencing depth.”
  • the pooled sequencing depth at the group stage is at least 1x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 2x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 3x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 4x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 5x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 10x less than the sequencing depth at the individual stage.
  • the pooled sequencing depth at the group stage is at least 15x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 20x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 30x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 50x less than the sequencing depth at the individual stage.
  • the pooled sequencing depth at the group stage is at least 5% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 10% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 15% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 20% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 25% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 30% less than the sequencing depth at the individual stage.
  • the pooled sequencing depth at the group stage is at least 40% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 50% less than the sequencing depth at the individual stage.
  • In some embodiments, the sequencing depth is calculated for each genomic locus from all nucleotide fragments originating from an individual sample (not from all nucleotide fragments at the same locus from all samples in the pooled sample). Such a sequencing depth can be conveniently referred to as “individual sequencing depth.” If a pooled sample includes N individual samples, each at similar amounts, then the individual sequencing depth is about 1/N of the pooled sequencing depth. The individual sequencing depth may be estimated by dividing the pooled sequencing depth by N. Alternatively, the individual sequencing depth can be measured directly since all nucleotide fragments from an individual sample are labeled with a unique sample barcode (SB).
  • the individual sequencing depth at the group stage is at least 1x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 2x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 3x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 4x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 5x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 10x less than the sequencing depth at the individual stage.
  • the individual sequencing depth at the group stage is at least 15x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 20x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 30x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 50x less than the sequencing depth at the individual stage.
  • the individual sequencing depth at the group stage is at least 5% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 10% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 15% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 20% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 25% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 30% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 40% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 50% less than the sequencing depth at the individual stage.
  • the cutoff value determined on the basis of individual sample testing is lowered for the group stage testing. Lowering the cutoff value, in some embodiments, increases the sensitivity of cancer detection in a pooled sample. In some embodiments, the cutoff value may be decreased, as compared to the one for individual samples, by about 5%, 10%, 15%, 20%, 25%, 30%, 35% or 40%, without limitation.
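The effect of a lowered group-stage cutoff can be illustrated with a short sketch. It reads "decreased by about 20%" as a relative reduction, which is an assumption; an absolute reduction in percentage points is an equally possible reading. The helper names and the 0.45 pool score are hypothetical.

```python
def group_cutoff(individual_cutoff, relative_decrease):
    """Lower the individual-stage cutoff by a relative fraction (assumed reading)."""
    return individual_cutoff * (1.0 - relative_decrease)


def pool_is_positive(cancer_probability, cutoff):
    """Flag the pool for individual-stage testing when the score exceeds the cutoff."""
    return cancer_probability > cutoff


cutoff = group_cutoff(0.50, 0.20)            # 0.50 lowered by 20% gives 0.40
flagged = pool_is_positive(0.45, cutoff)     # exceeds the lowered cutoff
missed = pool_is_positive(0.45, 0.50)        # would not be flagged at 0.50
```

A diluted pool scoring 0.45 is sent on to individual testing under the lowered cutoff but would be missed at the individual-stage cutoff, which is the sensitivity gain the text describes.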
  • the adjusted cutoff value can be determined for a particular demographic population. For instance, if all samples in the group are from a particular geographic (e.g., downtown Detroit) area or a particular age/gender group (e.g., retired male autoworkers), then certain cancer statistics of the demographics can be used for adjusting the cutoff value.
  • the overall cancer rate of the demographic group is used for determining the cutoff value.
  • the overall cancer rate may be 1,000 cancer cases per 100,000 people.
  • the cutoff value for such a cancer rate can be calculated and applied to the group. Calculation of a suitable cutoff value for a particular cancer rate, for instance, can be done with one or more simulation experiments, as demonstrated in the Experimental Examples.
  • a suitable cutoff value is determined with actual samples (or actual data) taken from and representing the demographic population. For instance, for downtown Detroit, samples of a number of residents (e.g., 100, 500, or 1,000) may be used to run a simulation test to determine a suitable cutoff that optimizes sensitivity, specificity, and number of runs. In another example, provided that the sequencing data are already available, they can be used for a virtual simulation to determine a suitable cutoff value.
  • the classifier can return a cancer prediction that a group sample is 55% likelihood of breast cancer, 15% likelihood of lung cancer, and 20% likelihood of non-cancer.
  • the system may determine that the sample has breast cancer given that breast cancer has the highest likelihood.
  • a pooled group sample that includes no individual cancer sample is likely classified as non-cancer. In that case, no further testing is required, and the cancer detection for the entire group concludes, with a prediction output that all samples in the group have no cancer. In a case where the predicted probability of having cancer is higher than the cutoff value, a prediction output comes back to indicate that at least an individual sample in the group has cancer.
  • each nucleotide fragment in a pooled sample is ligated to a unique sample barcode (SB).
  • each individual sample in the pooled sample is subjected to another round of analysis, which can be referred to as “individual stage classification.”
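The saving from testing pools first can be estimated with the classical two-stage (Dorfman) group testing formula. The sketch assumes an idealized classifier with perfect sensitivity and specificity and independent samples, which the source does not claim; it is meant only to show the shape of the trade-off between pool size and number of runs.

```python
def expected_runs(pool_size, prevalence):
    """Expected classifier runs per pool under two-stage (Dorfman) testing."""
    p_pool_positive = 1.0 - (1.0 - prevalence) ** pool_size
    # One pooled run, plus one run per member whenever the pool is positive.
    return 1.0 + pool_size * p_pool_positive


def runs_per_person(pool_size, prevalence):
    return expected_runs(pool_size, prevalence) / pool_size


prevalence = 1000 / 100000  # the example rate of 1,000 cases per 100,000 people
per_person = {n: runs_per_person(n, prevalence) for n in (5, 10, 20)}
```

At a 1% cancer rate, each of these pool sizes needs well under one run per person on average, compared with exactly one run per person when every sample is tested individually.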
  • the individual stage classification is similar to the group stage classification in many aspects. For instance, both entail ligation of unique molecule identifiers (UMIs), fragment enrichment, methylation-specific modification and sequencing, and classification.
  • the individual stage classification uses a higher cancer probability threshold (cutoff value) than the group stage classification. For instance, for a classification, a 40% cutoff value may be used at the group stage, and a 50% cutoff value may be used at the individual stage.
  • the second difference, in some embodiments, is in sequencing depth, which is discussed in more detail in the previous section. Overall, the group stage sequencing has a lower sequencing depth, at least with respect to each individual sample, than the individual stage, which contributes to added savings of time and expense.
  • the instant technology retains high specificity (low false positive rates) as compared to the conventional technology.
  • FIG. 6 shows a process 600 that includes two stages of cancer classification, a group stage 610 and an individual stage 620.
  • a sample is obtained from an individual who desires a cancer diagnosis.
  • the sample can be, without limitation, blood, plasma, serum, semen, milk, urine, saliva or cerebral spinal fluid.
  • Each sample includes a plurality of nucleic acid (NA) fragments.
  • the NA fragments are cfDNA fragments.
  • the samples can be treated to modify unmethylated cytosines in the NA fragments (651), to enable methylation detection subsequently.
  • the sample can be treated with bisulfite ion (e.g., using sodium bisulfite) to convert unmethylated cytosines (“C”) to uracils (“U”).
  • the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic conversion reaction, for example, using a cytidine deaminase, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).
  • Each NA fragment from a sample can be ligated to a sample-specific barcode (SB) (653).
  • An example process for SB-labeling is illustrated in FIG. 4.
  • each NA fragment can be further ligated to a “unique molecule identifier,” or “UMI,” and/or enriched.
  • the SB-labeled NA fragments from each sample can then be pooled together to prepare a “pooled sample” (655).
  • the pooled sample can then be subjected to sequencing (657).
  • the sequencing may be Sanger sequencing, fragment analysis, or next-generation sequencing.
  • the sequenced fragments can be aligned to a reference genome (e.g., human reference genome assembly hg19 or hg38). Such alignment can help determine the locus of each sequence read. For methylation detection, the alignment can also help determine the methylation status of each cytosine from the original, unmodified fragments.
  • a sequencing process, in particular deep sequencing, can reach a certain sequencing depth, as described above. Generally, a higher sequencing depth requires more time, may need more sample, and is more expensive.
  • the sequencing depth for each sample within the pooled sample at this group stage can be lower than that in the subsequent individual stage (Di); that is, for a total pooled sequencing depth Dp, Dp/N < Di, where N represents the number of samples in the pooled sample, assuming that each sample has similar amounts of NA fragments.
  • the relative values of Dp and Di are discussed in more detail above in Section C2.
  • Dp is at least 5%, 10%, 20%, 30%, 40%, or 50% lower than N x Di.
  • a cancer probability score is a percentage, such as 40% (or a pair of percentages, such as 60% likely non-cancer and 40% cancer).
  • the cancer probability score obtained in step 661 can be compared to a predetermined cutoff value (Cp) (663).
  • the cutoff value (Cp) at the group stage can be different from that at the subsequent individual stage (Ci).
  • a conventional cancer classification cutoff value is set for individual patient samples. Further adjustment may be needed to apply it to pooled samples. In some embodiments, the adjustment takes as input a cancer incidence rate associated with the individuals. In some embodiments, the Cp is determined with actual samples from individuals representing the demographic of the samples being tested. Further details of these embodiments have been discussed above.
  • If the cancer probability score obtained in step 661 is lower than the pooled cutoff value (Cp), then the system can make a prediction that none of the samples in the pooled sample is from a cancer patient (630). On the other hand, if the cancer probability score obtained in step 661 is greater than the pooled cutoff value (Cp) (640), then the procedure moves to the individual stage (620), in which each sample in the pooled sample is individually tested for its cancer status.
  • the individual stage is to a certain extent similar to the group stage, including an optional cytosine modification step (665), a sequencing step (667) to a sequencing depth of Di, an alignment step (669), a classification step to calculate a cancer probability score (671), and a prediction step with an individual cutoff value Ci (673).
  • the Di and Ci may be different from the Dp and Cp used in the group stage.
  • the sample can be predicted as having cancer.
  • the classification makes a further prediction with respect to the type or subtype of the cancer.
  • the methods, analytic systems and/or classifier of the present invention can be used to detect the presence of cancer, monitor cancer progression or recurrence, monitor therapeutic response or effectiveness, determine the presence of or monitor minimal residual disease (MRD), or any combination thereof.
  • a classifier can be used to generate a probability score (e.g., from 0 to 100) describing a likelihood that a test feature vector is from a subject with cancer.
  • a cancer detection is made as to whether or not the subject has cancer.
  • the cancer detection can be made at multiple different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy).
  • the cancer detection can be used to make or influence a clinical decision (e.g., diagnosis of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, following the cancer detection, a physician can prescribe an appropriate treatment.
  • cancers include, without limitation, retinoblastoma, thecoma, arrhenoblastoma, hematological malignancies, including but not limited to non-Hodgkin's lymphoma (NHL), multiple myeloma and acute hematological malignancies, endometriosis, fibrosarcoma, choriocarcinoma, laryngeal carcinomas, Kaposi's sarcoma, Schwannoma, oligodendroglioma, neuroblastomas, rhabdomyosarcoma, osteogenic sarcoma, leiomyosarcoma, and urinary tract carcinomas.
  • the cancer is one or more of anorectal cancer, bladder cancer, breast cancer, cervical cancer, colorectal cancer, esophageal cancer, gastric cancer, head & neck cancer, hepatobiliary cancer, leukemia, lung cancer, lymphoma, melanoma, multiple myeloma, ovarian cancer, pancreatic cancer, prostate cancer, renal cancer, thyroid cancer, uterine cancer, or any combination thereof.
  • the cancer prediction can be assessed at multiple different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy).
  • the present invention includes methods that involve obtaining a first sample (e.g., a first plasma cfDNA sample) from a cancer patient at a first time point, determining a first cancer prediction therefrom (as described herein), obtaining a second test sample (e.g., a second plasma cfDNA sample) from the cancer patient at a second time point, and determining a second cancer prediction therefrom (as described herein).
  • the first time point is before a cancer treatment (e.g., before a resection surgery or a therapeutic intervention), and the second time point is after a cancer treatment (e.g., after a resection surgery or therapeutic intervention), and the classifier is utilized to monitor the effectiveness of the treatment. For example, if the second cancer prediction decreases compared to the first cancer prediction, then the treatment is considered to have been successful. However, if the second cancer prediction increases compared to the first cancer prediction, then the treatment is considered to have not been successful. In other embodiments, both the first and second time points are before a cancer treatment (e.g., before a resection surgery or a therapeutic intervention).
  • both the first and the second time points are after a cancer treatment (e.g., after a resection surgery or a therapeutic intervention).
  • cfDNA samples may be obtained from a cancer patient at a first and second time point and analyzed, e.g., to monitor cancer progression, to determine if a cancer is in remission (e.g., after treatment), to monitor or detect residual disease or recurrence of disease, or to monitor treatment (e.g., therapeutic) efficacy.
  • test samples can be obtained from a cancer patient over any desired set of time points and analyzed in accordance with the methods of the invention to monitor a cancer state in the patient.
  • the first and second time points are separated by an amount of time that ranges from about 15 minutes up to about 30 years, such as about 30 minutes, such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or about 24 hours, such as about 1, 2, 3, 4, 5, 10, 15, 20, 25 or about 30 days, or such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 months, or such as about 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5, 19, 19.5, 20, 20.5, 21, 21.5, 22, 22.5, 23, 23.5, 24, 24.5,
  • the cancer prediction can be used to make or influence a clinical decision (e.g., diagnosis of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, if the cancer prediction (e.g., for cancer or for a particular cancer type) exceeds a threshold, a physician can prescribe an appropriate treatment (e.g., a resection surgery, radiation therapy, chemotherapy, and/or immunotherapy).
  • the treatment is one or more cancer therapeutic agents selected from the group consisting of a chemotherapy agent, a targeted cancer therapy agent, a differentiating therapy agent, a hormone therapy agent, and an immunotherapy agent.
  • the treatment can be one or more chemotherapy agents selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, anti-tumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, platinum-based agents and any combination thereof.
  • the treatment is one or more targeted cancer therapy agents selected from the group consisting of signal transduction inhibitors (e.g., tyrosine kinase and growth factor receptor inhibitors), histone deacetylase (HDAC) inhibitors, retinoic receptor agonists, proteasome inhibitors, angiogenesis inhibitors, and monoclonal antibody conjugates.
  • the treatment is one or more differentiating therapy agents including retinoids, such as tretinoin, alitretinoin and bexarotene.
  • the treatment is one or more hormone therapy agents selected from the group consisting of anti-estrogens, aromatase inhibitors, progestins, estrogens, anti-androgens, and GnRH agonists or analogs.
  • the treatment is one or more immunotherapy agents selected from the group comprising monoclonal antibody therapies such as rituximab (RITUXAN) and alemtuzumab (CAMPATH), non-specific immunotherapies and adjuvants, such as BCG, interleukin-2 (IL-2), and interferon-alfa, immunomodulating drugs, for instance, thalidomide and lenalidomide (REVLIMID).
  • It is within the capabilities of a skilled physician or oncologist to select an appropriate cancer therapeutic agent based on characteristics such as the type of tumor, cancer stage, previous exposure to cancer treatment or therapeutic agent, and other characteristics of
  • the group size was set at 2, and the cancer rate was set at 1% (0.01).
  • 50 groups were generated, and only one of the 100 samples had cancer.
  • the sensitivity was 54.20% and the specificity was 99.37%.
  • a probability of cancer given positive result (PPV) was then calculated as shown below:
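The calculation itself did not survive in this extract. As a hedged sketch only, the standard Bayes-rule form of the positive predictive value can be evaluated from the sensitivity and specificity reported above. Treating the 1% cancer rate as the prevalence is an assumption, and the function name `ppv` is illustrative rather than taken from the source.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value, P(cancer | positive result), via Bayes' rule."""
    true_pos = sensitivity * prevalence                    # P(positive and cancer)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(positive and no cancer)
    total_pos = true_pos + false_pos
    if total_pos == 0.0:
        raise ValueError("no positive results are possible with these inputs")
    return true_pos / total_pos


# Reported values: sensitivity 54.20%, specificity 99.37%.
# Assumption: prevalence equals the 1% cancer rate used in the simulation.
estimate = ppv(0.5420, 0.9937, 0.01)
```

Under these assumptions the estimate is roughly 46.5%; a different prevalence (for example, a per-group rather than per-sample rate) would change the result.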
  • Embodiments of the invention may also relate to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus.
  • any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
  • any of the steps, operations, or processes described herein as being performed by the analytics system may be performed or implemented with one or more hardware or software modules of the apparatus, alone or in combination with other computing devices.
  • a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Public Health (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Systems and methods are described for detecting a disease in a group of patients. As compared to conventional sequencing-based diagnostic methods, the instant technology can significantly reduce the cost of screening, in particular when the number of samples is large. The patients can be divided into groups, and samples of each group can be pooled to form a pooled sample. A lower-depth, thus cheaper, sequencing can be performed for each pooled sample, which is then subjected to cancer classification, optionally with adjusted classification parameters. If the classification returns a negative result, all samples in the pooled group can be considered to have no cancer. For a pooled group that is determined to include one or more cancer samples, each individual sample in the group can be further tested to identify the cancer samples.

Description

DISEASE CLASSIFICATION WITH GROUP TESTING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority of U.S. Provisional patent application no. 63/623,693, filed on January 22, 2024, the disclosure of which is incorporated herein by this reference in its entirety.
BACKGROUND
[0002] Cell-free DNA (cfDNA) profiling has emerged as a promising tool for early cancer detection, tumor type classification, and treatment response monitoring. Tumor-specific genomic alterations in cfDNA obtained from cancer patients can be detected and used to determine the presence of tumor and the types of the tumor.
[0003] In particular, DNA methylation plays an important role in regulating gene expression. Aberrant DNA methylation has been implicated in many disease processes, including cancer. DNA methylation profiling using methylation sequencing is increasingly recognized as a valuable diagnostic tool for detection, diagnosis, and/or monitoring of cancer. For example, specific patterns of differentially methylated regions and/or allele specific methylation patterns may be useful as molecular markers for non-invasive diagnostics using cfDNA.
[0004] In a given population, however, the vast majority of individuals may not have cancer. Therefore, subjecting all cancer-free individuals to exhaustive cfDNA profiling can be expensive and unproductive. There is a need to improve testing efficiency and reduce cost.
SUMMARY
[0005] Sequencing of DNA fragments in a cfDNA sample can be used to identify features that can be used for disease classification. For example, in cancer assessment, cell-free DNA based features, such as presence or absence of a somatic variant, methylation status, or other genetic aberrations, from a blood sample can provide insight into whether a subject may have cancer, and further insight on what type of cancer the subject may have. In the real world, however, the vast majority of individuals have no cancer. For a non-cancer individual, the classification can be easy as the classification score may be far from a predetermined threshold. Nevertheless, testing of an individual requires a complicated and long lab process that can be time-consuming and expensive.
[0006] The instant disclosure provides a technology that can significantly reduce the cost of screening of a group of patients. The patients can be divided into groups, and samples of each group can be pooled to form a pooled sample. A lower-depth, thus cheaper, sequencing can be performed for each pooled sample, which is then subjected to cancer classification, optionally with adjusted classification parameters. If the classification returns a negative result, all samples in the pooled group can be considered to have no cancer.
[0007] If the classification returns a positive result, then each sample in the pooled group can be further tested with the conventional classification procedure. Optionally, the cfDNA fragments in each sample are labeled with a sample-specific barcode (SB) so that even in the pooled sample, it is easy to tell which sample each cfDNA fragment is from, thereby facilitating subsequent sample-specific analysis.
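The two-stage decision logic described in the paragraphs above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `two_stage_screen`, the `individual_score` callable (standing in for individual re-sequencing and classification), and the default cutoffs of 0.40 and 0.50 (borrowed from the example values given elsewhere in this document) are all assumptions. A pooled score exactly equal to the cutoff is treated here as negative, which the source does not specify.

```python
from typing import Callable, Dict, Sequence


def two_stage_screen(
    pooled_score: float,
    sample_ids: Sequence[str],
    individual_score: Callable[[str], float],
    cp: float = 0.40,  # pooled cutoff Cp (assumed example value)
    ci: float = 0.50,  # individual cutoff Ci (assumed example value)
) -> Dict[str, bool]:
    """Return {sample_id: predicted_cancer} for one pooled group."""
    # Group stage: a pooled score that does not exceed Cp clears every sample.
    if pooled_score <= cp:
        return {sid: False for sid in sample_ids}
    # Individual stage: each sample is scored on its own and compared to Ci.
    return {sid: individual_score(sid) > ci for sid in sample_ids}
```

Note that the individual-stage scorer is only invoked for groups whose pooled score exceeds Cp, which is where the cost saving comes from.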
[0008] In accordance with one embodiment of the present disclosure, therefore, provided is a method for identifying a sample as from a subject having cancer, comprising: (A) pooling nucleic acid (NA) fragments of a number (N) of samples to generate a pooled sample, wherein each sample is from a subject; sequencing the NA fragments in the pooled sample at a pooled sequencing depth Dp; aligning the sequenced NA fragments to a reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to a cancer classification model to obtain a pooled cancer probability score; comparing the pooled cancer probability score to a pooled cutoff value Cp; and when the pooled cancer probability score is greater than the Cp, subjecting each sample to the steps in (B); (B) sequencing the NA fragments in the sample at an individual sequencing depth Di; aligning the sequenced NA fragments to the reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to the cancer classification model to obtain an individual cancer probability score; comparing the individual cancer probability score to an individual cutoff value Ci; and when the individual cancer probability score is greater than the Ci, identifying the sample as from a subject having cancer; wherein the Cp is lower than Ci.
[0009] In some embodiments, the Dp is lower than N x Di. In some embodiments, the Dp is at least 10%, 20% or 30% lower than N x Di.
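To illustrate why a pooled depth Dp below N x Di can reduce overall sequencing effort, the sketch below compares the expected sequencing depth per group under the two-stage scheme with the conventional N x Di. It assumes, for illustration only, that cost scales with total depth and that the probability of a pool testing positive is known; the function names and the numbers in the usage lines are hypothetical and not from the source.

```python
def expected_depth_per_group(n: int, dp: float, di: float, p_pool_positive: float) -> float:
    """Expected total depth for one group: the pooled run, plus N individual
    runs that are only needed when the pool tests positive."""
    return dp + p_pool_positive * n * di


def relative_saving(n: int, dp: float, di: float, p_pool_positive: float) -> float:
    """Fraction of depth saved relative to sequencing every sample at Di."""
    baseline = n * di
    return 1.0 - expected_depth_per_group(n, dp, di, p_pool_positive) / baseline


# Hypothetical numbers: N = 2, Di = 100, Dp = 160 (20% below N x Di),
# and 5% of pools testing positive.
saving = relative_saving(2, 160.0, 100.0, 0.05)
```

With these illustrative inputs the expected depth is 170 against a baseline of 200. The saving shrinks as the fraction of positive pools grows and turns negative once Dp + p x N x Di exceeds N x Di.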
[0010] In some embodiments, the Cp is calculated with a cancer incidence rate associated with the demographic of the subjects. In some embodiments, the demographic comprises one or more selected from the group consisting of geographic location, gender, age, race, medical history, and employment history. In some embodiments, the Cp is calculated with samples of individuals in the demographic.
[0011] In some embodiments, each NA fragment in the pooled sample is ligated to a sample barcode (SB) that is unique to the sample from which the NA fragment is obtained.
[0012] In some embodiments, the method further comprises, in step (A), identifying NA fragments contributing significantly to the pooled cancer probability score, and one or more sample barcodes ligated to the identified NA fragments.
[0013] In some embodiments, the method further comprises, in each of steps (A) and (B), modifying unmethylated cytosine, prior to sequencing. In some embodiments, the method further comprises, in each of steps (A) and (B), identifying the methylation status of one or more of the NA fragments. In some embodiments, the methylation status is conversion of a cytosine to a 5-methylcytosine (5-mC) or to a 5-hydroxymethylcytosine (5-hmC).
[0014] In some embodiments, the modifying unmethylated cytosine is done with bisulfite or enzymatic treatment.
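As a rough in-silico illustration of why modifying unmethylated cytosine enables methylation detection, the sketch below converts unmethylated C to T (uracil is read as thymine after amplification and sequencing) and then calls methylation by comparing the converted read with the reference. The function names, the 0-based position convention, and the assumption of a perfectly aligned, error-free, same-strand read are simplifications introduced here, not details from the source.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated_positions: Set[int]) -> str:
    """Simulate conversion: unmethylated C becomes T; methylated C is protected."""
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")  # unmethylated C -> U, which is read as T
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference: str, converted: str) -> Dict[int, bool]:
    """For each reference cytosine, True if the read still shows C (methylated)."""
    return {
        i: converted[i] == "C"
        for i, base in enumerate(reference.upper())
        if base == "C"
    }


read = bisulfite_convert("ACGTCCG", {1})      # only the C at position 1 is methylated
calls = call_methylation("ACGTCCG", read)
```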
[0015] In some embodiments, the sequencing in each of steps (A) and (B) is deep sequencing. In some embodiments, the step (B) further comprises identifying a tissue origin of the cancer.
[0016] In some embodiments, each sample comprises blood, plasma, serum, semen, milk, urine, saliva or cerebral spinal fluid, acquired from a human subject. In some embodiments, the NA fragments are cell-free DNA fragments.
BRIEF DESCRIPTION OF DRAWINGS
[0017] FIG. 1 is an exemplary flowchart describing an overall workflow of cancer classification of a sample, according to one or more embodiments.
[0018] FIG. 2 is an exemplary flowchart describing a process of sequencing a fragment of cell-free (cf) DNA to obtain a methylation state vector, according to one or more embodiments.
[0019] FIG. 3 illustrates an exemplary flowchart of devices for sequencing and analyzing nucleic acid samples according to one or more embodiments.
[0020] FIG. 4 is an exemplary flowchart describing a process of sample treatment with a sample barcode, according to one or more embodiments.
[0021] FIG. 5 illustrates an example generation of feature vectors used for training the cancer classifier, according to one or more embodiments.
[0022] FIG. 6 is an exemplary flowchart describing a process of group testing followed by individual testing.
[0023] The figures depict various embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
DETAILED DESCRIPTION
DEFINITIONS
[0024] The term “cell free nucleic acid” or “cfNA” refers to nucleic acid fragments that circulate in an individual’s body (e.g., blood) and originate from one or more healthy cells and/or from one or more unhealthy cells (e.g., cancer cells). The term “cell free DNA,” or “cfDNA” refers to deoxyribonucleic acid fragments that circulate in an individual’s body (e.g., blood). Additionally, cfNAs or cfDNA in an individual’s body may come from other non-human sources.
[0025] The term “genomic nucleic acid,” “genomic DNA,” or “gDNA” refers to nucleic acid molecules or deoxyribonucleic acid molecules obtained from one or more cells. In various embodiments, gDNA can be extracted from healthy cells (e.g., non-tumor cells) or from tumor cells (e.g., a biopsy sample). In some embodiments, gDNA can be extracted from a cell derived from a blood cell lineage, such as a white blood cell.
[0026] The term “circulating tumor DNA” or “ctDNA” refers to nucleic acid fragments that originate from tumor cells or other types of cancer cells, and which may be released into a bodily fluid of an individual (e.g., blood, sweat, urine, or saliva) as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
[0027] The term “DNA fragment,” “fragment,” or “DNA molecule” may generally refer to any deoxyribonucleic acid fragments, i.e., cfDNA, gDNA, ctDNA, etc.
[0028] The term “NA fragment,” or “NA molecule” may generally refer to any nucleic acid molecule, including DNA molecules and ribonucleic acid (RNA) molecules.
[0029] The term “amplicon” may generally refer to nucleic acid molecules resulting from an amplification process, i.e., including molecules originating from a sample taken from an individual and/or synthetically generated molecules as copies of original molecules.
[0030] The term “sample barcode” may generally refer to a nucleotide sequence that is assigned to a sample and ligated onto sequence reads, for the purpose of accurate assignment of sequence reads as belonging to the sample.
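One way to picture how a sample barcode supports read assignment in a pooled run is the demultiplexing sketch below. It assumes, purely for illustration, that the barcode is a fixed-length prefix of each read and that only exact matches are accepted; the barcode sequences, sample names, and the `demultiplex` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str], barcode_to_sample: Dict[str, str]) -> Dict[str, List[str]]:
    """Group reads by sample using an exact match on a fixed-length barcode prefix."""
    lengths = {len(bc) for bc in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of one fixed length")
    k = lengths.pop()
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:k], "unassigned")
        by_sample[sample].append(read[k:])  # store the read with the barcode trimmed
    return dict(by_sample)


barcodes = {"AAGG": "S1", "CCTT": "S2"}  # hypothetical barcodes
grouped = demultiplex(["AAGGACGT", "CCTTTTGA", "GGGGAAAA"], barcodes)
```

Production pipelines typically tolerate a small number of barcode mismatches; exact matching is used here only to keep the example short.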
[0031] The term “molecule identifier,” or “MI” may generally refer to a nucleotide sequence that is ligated onto original NA molecules originating from a sample, for the purpose of identifying distinct original NA molecules. The term “unique molecule identifier,” or “UMI” generally refers to a molecule identifier that is substantially unique compared to other UMIs.
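A common use of UMIs is collapsing amplification duplicates so that each original molecule is counted once. The sketch below keeps the first read seen for each (UMI, alignment start) pair. That grouping key, the tuple layout, and the `deduplicate` name are simplifying assumptions made here; the source does not state how duplicates are resolved.

```python
from typing import Iterable, List, Set, Tuple

Read = Tuple[str, int, str]  # (umi, alignment_start, sequence): assumed layout


def deduplicate(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per (UMI, alignment start), preserving input order."""
    seen: Set[Tuple[str, int]] = set()
    unique: List[Read] = []
    for umi, start, seq in reads:
        key = (umi, start)
        if key not in seen:
            seen.add(key)
            unique.append((umi, start, seq))
    return unique


reads = [("AAC", 100, "ACGT"), ("AAC", 100, "ACGT"), ("GGT", 100, "ACGT")]
molecules = deduplicate(reads)  # two distinct original molecules remain
```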
[0032] The term “anomalous fragment,” “anomalously methylated fragment,” or “fragment with an anomalous methylation pattern” refers to a fragment that has anomalous methylation of CpG sites. Anomalous methylation of a fragment may be determined using probabilistic models to identify unexpectedness of observing a fragment’s methylation pattern in a control group.
[0033] The term “unusual fragment with extreme methylation” or “UFXM” refers to a hypomethylated fragment or a hypermethylated fragment. A hypomethylated fragment and a hypermethylated fragment refer to a fragment with at least some number of CpG sites (e.g., 5) that have over some threshold percentage (e.g., 90%) of unmethylation or methylation, respectively.
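The UFXM definition can be expressed as a small predicate. In this sketch a fragment is represented as a sequence of booleans, one per CpG site (True for methylated), which is an assumed encoding; the defaults of 5 sites and 90% are the example values from the definition, and the `classify_ufxm` name is illustrative.

```python
from typing import Optional, Sequence


def classify_ufxm(
    states: Sequence[bool],
    min_sites: int = 5,       # example minimum number of CpG sites
    threshold: float = 0.90,  # example threshold fraction
) -> Optional[str]:
    """Return 'hypermethylated', 'hypomethylated', or None if not a UFXM."""
    if len(states) < min_sites:
        return None
    methylated_fraction = sum(states) / len(states)
    if methylated_fraction > threshold:
        return "hypermethylated"
    if (1.0 - methylated_fraction) > threshold:
        return "hypomethylated"
    return None
```

For example, a fragment covering ten CpG sites that are all methylated is classified as hypermethylated, while a fragment with only four CpG sites is never a UFXM under these defaults.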
[0034] The term “anomaly score” refers to a score for a CpG site based on the number of anomalous fragments (or, in some embodiments, UFXMs) from a sample that overlap that CpG site. The anomaly score is used in the context of featurization of a sample for classification.
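A minimal sketch of the counting step behind an anomaly score follows. Each anomalous fragment is described by the index of its first CpG site and the number of CpG sites it spans, which is an assumed representation. The raw overlap count is returned; any normalization or binarization applied during featurization is not specified in this passage and is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def anomaly_scores(anomalous_fragments: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Count, for each CpG index, how many anomalous fragments overlap it.

    Each fragment is (first_cpg_index, number_of_cpg_sites).
    """
    counts: Counter = Counter()
    for first_cpg, n_cpgs in anomalous_fragments:
        for site in range(first_cpg, first_cpg + n_cpgs):
            counts[site] += 1
    return dict(counts)


scores = anomaly_scores([(10, 3), (11, 2)])  # CpG 11 and 12 are covered twice
```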
[0035] As used herein, the term “about” or “approximately” can mean within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which can depend in part on how the value is measured or determined, e.g., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. “About” can mean a range of ±20%, ±10%, ±5%, or ±1% of a given value. The term “about” or “approximately” can mean within an order of magnitude, within 5-fold, or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed. The term “about” can have the meaning as commonly understood by one of ordinary skill in the art. The term “about” can refer to ±10%. The term “about” can refer to ±5%.
[0036] As used herein, the term “biological sample,” “patient sample,” or “sample” refers to any sample taken from a subject, which can reflect a biological state associated with the subject, and that includes cell-free DNA. Examples of biological samples include, but are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. A biological sample can include any tissue or material derived from a living or dead subject. A biological sample can be a cell-free sample. A biological sample can comprise a nucleic acid (e.g., DNA or RNA) or a fragment thereof. The term “nucleic acid” can refer to deoxyribonucleic acid (DNA), ribonucleic acid (RNA) or any hybrid or fragment thereof. The nucleic acid in the sample can be a cell-free nucleic acid. A sample can be a liquid sample or a solid sample (e.g., a cell or tissue sample). A biological sample can be a bodily fluid, such as blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g., of the testis), vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple, aspiration fluid from different parts of the body (e.g., thyroid, breast), etc. A biological sample can be a stool sample. In various embodiments, the majority of DNA in a biological sample that has been enriched for cell-free DNA (e.g., a plasma sample obtained via a centrifugation protocol) can be cell-free (e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free). A biological sample can be treated to physically disrupt tissue or cell structure (e.g., centrifugation and/or cell lysis), thus releasing intracellular components into a solution which can further contain enzymes, buffers, salts, detergents, and the like which can be used to prepare the sample for analysis.
[0037] As used herein, the terms “control,” “control sample,” “reference,” “reference sample,” “normal,” and “normal sample” describe a sample from a subject that does not have a particular condition, or is otherwise healthy. In an example, a method as disclosed herein can be performed on a subject having a tumor, where the reference sample is a sample taken from a healthy tissue of the subject. A reference sample can be obtained from the subject, or from a database. The reference can be, e.g., a reference genome that is used to map nucleic acid fragment sequences obtained from sequencing a sample from the subject. A reference genome can refer to a haploid or diploid genome to which nucleic acid fragment sequences from the biological sample and a constitutional sample can be aligned and compared. An example of a constitutional sample can be DNA of white blood cells obtained from the subject. For a haploid genome, there can be only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified; each heterozygous locus can have two alleles, where either allele can allow a match for alignment to the locus.
[0038] As used herein, the term “cancer” or “tumor” refers to an abnormal mass of tissue in which the growth of the mass surpasses and is not coordinated with the growth of normal tissue.
[0039] As used herein, the phrase “healthy,” refers to a subject possessing good health. A healthy subject can demonstrate an absence of any malignant or non-malignant disease. A “healthy individual” can have other diseases or conditions, unrelated to the condition being assayed, which can normally not be considered “healthy.”
[0040] As used herein, the term “methylation” refers to a modification of deoxyribonucleic acid (DNA) where a hydrogen atom on the pyrimidine ring of a cytosine base is converted to a methyl group, forming 5-methylcytosine. In particular, methylation tends to occur at dinucleotides of cytosine and guanine referred to herein as “CpG sites.” In other instances, methylation may occur at a cytosine not part of a CpG site or at another nucleotide that’s not cytosine; however, these are rarer occurrences. Anomalous cfDNA methylation can be identified as hypermethylation or hypomethylation, both of which may be indicative of cancer status. DNA methylation anomalies (compared to healthy controls) can cause different effects, which may contribute to cancer. The principles described herein are equally applicable for the detection of methylation in a CpG context and non-CpG context, including non-cytosine methylation. Further, the methylation state vectors may contain elements that are generally vectors of sites where methylation has or has not occurred (even if those sites are not CpG sites specifically).
[0041] As used interchangeably herein, the term “methylation fragment” or “nucleic acid methylation fragment” refers to a sequence of methylation states for each CpG site in a plurality of CpG sites, determined by a methylation sequencing of nucleic acids (e.g., a nucleic acid molecule and/or a nucleic acid fragment). In a methylation fragment, a location and methylation state for each CpG site in the nucleic acid fragment is determined based on the alignment of the sequence reads (e.g., obtained from sequencing of the nucleic acids) to a reference genome. A nucleic acid methylation fragment comprises a methylation state of each CpG site in a plurality of CpG sites (e.g., a methylation state vector), which specifies the location of the nucleic acid fragment in a reference genome (e.g., as specified by the position of the first CpG site in the nucleic acid fragment using a CpG index, or another similar metric) and the number of CpG sites in the nucleic acid fragment. Alignment of a sequence read to a reference genome, based on a methylation sequencing of a nucleic acid molecule, can be performed using a CpG index. As used herein, the term “CpG index” refers to a list of each CpG site in the plurality of CpG sites (e.g., CpG 1, CpG 2, CpG 3, etc.) in a reference genome, such as a human reference genome, which can be in electronic format. The CpG index further comprises a corresponding genomic location, in the corresponding reference genome, for each respective CpG site in the CpG index. Each CpG site in each respective nucleic acid methylation fragment is thus indexed to a specific location in the respective reference genome, which can be determined using the CpG index.
[0042] As used herein, the term “true positive” (TP) refers to a subject having a condition. “True positive” can refer to a subject that has a tumor, a cancer, a pre-cancerous condition (e.g., a pre-cancerous lesion), a localized or a metastasized cancer, or a non-malignant disease. “True positive” can refer to a subject having a condition and is identified as having the condition by an assay or method of the present disclosure. As used herein, the term “true negative” (TN) refers to a subject that does not have a condition or does not have a detectable condition. True negative can refer to a subject that does not have a disease or a detectable disease, such as a tumor, a cancer, a pre-cancerous condition (e.g., a pre-cancerous lesion), a localized or a metastasized cancer, a non-malignant disease, or a subject that is otherwise healthy. True negative can refer to a subject that does not have a condition or does not have a detectable condition, or is identified as not having the condition by an assay or method of the present disclosure.
[0043] As used herein, the term “reference genome” refers to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus that may be used to reference identified sequences from a subject. Exemplary reference genomes used for human subjects as well as many other organisms are provided in the on-line genome browser hosted by the National Center for Biotechnology Information (“NCBI”) or the University of California, Santa Cruz (UCSC). A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. The reference genome can be viewed as a representative example of a species’ set of genes. In some embodiments, a reference genome comprises sequences assigned to chromosomes. Exemplary human reference genomes include but are not limited to NCBI build 34 (UCSC equivalent: hg16), NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent: hg18), GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38).
[0044] As used herein, the term “sequence reads” or “reads” refers to nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”), and sometimes are generated from both ends of nucleic acids (e.g., paired-end reads, double-end reads). In some embodiments, sequence reads (e.g., single-end or paired-end reads) can be generated from one or both strands of a targeted nucleic acid fragment. The length of the sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). In some embodiments, the sequence reads are of a mean, median or average length of about 15 bp to 900 bp long (e.g., about 20 bp, about 25 bp, about 30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about 60 bp, about 65 bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95 bp, about 100 bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500 bp). In some embodiments, the sequence reads are of a mean, median or average length of about 1000 bp, 2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. Illumina parallel sequencing can provide sequence reads that do not vary as much, for example, most of the sequence reads can be smaller than 200 bp. A sequence read (or sequencing read) can refer to sequence information corresponding to a nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence read can correspond to a string of nucleotides (e.g., about 20 to about 150) from part of a nucleic acid fragment, can correspond to a string of nucleotides at one or both ends of a nucleic acid fragment, or can correspond to nucleotides of the entire nucleic acid fragment. A sequence read can be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
[0045] As used herein, the term “sequencing” and the like refers generally to any and all biochemical processes that may be used to determine the order of biological macromolecules such as nucleic acids or proteins. For example, sequencing data can include all or a portion of the nucleotide bases in a nucleic acid molecule such as a DNA fragment.
[0046] As used herein, the term “sequencing depth” is interchangeably used with the term “coverage” and refers to the number of times a locus is covered by a consensus sequence read corresponding to a unique nucleic acid target molecule aligned to the locus; e.g., the sequencing depth is equal to the number of unique nucleic acid target molecules covering the locus. The locus can be as small as a nucleotide, or as large as a chromosome arm, or as large as an entire genome. Sequencing depth can be expressed as “Yx”, e.g., 50x, 100x, etc., where “Y” refers to the number of times a locus is covered with a sequence corresponding to a nucleic acid target; e.g., the number of times independent sequence information is obtained covering the particular locus. In some embodiments, the sequencing depth corresponds to the number of genomes that have been sequenced. Sequencing depth can also be applied to multiple loci, or the whole genome, in which case Y can refer to the mean or average number of times a locus or a haploid genome, or a whole genome, respectively, is sequenced. When a mean depth is quoted, the actual depth for different loci included in the dataset can span over a range of values. Ultra-deep sequencing can refer to at least 100x in sequencing depth at a locus.
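By way of non-limiting illustration, the depth definition above can be sketched in Python as a count of unique fragments covering a locus. The function names, the half-open interval convention, and the example coordinates are assumptions made for this sketch and are not part of the disclosure; the input is assumed to be already de-duplicated so that each interval represents one unique target molecule.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open: start <= position < end


def depth_at_locus(fragments: Iterable[Interval], locus: int) -> int:
    """Number of unique fragments whose aligned interval covers the locus."""
    return sum(1 for start, end in fragments if start <= locus < end)


def mean_depth(fragments: List[Interval], region_start: int, region_end: int) -> float:
    """Mean per-base depth over the half-open region [region_start, region_end)."""
    total = sum(depth_at_locus(fragments, pos) for pos in range(region_start, region_end))
    return total / (region_end - region_start)


# Hypothetical, already de-duplicated fragments aligned to one chromosome.
unique_fragments = [(100, 250), (120, 280), (200, 360)]
print(depth_at_locus(unique_fragments, 210))   # all three fragments cover 210
print(mean_depth(unique_fragments, 100, 360))  # averaged over 260 positions
```

The per-position loop is chosen for readability; a production pipeline would typically use a sweep over interval endpoints instead.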
[0047] As used herein, the term “bag” refers to a manner of grouping sequence reads together. For example, in demultiplexing, bags may be used to separate sequence reads as belonging to particular samples. As another example, in de-duping, bags may be used to identify sequence reads pertaining to amplicons of the same original DNA fragment in a sample.
[0048] As used herein, the term “sensitivity” or “true positive rate” (TPR) refers to the number of true positives divided by the sum of the number of true positives and false negatives. Sensitivity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly has a condition. For example, sensitivity can characterize the ability of a method to correctly identify the number of subjects within a population having cancer. In another example, sensitivity can characterize the ability of a method to correctly identify the one or more markers indicative of cancer.
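By way of non-limiting illustration, the true positive rate defined above, together with the true negative rate defined in paragraph [0049], can be computed from confusion counts as sketched below. The function names and the example counts are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    denominator = true_positives + false_negatives
    if denominator == 0:
        raise ValueError("no subjects with the condition were evaluated")
    return true_positives / denominator


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    denominator = true_negatives + false_positives
    if denominator == 0:
        raise ValueError("no subjects without the condition were evaluated")
    return true_negatives / denominator


# Hypothetical counts: 50 subjects with cancer, 1000 subjects without.
print(sensitivity(true_positives=45, false_negatives=5))
print(specificity(true_negatives=990, false_positives=10))
```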
[0049] As used herein, the term “specificity” or “true negative rate” (TNR) refers to the number of true negatives divided by the sum of the number of true negatives and false positives. Specificity can characterize the ability of an assay or method to correctly identify a proportion of the population that truly does not have a condition. For example, specificity can characterize the ability of a method to correctly identify the number of subjects within a population not having cancer. In another example, specificity characterizes the ability of a method to correctly identify one or more markers indicative of cancer.
[0050] As used herein, the term “subject” refers to any living or non-living organism, including but not limited to a human (e.g., a male human, female human, fetus, pregnant female, child, or the like), a non-human animal, a plant, a bacterium, a fungus or a protist. Any human or non-human animal can serve as a subject, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale, and shark. In some embodiments, a subject is a male or female of any stage (e.g., a man, a woman or a child). A subject from whom a sample is taken, or is treated by any of the methods or compositions described herein can be of any age and can be an adult, infant or child.
[0051] As used herein, the term “tissue” can correspond to a group of cells that group together as a functional unit. More than one type of cell can be found in a single tissue. Different types of tissue may consist of different types of cells (e.g., hepatocytes, alveolar cells or blood cells), but also can correspond to tissue from different organisms (mother vs. fetus) or to healthy cells vs. tumor cells. The term “tissue” can generally refer to any group of cells found in the human body (e.g., heart tissue, lung tissue, kidney tissue, nasopharyngeal tissue, oropharyngeal tissue). In some aspects, the term “tissue” or “tissue type” can be used to refer to a tissue from which a cell-free nucleic acid originates. In one example, viral nucleic acid fragments can be derived from blood tissue. In another example, viral nucleic acid fragments can be derived from tumor tissue.
[0052] As used herein, the term “genomic” refers to a characteristic of the genome of an organism. Examples of genomic characteristics include, but are not limited to, those relating to the primary nucleic acid sequence of all or a portion of the genome (e.g., the presence or absence of a nucleotide polymorphism, indel, sequence rearrangement, mutational frequency, etc.), the copy number of one or more particular nucleotide sequences within the genome (e.g., copy number, allele frequency fractions, single chromosome or entire genome ploidy, etc.), the epigenetic status of all or a portion of the genome (e.g., covalent nucleic acid modifications such as methylation, histone modifications, nucleosome positioning, etc.), the expression profile of the organism’s genome (e.g., gene expression levels, isotype expression levels, gene expression ratios, etc.).
[0053] The terminology used herein is for the purpose of describing particular cases only and is not intended to be limiting. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms “including,” “includes,” “having,” “has,” “with,” or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising.”
DISEASE CLASSIFICATION WITH GROUP TESTING
[0054] The present disclosure provides devices and methods for disease diagnosis and classification with improved efficiency. In some embodiments, the method can be viewed as including two stages, a group testing/pre-screening stage and an individual testing/confirmation stage. Both stages, in some embodiments, entail disease classification with sequencing data, which is described in more detail below.
A. General Workflow of Sample Processing and Classification in Each Stage
[0055] FIG. 1 is an exemplary flowchart describing an overall workflow 100 of disease classification (e.g., cancer classification) of a sample, according to one or more embodiments.
[0056] In step 110, a biological sample is collected, for instance, by a healthcare provider. Depending on the stages, the sample may be a biological sample from an individual patient, or a pooled sample with multiple individual samples, without limitation. Example biological samples include, but are not limited to, tissue biopsy, blood, whole blood, plasma, serum, urine, cerebrospinal fluid, fecal, saliva, sweat, tears, pleural fluid, pericardial fluid, or peritoneal fluid of the subject. The sample includes genetic material belonging to the individual, which may be extracted and sequenced for cancer classification. Once the sample is collected, the sample is provided to a sequencing device. Along with the sample, the healthcare provider may collect other information relating to the individual, e.g., biological sex, age, ethnicity, smoking status, any prior diagnoses, etc.
[0057] A sequencing device performs sample sequencing 120. A lab clinician may perform one or more processing steps to the sample in preparation of sequencing. Once prepared, the clinician loads the sample in the sequencing device. The sequencing device generally extracts and isolates fragments of nucleic acid that are sequenced to determine a sequence of nucleobases corresponding to the fragments. Sample sequencing includes sample treatment in preparation for sequencing of the fragments in the sample. Sample treatment may include one or more ligation steps, and amplification of the nucleic acid material.
[0058] In one or more embodiments, the sample treatment includes ligation of a sample barcode. The sample barcode is a polynucleotide sequence that is substantially unique to each sample. The sample barcode is ligated onto each fragment in a sample prior to indexing and sequencing.
[0059] Each polynucleotide sequence in a sample can also be ligated to a unique molecule identifier (UMI). The unique molecule identifiers are also polynucleotide sequences that are ligated onto each fragment originating in the sample, e.g., prior to amplification. The unique molecule identifiers may be utilized in de-duping sequence reads to identify unique fragments originating in the sample.
[0060] Different sequencing processes include Sanger sequencing, fragment analysis, and next-generation sequencing. Sequencing may be whole-genome sequencing or targeted sequencing with a target panel. In the context of DNA methylation, bisulfite sequencing can determine methylation status through bisulfite conversion of unmethylated cytosines at CpG sites. Sample sequencing 120 yields sequences for a plurality of nucleic acid fragments in the sample. In one or more embodiments, the sequences may include methylation state vectors, wherein each methylation state vector describes the methylation statuses for CpG sites on a fragment.
[0061] For methylation-related sequencing, cfDNA fragments from an individual can be first treated, for example by converting unmethylated cytosines to uracils, prior to sequencing. The sequence reads can then be compared to a reference genome to identify the methylation states at specific CpG sites within the DNA fragments. Each CpG site may be methylated or unmethylated. Identification of anomalously methylated fragments, in comparison to healthy individuals, may provide insight into a subject’s cancer status. As is well known in the art, DNA methylation anomalies (compared to healthy controls) can cause different effects, which may contribute to cancer. Various challenges arise in the identification of anomalously methylated cfDNA fragments.
[0062] Methylation can typically occur in deoxyribonucleic acid (DNA) when a hydrogen atom on the pyrimidine ring of a cytosine base is converted to a methyl group, forming 5-methylcytosine. In particular, methylation can occur at dinucleotides of cytosine and guanine referred to herein as “CpG sites”. In other instances, methylation may occur at a cytosine not part of a CpG site or at another nucleotide that is not cytosine; however, these are rarer occurrences. Anomalous DNA methylation can be identified as hypermethylation or hypomethylation, both of which may be indicative of cancer status. Hypermethylation and hypomethylation can be characterized for a DNA fragment, if the DNA fragment comprises more than a threshold number of CpG sites with more than a threshold percentage of those CpG sites being methylated or unmethylated.
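By way of non-limiting illustration, the two-threshold characterization of hypermethylation and hypomethylation described above can be sketched in Python as follows. The threshold values (more than 5 CpG sites, more than 80% of sites in one state), the function name, and the returned labels are assumptions chosen for this sketch; the disclosure does not fix particular values here.

```python
from typing import Sequence


def classify_fragment(
    states: Sequence[str],
    min_cpgs: int = 5,            # assumed threshold number of CpG sites
    min_fraction: float = 0.8,    # assumed threshold fraction of sites
) -> str:
    """Label a fragment from its per-CpG states ('M' methylated, 'U' unmethylated)."""
    n_sites = len(states)
    # The fragment must contain MORE than the threshold number of CpG sites.
    if n_sites <= min_cpgs:
        return "uninformative"
    methylated = sum(1 for s in states if s == "M") / n_sites
    unmethylated = sum(1 for s in states if s == "U") / n_sites
    if methylated > min_fraction:
        return "hypermethylated"
    if unmethylated > min_fraction:
        return "hypomethylated"
    return "neither"


print(classify_fragment("MMMMMMU"))  # 6 of 7 sites methylated
print(classify_fragment("UUUUUU"))   # all 6 sites unmethylated
print(classify_fragment("MUMUMU"))   # mixed states
print(classify_fragment("MMM"))      # too few CpG sites
```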
[0063] The principles described herein can be equally applicable for the detection of methylation in a non-CpG context, including non-cytosine methylation. In such embodiments, the wet laboratory assay used to detect methylation may vary from those described herein. Further, the methylation state vectors discussed herein may contain elements that are generally sites where methylation has or has not occurred (even if those sites are not CpG sites specifically). With that substitution, the remainder of the processes described herein can be the same, and consequently the inventive concepts described herein can be applicable to those other forms of methylation.
[0064] An analytics system performs pre-analysis processing 130. Pre-analysis processing 130 may include, but is not limited to, demultiplexing, de-duplication of sequence reads, determining metrics relating to coverage, identification of contamination events, determining whether the sample is contaminated, remedial measures for contamination events, calling sequencing errors, performing remedial measures, etc. As a result of the pre-analysis processing 130, the analytics system collects a set of sequence reads pertaining to the sample usable for the analyses 140.
[0065] The analytics system performs one or more analyses 140. The analyses are statistical analyses or application of one or more trained models to predict at least a cancer status of the individual from whom the sample is derived. Different genetic features may be evaluated and considered, such as methylation of CpG sites, single nucleotide polymorphisms (SNPs), insertions or deletions (indels), other types of genetic mutation, etc.
[0066] In the context of methylation, analyses 140 may include anomalous methylation identification 142, feature extraction 144, and applying a cancer classifier 146 to determine a cancer prediction. In one or more embodiments of feature extraction, the analytics system may utilize one or more age covariate prediction models to generate one or more age covariate residuals as features for cancer classification. The cancer classifier 146 inputs the extracted features to determine a cancer prediction. The cancer prediction may be a label or a value. The label may indicate a particular cancer state, e.g., binary labels can indicate presence or absence of cancer, multiclass labels can indicate one or more cancer types from a plurality of cancer types that are screened for. The value may indicate a likelihood of a particular cancer state, e.g., a likelihood of cancer, and/or a likelihood of a particular cancer type.
[0067] The analytics system returns the prediction 150. The prediction, for an individual sample obtained from a single patient, may be whether the patient has cancer. For a pooled sample that includes samples from multiple patients, the prediction may be whether at least one of the patients has cancer, suggesting that further testing of individual samples is needed.
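By way of non-limiting illustration, the two-stage logic described above (a pooled prediction followed by individual testing only for pools predicted positive) can be sketched in Python as follows. The function name, the callable interfaces, and the toy stand-in classifiers are hypothetical; in practice each callable would wrap the sequencing and classification workflow 100, and effects such as signal dilution in a pool are not modeled here.

```python
from typing import Callable, Dict, Sequence, Tuple


def two_stage_screen(
    pools: Sequence[Sequence[str]],
    pool_is_positive: Callable[[Sequence[str]], bool],
    individual_is_positive: Callable[[str], bool],
) -> Tuple[Dict[str, bool], int]:
    """Return per-sample predictions and the number of assays that were run."""
    results: Dict[str, bool] = {}
    assays_run = 0
    for pool in pools:
        assays_run += 1  # stage 1: one assay on the pooled sample
        if pool_is_positive(pool):
            # Stage 2: at least one member may have cancer, so test each individually.
            for sample_id in pool:
                assays_run += 1
                results[sample_id] = individual_is_positive(sample_id)
        else:
            # A negative pool clears every member without further testing.
            for sample_id in pool:
                results[sample_id] = False
    return results, assays_run


# Toy stand-ins for the classifiers: a fixed set of truly positive samples.
truly_positive = {"s3"}
pools = [["s1", "s2"], ["s3", "s4"], ["s5", "s6"]]
results, assays_run = two_stage_screen(
    pools,
    pool_is_positive=lambda pool: any(s in truly_positive for s in pool),
    individual_is_positive=lambda s: s in truly_positive,
)
print(results)
print(assays_run)  # 3 pooled assays + 2 individual assays, versus 6 individual assays
```

The saving grows as prevalence falls, since most pools are then cleared by a single pooled assay.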
A1. Generation of Methylation State Vectors
[0068] For methylation analysis, the DNA fragments can be treated prior to the sequencing to convert unmethylated cytosines to uracils. In one embodiment, the method uses a bisulfite treatment of the DNA which converts the unmethylated cytosines to uracils without converting the methylated cytosines. For example, a commercial kit such as the EZ DNA Methylation™ - Gold, EZ DNA Methylation™ - Direct or an EZ DNA Methylation™ - Lightning kit (available from Zymo Research Corp (Irvine, CA)) is used for the bisulfite conversion. In another embodiment, the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic reaction. For example, the conversion can use a commercially available kit for conversion of unmethylated cytosines to uracils, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).
[0069] From the converted DNA fragments, a sequencing library can be prepared. During library preparation, unique molecular identifiers (UMI) can be added to the nucleic acid molecules (e.g., DNA molecules) through adapter ligation. The UMIs can be short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments (e.g., DNA molecules fragmented by physical shearing, enzymatic digestion, and/or chemical fragmentation) during adapter ligation. UMIs can be degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment. During PCR amplification following adapter ligation, the UMIs can be replicated along with the attached DNA fragment. This can provide a way to identify sequence reads that came from the same original fragment in downstream analysis.
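By way of non-limiting illustration, the use of UMIs to identify sequence reads that came from the same original fragment can be sketched in Python as follows. The `Read` record, the grouping key (exact UMI plus aligned start and end positions), and the example reads are assumptions made for this sketch; practical pipelines often also tolerate sequencing errors within the UMI, which is omitted here.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    umi: str       # UMI sequence ligated before amplification
    start: int     # aligned start position in the reference
    end: int       # aligned end position in the reference
    sequence: str  # read sequence


def group_by_original_fragment(reads: List[Read]) -> Dict[Tuple[str, int, int], List[Read]]:
    """Group reads sharing a UMI and alignment span; each group is one original fragment."""
    groups: Dict[Tuple[str, int, int], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.umi, read.start, read.end)].append(read)
    return dict(groups)


reads = [
    Read("ACGTAC", 1000, 1150, "TTGA"),
    Read("ACGTAC", 1000, 1150, "TTGA"),  # PCR duplicate of the first read
    Read("GGTCAA", 1000, 1150, "TTGA"),  # same span, different UMI: distinct molecule
    Read("ACGTAC", 2000, 2140, "CCAT"),  # same UMI, different span: distinct molecule
]
groups = group_by_original_fragment(reads)
print(len(groups))  # number of unique original fragments
```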
[0070] Optionally, the sequencing library may be enriched for DNA fragments, or genomic regions, that are informative for cancer status using a plurality of hybridization probes. The hybridization probes are short oligonucleotides capable of hybridizing to particularly specified DNA fragments, or targeted regions, and enriching for those fragments or regions for subsequent sequencing and analysis. Hybridization probes may be used to perform a targeted, high-depth analysis of a set of specified CpG sites of interest to the researcher. Hybridization probes can be tiled across one or more target sequences at a coverage of 1X, 2X, 3X, 4X, 5X, 6X, 7X, 8X, 9X, 10X, or more than 10X. For example, hybridization probes tiled at a coverage of 2X comprise overlapping probes such that each portion of the target sequence is hybridized to 2 independent probes. Hybridization probes can be tiled across one or more target sequences at a coverage of less than 1X.
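By way of non-limiting illustration, tiling at an integer coverage can be sketched in Python by offsetting successive probes by the probe length divided by the coverage. The function name, the 120-base probe length, the coordinates, and the choice to let probes extend into the flanking sequence so that the target ends are also fully covered are assumptions made for this sketch; coverage of less than 1X is not handled.

```python
from typing import List, Tuple


def tile_probes(
    target_start: int, target_end: int, probe_len: int, coverage: int
) -> List[Tuple[int, int]]:
    """Return half-open probe intervals so each target base lies in `coverage` probes."""
    if coverage < 1 or probe_len % coverage != 0:
        raise ValueError("probe_len must be a positive multiple of coverage")
    step = probe_len // coverage
    # Start upstream of the target so the first target base is already covered
    # by `coverage` probes (assumes flanking sequence is available).
    first_start = target_start - (probe_len - step)
    return [(s, s + probe_len) for s in range(first_start, target_end, step)]


probes = tile_probes(target_start=1000, target_end=1240, probe_len=120, coverage=2)
for probe in probes:
    print(probe)
```

With these example values successive probes are offset by 60 bases, so any target position falls inside exactly two probes.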
[0071] In one embodiment, the hybridization probes are designed to enrich for DNA molecules that have been treated (e.g., using bisulfite) for conversion of unmethylated cytosines to uracils. During enrichment, hybridization probes (also referred to herein as “probes”) can be used to target and pull down nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer class or tissue of origin). The probes may be designed to anneal (or hybridize) to a target (complementary) strand of DNA. The target strand may be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand. The probes may range in length from 10s to 100s to 1000s of base pairs. The probes can be designed based on a methylation site panel. The probes can be designed based on a panel of targeted genes to analyze particular mutations or target regions of the genome (e.g., of the human or another organism) that are suspected to correspond to certain cancers or other types of diseases. Moreover, the probes may cover overlapping portions of a target region.
[0072] Once prepared, the sequencing library or a portion thereof can be sequenced to obtain a plurality of sequence reads. One or more alternative sequencing methods can be used for obtaining sequence reads from nucleic acids in a biological sample.
The one or more sequencing methods can comprise any form of sequencing that can be used to obtain a number of sequence reads measured from nucleic acids (e.g., cell-free nucleic acids), including, but not limited to, high-throughput sequencing systems such as the Roche 454 platform, the Applied Biosystems SOLID platform, the Helicos True Single Molecule DNA sequencing technology, the sequencing-by-hybridization platform from Affymetrix Inc., the single-molecule, real-time (SMRT) technology of Pacific Biosciences, the sequencing-by-synthesis platforms from 454 Life Sciences, Illumina/Solexa and Helicos Biosciences, and the sequencing-by-ligation platform from Applied Biosystems. The ION TORRENT technology from Life Technologies and Nanopore sequencing can also be used to obtain sequence reads from the nucleic acids (e.g., cell-free nucleic acids) in the biological sample. Sequencing-by-synthesis and reversible terminator-based sequencing (e.g., Illumina’s Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ 4500 (Illumina, San Diego Calif.)) can be used to obtain sequence reads from the cell-free nucleic acid obtained from a biological sample of a training subject in order to form the genotypic dataset. Millions of cell-free nucleic acid (e.g., DNA) fragments can be sequenced in parallel. In one example of this type of sequencing technology, a flow cell is used that contains an optically transparent slide with eight individual lanes on the surfaces of which are bound oligonucleotide anchors (e.g., adaptor primers). A cell-free nucleic acid sample can include a signal or tag that facilitates detection. The acquisition of sequence reads from the cell-free nucleic acid obtained from the biological sample can include obtaining quantification information of the signal or tag via a variety of techniques such as, for example, flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, gene-chip analysis, microarray, mass spectrometry, cytofluorimetric analysis, fluorescence microscopy, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, sequencing, and combinations thereof.
[0073] The one or more sequencing methods can include a whole-genome sequencing assay. A whole-genome sequencing assay can comprise a physical assay that generates sequence reads for a whole genome or a substantial portion of the whole genome which can be used to determine large variations such as copy number variations or copy number aberrations. Such a physical assay may employ whole-genome sequencing techniques or whole-exome sequencing techniques. A whole-genome sequencing assay can have an average sequencing depth of at least 1x, 2x, 3x, 4x, 5x, 6x, 7x, 8x, 9x, 10x, at least 20x, at least 30x, or at least 40x across the genome of the test subject. In some embodiments, the sequencing depth is about 30,000x. The one or more sequencing methods can comprise a targeted panel sequencing assay. A targeted panel sequencing assay can have an average sequencing depth of at least 50,000x, at least 55,000x, at least 60,000x, or at least 70,000x sequencing depth for the targeted panel of genes. The targeted panel of genes can comprise between 450 and 500 genes. The targeted panel of genes can comprise a range of 500±5 genes, a range of 500±10 genes, or a range of 500±25 genes.
[0074] The one or more sequencing methods can include paired-end sequencing. The one or more sequencing methods can generate a plurality of sequence reads. The plurality of sequence reads can have an average length ranging between 10 and 700, between 50 and 400, or between 100 and 300. The one or more sequencing methods can comprise a methylation sequencing assay. The methylation sequencing can be i) whole-genome methylation sequencing or ii) targeted DNA methylation sequencing using a plurality of nucleic acid probes. For example, the methylation sequencing is whole-genome bisulfite sequencing (e.g., WGBS). The methylation sequencing can be a targeted DNA methylation sequencing using a plurality of nucleic acid probes targeting the most informative regions of the methylome, a unique methylation database and prior prototype whole-genome and targeted sequencing assays.
[0075] The methylation sequencing can detect one or more 5-methylcytosine (5mC) and/or 5-hydroxymethylcytosine (5hmC) in respective nucleic acid methylation fragments. The methylation sequencing can comprise conversion of one or more unmethylated cytosines or one or more methylated cytosines, in respective nucleic acid methylation fragments, to a corresponding one or more uracils. The one or more uracils can be detected during the methylation sequencing as one or more corresponding thymines. The conversion of one or more unmethylated cytosines or one or more methylated cytosines can comprise a chemical conversion, an enzymatic conversion, or combinations thereof.
[0076] For example, bisulfite conversion involves converting unmethylated cytosine to uracil while leaving methylated cytosines (e.g., 5-methylcytosine or 5-mC) intact. In some DNA, about 95% of cytosines may not be methylated, and the resulting DNA fragments may include many uracils which are represented by thymines. Enzymatic conversion processes may be used to treat the nucleic acids prior to sequencing, which can be performed in various ways. One example of a bisulfite-free conversion comprises a bisulfite-free and base-resolution sequencing method, TET-assisted pyridine borane sequencing (TAPS), for non-destructive and direct detection of 5-methylcytosine and 5-hydroxymethylcytosine without affecting unmodified cytosines. The methylation state of a CpG site in the corresponding plurality of CpG sites in the respective nucleic acid methylation fragment can be methylated when the CpG site is determined by the methylation sequencing to be methylated, and unmethylated when the CpG site is determined by the methylation sequencing to not be methylated.
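By way of non-limiting illustration, reading methylation states from a bisulfite-converted read can be sketched in Python as follows. The sketch assumes a read that is already aligned without gaps to the forward strand of the reference, complete conversion of unmethylated cytosines, and the M/U/I state labels; the function name and sequences are hypothetical. For chemistries that instead convert the methylated cytosine (such as TAPS), the C and T branches would be swapped.

```python
from typing import List, Tuple


def call_cpg_states(reference: str, read: str) -> List[Tuple[int, str]]:
    """Return (offset, state) for each reference CpG covered by the aligned read.

    After bisulfite conversion, a C read at a CpG cytosine indicates a
    methylated site ('M'), a T indicates an unmethylated site ('U'), and any
    other base is treated as indeterminate ('I').
    """
    if len(reference) != len(read):
        raise ValueError("this sketch assumes an ungapped, equal-length alignment")
    states = []
    for offset in range(len(reference) - 1):
        if reference[offset:offset + 2] != "CG":
            continue
        base = read[offset]
        if base == "C":
            states.append((offset, "M"))
        elif base == "T":
            states.append((offset, "U"))
        else:
            states.append((offset, "I"))
    return states


reference = "ACGTTCGACGA"  # CpG cytosines at offsets 1, 5 and 8
read = "ACGTTTGANGA"       # C retained, C converted to T, and an uncalled base
print(call_cpg_states(reference, read))
```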
[0077] A methylation sequencing assay (e.g., WGBS and/or targeted methylation sequencing) can have an average sequencing depth including but not limited to up to about 1,000x, 2,000x, 3,000x, 5,000x, 10,000x, 15,000x, 20,000x, or 30,000x. The methylation sequencing can have a sequencing depth that is greater than 30,000x, e.g., at least 40,000x or 50,000x. A whole-genome bisulfite sequencing method can have an average sequencing depth of between 20x and 50x, and a targeted methylation sequencing method can have an average effective depth of between 100x and 1000x, where effective depth can be the equivalent whole-genome bisulfite sequencing coverage for obtaining the same number of sequence reads obtained by targeted methylation sequencing.
[0078] The methylation sequencing of nucleic acids and the resulting one or more methylation state vectors can be used to obtain a plurality of nucleic acid methylation fragments. Each corresponding plurality of nucleic acid methylation fragments (e.g., for each respective genotypic dataset) can comprise more than 100 nucleic acid methylation fragments. An average number of nucleic acid methylation fragments across each corresponding plurality of nucleic acid methylation fragments can comprise 1000 or more nucleic acid methylation fragments, 5000 or more nucleic acid methylation fragments, 10,000 or more nucleic acid methylation fragments, 20,000 or more nucleic acid methylation fragments, or 30,000 or more nucleic acid methylation fragments. An average number of nucleic acid methylation fragments across each corresponding plurality of nucleic acid methylation fragments can be between 10,000 nucleic acid methylation fragments and 50,000 nucleic acid methylation fragments. The corresponding plurality of nucleic acid methylation fragments can comprise one thousand or more, ten thousand or more, 100 thousand or more, one million or more, ten million or more, 100 million or more, 500 million or more, one billion or more, two billion or more, three billion or more, four billion or more, five billion or more, six billion or more, seven billion or more, eight billion or more, nine billion or more, or 10 billion or more nucleic acid methylation fragments. An average length of a corresponding plurality of nucleic acid methylation fragments can be between 140 and 480 nucleotides.
[0079] The sequence reads may be in a computer-readable, digital format for processing and interpretation by computer software. The sequence reads may be aligned to a reference genome to determine alignment position information. The alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read. Alignment position information may also include sequence read length, which can be determined from the beginning position and end position. A region in the reference genome may be associated with a gene or a segment of a gene. A sequence read can be comprised of a read pair denoted as R1 and R2. For example, the first read R1 may be sequenced from a first end of a nucleic acid fragment whereas the second read R2 may be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read R1 and second read R2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R1 and R2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R2). In other words, the beginning position and end position in the reference genome can represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis such as methylation state determination.
[0080] From the sequence reads, the analytics system determines a location and methylation state for each CpG site based on alignment to a reference genome. The analytics system generates a methylation state vector for each fragment specifying a location of the fragment in the reference genome (e.g., as specified by the position of the first CpG site in each fragment, or another similar metric), a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment whether methylated (e.g., denoted as M), unmethylated (e.g., denoted as U), or indeterminate (e.g., denoted as I). Observed states can be states of methylated and unmethylated; whereas, an unobserved state is indeterminate. Indeterminate methylation states may originate from sequencing errors and/or disagreements between methylation states of a DNA fragment's complementary strands. The methylation state vectors may be stored in temporary or persistent computer memory for later use and processing. Further, the analytics system may remove duplicate reads or duplicate methylation state vectors from a single sample. The analytics system may determine that a certain fragment with one or more CpG sites has an indeterminate methylation status over a threshold number or percentage, and may exclude such fragments or selectively include such fragments but build a model accounting for such indeterminate methylation statuses.
[0081] FIG. 2 is an exemplary illustration of methylation sequencing a cfDNA molecule to obtain a methylation state vector, according to one or more embodiments. As an example, the analytics system receives a cfDNA molecule 242 that, in this example, contains three CpG sites. As shown, the first and third CpG sites of the cfDNA molecule 242 are methylated 244. During the treatment step 250, the cfDNA molecule 242 is converted to generate a converted cfDNA molecule 252. During the treatment 250, the second CpG site which was unmethylated has its cytosine converted to uracil. However, the first and third CpG sites were not converted.
[0082] After conversion, a sequencing library is prepared and the molecule sequenced 260 to generate a sequence read 262. The analytics system aligns the sequence read 262 to a reference genome 264. The reference genome 264 provides the context as to what position in a human genome the fragment cfDNA originates from. In this simplified example, the analytics system aligns 270 the sequence read 262 such that the three CpG sites correlate to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for convenience of description). The analytics system can thus generate information both on methylation status of all CpG sites on the cfDNA molecule 242 and the position in the human genome that the CpG sites map to. As shown, the CpG sites on sequence read 262 which are methylated are read as cytosines. In this example, the cytosines appear in the sequence read 262 only in the first and third CpG sites, which allows one to infer that the first and third CpG sites in the original cfDNA molecule are methylated. By contrast, the second CpG site is read as a thymine (U is converted to T during the sequencing process), and thus one can infer that the second CpG site is unmethylated in the original cfDNA molecule. With these two pieces of information, the methylation status and location, the analytics system generates 270 a methylation state vector 272 for the fragment cfDNA 242. In this example, the resulting methylation state vector 272 is < M_23, U_24, M_25 >, wherein M corresponds to a methylated CpG site, U corresponds to an unmethylated CpG site, and the subscript number corresponds to a position of each CpG site in the reference genome.
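The inference described for FIG. 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy reference and read strings, and the gapless alignment are hypothetical and are not taken from the disclosure. A cytosine retained at a CpG site is called methylated, a thymine is called unmethylated, and any other base is treated as indeterminate.

```python
def methylation_state_vector(read, reference, cpg_ids):
    """Call a methylation state at each CpG site covered by a converted read.

    read, reference: equal-length strings (a gapless alignment is assumed).
    cpg_ids: maps the 0-based offset of each CpG cytosine to its site identifier.
    """
    vector = []
    for offset in sorted(cpg_ids):
        # The reference must really carry a CpG dinucleotide at this offset.
        if reference[offset:offset + 2] != "CG":
            raise ValueError(f"no CpG in the reference at offset {offset}")
        base = read[offset]
        if base == "C":
            state = "M"   # cytosine survived conversion, so it was methylated
        elif base == "T":
            state = "U"   # C -> U during treatment, read as T: unmethylated
        else:
            state = "I"   # any other base is treated as indeterminate
        vector.append((state, cpg_ids[offset]))
    return vector


reference = "ACGTTCGAACGT"   # toy reference with CpGs at offsets 1, 5 and 9
read = "ACGTTTGAACGT"        # the second CpG cytosine now reads as T
vector = methylation_state_vector(read, reference, {1: 23, 5: 24, 9: 25})
```

With these toy inputs the result corresponds to the vector < M_23, U_24, M_25 > of the example.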
A2. Identifying Anomalous Fragments
[0083] The analytics system determines anomalous fragments for a sample using the sample’s methylation state vectors. For each fragment in a sample, the analytics system determines whether the fragment is an anomalous fragment using the methylation state vector corresponding to the fragment. In one embodiment, the analytics system calculates a p-value score for each methylation state vector describing a probability of observing that methylation state vector or other methylation state vectors even less probable in the healthy control group.
[0084] The analytics system may determine fragments with a methylation state vector having below a threshold p-value score as anomalous fragments. In another embodiment, the analytics system further labels fragments with at least some number of CpG sites that have over some threshold percentage of methylation or unmethylation as hypermethylated and hypomethylated fragments, respectively. A hypermethylated fragment or a hypomethylated fragment may also be referred to as an unusual fragment with extreme methylation (UFXM). In other embodiments, the analytics system may implement various other probabilistic models for determining anomalous fragments. Examples of other probabilistic models include a mixture model, a deep probabilistic model, etc. In some embodiments, the analytics system may use any combination of the processes described below for identifying anomalous fragments. With the identified anomalous fragments, the analytics system may filter the set of methylation state vectors for a sample for use in other processes, e.g., for use in training and deploying a cancer classifier.
[0085] In one embodiment, the analytics system calculates a p-value score for each methylation state vector compared to methylation state vectors from fragments in a healthy control group. The p-value score describes a probability of observing the methylation status matching that methylation state vector or other methylation state vectors even less probable in the healthy control group. In order to determine a DNA fragment to be anomalously methylated, the analytics system uses a healthy control group with a majority of fragments that are normally methylated. When conducting this probabilistic analysis for determining anomalous fragments, the determination holds weight in comparison with the group of control subjects that make up the healthy control group. To ensure robustness in the healthy control group, the analytics system may select some threshold number of healthy individuals to source samples including DNA fragments.
[0086] To create a healthy control group data structure, the analytics system receives a plurality of DNA fragments (e.g., cfDNA) from a plurality of healthy individuals. A methylation state vector is identified for each fragment.
[0087] With each fragment’s methylation state vector, the analytics system subdivides the methylation state vector into strings of CpG sites. In one embodiment, the analytics system subdivides the methylation state vector such that the resulting strings are all less than or equal to a given length. For example, a methylation state vector of length 11 subdivided into strings of length less than or equal to 3 would result in 9 strings of length 3, 10 strings of length 2, and 11 strings of length 1. In another example, a methylation state vector of length 7 subdivided into strings of length less than or equal to 4 would result in 4 strings of length 4, 5 strings of length 3, 6 strings of length 2, and 7 strings of length 1. If a methylation state vector is shorter than or the same length as the specified string length, then the methylation state vector may be converted into a single string containing all of the CpG sites of the vector.
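The counts in the two examples above can be checked with a short Python sketch. The function name and the representation of a vector as a string of M/U characters are assumptions made for illustration only.

```python
from collections import Counter


def subdivide(states, max_len):
    """Return every contiguous string of length <= max_len as (start, string).

    states: methylation states of one fragment, e.g. "MMUMU".
    A vector no longer than max_len is kept as a single string.
    """
    if len(states) <= max_len:
        return [(0, states)]
    strings = []
    for length in range(1, max_len + 1):
        for start in range(len(states) - length + 1):
            strings.append((start, states[start:start + length]))
    return strings


# Length-11 vector, strings of length <= 3.
by_length_11 = Counter(len(s) for _, s in subdivide("MMUMUUMMMUM", 3))
# Length-7 vector, strings of length <= 4.
by_length_7 = Counter(len(s) for _, s in subdivide("MUMMUUM", 4))
```

Here `by_length_11` holds 9, 10, and 11 strings of lengths 3, 2, and 1, and `by_length_7` holds 4, 5, 6, and 7 strings of lengths 4, 3, 2, and 1, matching the text.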
[0088] The analytics system tallies the strings by counting, for each possible CpG site and possibility of methylation states in the vector, the number of strings present in the control group having the specified CpG site as the first CpG site in the string and having that possibility of methylation states. For example, at a given CpG site and considering string lengths of 3, there are 2^3 or 8 possible string configurations. At that given CpG site, for each of the 8 possible string configurations, the analytics system tallies how many occurrences of each methylation state vector possibility come up in the control group. Continuing this example, this may involve tallying the following quantities: < M_x, M_{x+1}, M_{x+2} >, < M_x, M_{x+1}, U_{x+2} >, . . ., < U_x, U_{x+1}, U_{x+2} > for each starting CpG site x in the reference genome. The analytics system creates the data structure storing the tallied counts for each starting CpG site and string possibility.
[0089] To identify anomalously methylated fragments from an individual, the analytics system generates methylation state vectors from cfDNA fragments of the subject. For a given methylation state vector, the analytics system enumerates all possibilities of methylation state vectors having the same starting CpG site and same length (i.e., set of CpG sites) in the methylation state vector. As each methylation state is generally either methylated or unmethylated, there are effectively two possible states at each CpG site, and thus the count of distinct possibilities of methylation state vectors depends on a power of 2, such that a methylation state vector of length n would be associated with 2^n possibilities of methylation state vectors. With methylation state vectors inclusive of indeterminate states for one or more CpG sites, the analytics system may enumerate possibilities of methylation state vectors considering only CpG sites that have observed states.
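A minimal Python sketch of the tally and of the enumeration of possibilities follows. The function names, the input layout, and the choice to skip strings containing an indeterminate state are assumptions made for illustration; the disclosure does not fix a particular data structure.

```python
from collections import Counter
from itertools import product


def tally_strings(control_vectors, length):
    """Count fixed-length strings keyed by (starting CpG site, states).

    control_vectors: iterable of (first_cpg_site, states) pairs, where states
    is a string over "M", "U" and "I" for consecutive CpG sites.
    """
    counts = Counter()
    for first_site, states in control_vectors:
        for offset in range(len(states) - length + 1):
            string = states[offset:offset + length]
            if "I" in string:
                continue  # assumption: only fully observed strings are tallied
            counts[(first_site + offset, string)] += 1
    return counts


def enumerate_possibilities(n):
    """All 2^n methylation state vectors over n observed CpG sites."""
    return ["".join(p) for p in product("MU", repeat=n)]


control = [(23, "MUM"), (23, "MUM"), (24, "UMU"), (22, "MMUM")]
counts = tally_strings(control, 3)
possibilities = enumerate_possibilities(3)
```

In this toy control group, the string MUM starting at CpG site 23 is tallied three times (twice from the length-3 fragments and once from inside the length-4 fragment), and `possibilities` contains the 8 configurations mentioned in the text.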
[0090] The analytics system calculates the probability of observing each possibility of methylation state vector for the identified starting CpG site and methylation state vector length by accessing the healthy control group data structure. In one embodiment, calculating the probability of observing a given possibility uses a Markov chain probability to model the joint probability calculation. In other embodiments, calculation methods other than Markov chain probabilities are used to determine the probability of observing each possibility of methylation state vector.
[0091] The analytics system calculates a p-value score for the methylation state vector using the calculated probabilities for each possibility. In one embodiment, this includes identifying the calculated probability corresponding to the possibility that matches the methylation state vector in question. Specifically, this is the possibility having the same set of CpG sites, or similarly the same starting CpG site and length as the methylation state vector. The analytics system sums the calculated probabilities of any possibilities having probabilities less than or equal to the identified probability to generate the p-value score.
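The p-value score can be illustrated with the following Python sketch. It assumes a first-order Markov chain whose initial and transition probabilities are already available; those numbers, the function names, and the small tolerance used to treat ties as equal are hypothetical choices for illustration, since the text does not specify the order of the chain or how its parameters are estimated from the tallied counts.

```python
from itertools import product


def chain_probability(states, initial, transitions):
    """Joint probability of a state string under a first-order Markov chain.

    initial: P(state) at the first CpG site, e.g. {"M": 0.9, "U": 0.1}.
    transitions: one dict per adjacent pair of sites, mapping
    (previous_state, current_state) to P(current | previous).
    """
    prob = initial[states[0]]
    for i in range(1, len(states)):
        prob *= transitions[i - 1][(states[i - 1], states[i])]
    return prob


def p_value(observed, initial, transitions):
    """Sum the probabilities of all possibilities no more likely than observed."""
    probs = {
        "".join(s): chain_probability("".join(s), initial, transitions)
        for s in product("MU", repeat=len(observed))
    }
    cutoff = probs[observed] * (1 + 1e-12)  # tolerance so exact ties count
    return sum(p for p in probs.values() if p <= cutoff)


# Hypothetical parameters for three consecutive CpG sites.
initial = {"M": 0.9, "U": 0.1}
step = {("M", "M"): 0.9, ("M", "U"): 0.1, ("U", "M"): 0.2, ("U", "U"): 0.8}
transitions = [step, step]

common = p_value("MMM", initial, transitions)
rare = p_value("UMU", initial, transitions)
```

With these toy parameters the most probable pattern MMM receives a p-value score of 1, while the least probable pattern UMU receives 0.002, its own probability.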
[0092] This p-value represents the probability of observing the methylation state vector of the fragment or other methylation state vectors even less probable in the healthy control group. A low p-value score, thereby, generally corresponds to a methylation state vector which is rare in a healthy individual, and which causes the fragment to be labeled anomalously methylated, relative to the healthy control group. A high p-value score generally relates to a methylation state vector that is expected to be present, in a relative sense, in a healthy individual. If the healthy control group is a non-cancerous group, for example, a low p-value indicates that the fragment is anomalously methylated relative to the non-cancer group, and therefore possibly indicative of the presence of cancer in the test subject.
[0093] As above, the analytics system calculates p-value scores for each of a plurality of methylation state vectors, each representing a cfDNA fragment in the test sample. To identify which of the fragments are anomalously methylated, the analytics system may filter the set of methylation state vectors based on their p-value scores. In one embodiment, filtering is performed by comparing the p-value scores against a threshold and keeping only those fragments below the threshold. This threshold p-value score could be on the order of 0.1, 0.01, 0.001, 0.0001, or similar.
[0094] According to example results from the process 400, the analytics system yields a median (range) of 2,800 (1,500-12,000) fragments with anomalous methylation patterns for participants without cancer in training, and a median (range) of 3,000 (1,200-220,000) fragments with anomalous methylation patterns for participants with cancer in training.
[0095] In one embodiment, the analytics system uses a sliding window to determine possibilities of methylation state vectors and calculate p-values. Rather than enumerating possibilities and calculating p-values for entire methylation state vectors, the analytics system enumerates possibilities and calculates p-values for only a window of sequential CpG sites, where the window is shorter in length (of CpG sites) than at least some fragments (otherwise, the window would serve no purpose). The window length may be static, user determined, dynamic, or otherwise selected.
[0096] In calculating p-values for a methylation state vector larger than the window, the window identifies the sequential set of CpG sites from the vector within the window starting from the first CpG site in the vector. The analytics system calculates a p-value score for the window including the first CpG site. The analytics system then “slides” the window to the second CpG site in the vector, and calculates another p-value score for the second window. Thus, for a window size l and methylation vector length m, each methylation state vector will generate m-l+1 p-value scores. After completing the p-value calculations for each portion of the vector, the lowest p-value score from all sliding windows is taken as the overall p-value score for the methylation state vector. In another embodiment, the analytics system aggregates the p-value scores for the methylation state vectors to generate an overall p-value score.
[0097] Using the sliding window helps to reduce the number of enumerated possibilities of methylation state vectors and their corresponding probability calculations that would otherwise need to be performed. To give a realistic example, it is possible for fragments to have upwards of 54 CpG sites. Instead of computing probabilities for 2^54 (~1.8x10^16) possibilities to generate a single p-value score, the analytics system can instead use a window of size 5 (for example), which results in 50 p-value calculations, one for each of the 50 windows of the methylation state vector for that fragment. Each of the 50 calculations enumerates 2^5 (32) possibilities of methylation state vectors, which in total results in 50x2^5 (1.6x10^3) probability calculations. This results in a vast reduction of calculations to be performed, with no meaningful hit to the accurate identification of anomalous fragments.
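The sliding-window procedure can be sketched as below. The window scorer is passed in as a function so that the sketch stays independent of any particular probability model; `toy_scorer` and its numbers are purely hypothetical stand-ins, and only the minimum-over-windows embodiment is shown.

```python
def sliding_window_p_value(states, first_site, window, window_p_value):
    """Overall p-value score as the minimum over all windows of `window` sites.

    window_p_value(start_site, window_states) scores a single window.
    A vector no longer than the window is scored once as a whole.
    """
    m = len(states)
    if m <= window:
        return window_p_value(first_site, states)
    scores = [
        window_p_value(first_site + i, states[i:i + window])
        for i in range(m - window + 1)   # m - l + 1 windows
    ]
    return min(scores)


calls = []


def toy_scorer(site, window_states):
    """Hypothetical scorer: halves the score for every unmethylated site."""
    calls.append(site)
    return 0.5 ** window_states.count("U")


overall = sliding_window_p_value("MMMUUUMM", 10, 3, toy_scorer)

# Cost comparison for the 54-site example with a window of 5 sites.
windows = 54 - 5 + 1
windowed_cost = windows * 2 ** 5
full_cost = 2 ** 54
```

For the 8-site toy vector and a window of 3, six windows are scored and the lowest score (from the window UUU) becomes the overall score. The cost variables reproduce the 50 windows and 1,600 probability calculations of the example, against 2^54 for full enumeration.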
[0098] In embodiments with indeterminate states, the analytics system may calculate a p-value score summing out CpG sites with indeterminate states in a fragment’s methylation state vector. The analytics system identifies all possibilities that have consensus with all methylation states of the methylation state vector excluding the indeterminate states. The analytics system may assign the probability to the methylation state vector as a sum of the probabilities of the identified possibilities. As an example, the analytics system calculates a probability of a methylation state vector of < M_1, I_2, U_3 > as a sum of the probabilities for the possibilities of methylation state vectors of < M_1, M_2, U_3 > and < M_1, U_2, U_3 > since methylation states for CpG sites 1 and 3 are observed and in consensus with the fragment’s methylation states at CpG sites 1 and 3. This method of summing out CpG sites with indeterminate states uses calculations of probabilities of possibilities up to 2^i, wherein i denotes the number of indeterminate states in the methylation state vector. In additional embodiments, a dynamic programming algorithm may be implemented to calculate the probability of a methylation state vector with one or more indeterminate states. Advantageously, the dynamic programming algorithm operates in linear computational time.
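Summing out indeterminate states can be sketched by brute-force enumeration, as below. The function name and the lookup table of probabilities are hypothetical; this shows the 2^i enumeration described in the text, not the linear-time dynamic programming variant.

```python
from itertools import product


def probability_with_indeterminates(states, probability):
    """Sum the probabilities of every possibility consistent with `states`.

    states: string over "M", "U" and "I" (indeterminate).
    probability: callable returning the probability of a fully observed string.
    """
    unknown = [i for i, s in enumerate(states) if s == "I"]
    total = 0.0
    for fill in product("MU", repeat=len(unknown)):   # 2^i completions
        candidate = list(states)
        for index, state in zip(unknown, fill):
            candidate[index] = state
        total += probability("".join(candidate))
    return total


# Hypothetical probabilities for the 8 possibilities over CpG sites 1 to 3.
table = {
    "MMM": 0.729, "MMU": 0.081, "MUM": 0.018, "MUU": 0.072,
    "UMM": 0.018, "UMU": 0.002, "UUM": 0.016, "UUU": 0.064,
}
summed = probability_with_indeterminates("MIU", table.__getitem__)
```

Here `summed` equals the probability of < M_1, M_2, U_3 > plus that of < M_1, U_2, U_3 >, i.e. 0.081 + 0.072 with the toy table.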
[0099] In one embodiment, the computational burden of calculating probabilities and/or p-value scores may be further reduced by caching at least some calculations. For example, the analytics system may cache in transitory or persistent memory calculations of probabilities for possibilities of methylation state vectors (or windows thereof). If other fragments have the same CpG sites, caching the possibility probabilities allows for efficient calculation of p-value scores without needing to re-calculate the underlying possibility probabilities. Equivalently, the analytics system may calculate p-value scores for each of the possibilities of methylation state vectors associated with a set of CpG sites from a vector (or window thereof). The analytics system may cache the p-value scores for use in determining the p-value scores of other fragments including the same CpG sites. Generally, the p-value scores of possibilities of methylation state vectors having the same CpG sites may be used to determine the p-value score of a different one of the possibilities from the same set of CpG sites.
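One way to picture the caching is with Python's `functools.lru_cache`, keyed by the set of CpG sites (here, a starting site and a length). The probability model `toy_probability` is a hypothetical placeholder, and an in-memory memoization decorator is only one of many possible caching strategies.

```python
from functools import lru_cache
from itertools import product


def toy_probability(first_site, states):
    """Hypothetical model: independent sites, each methylated with P = 0.8."""
    return 0.8 ** states.count("M") * 0.2 ** states.count("U")


@lru_cache(maxsize=None)
def window_p_values(first_site, length):
    """p-value scores for every possibility over one set of CpG sites.

    The result is computed once per (first_site, length) and then reused.
    Callers should treat the returned dict as read-only.
    """
    probs = {
        "".join(s): toy_probability(first_site, "".join(s))
        for s in product("MU", repeat=length)
    }
    return {
        pattern: sum(q for q in probs.values() if q <= p)
        for pattern, p in probs.items()
    }


# Two fragments covering the same CpG sites share one computation.
score_a = window_p_values(23, 3)["MMM"]
score_b = window_p_values(23, 3)["UUU"]
info = window_p_values.cache_info()
```

The second lookup is served from the cache, so the 8 possibility probabilities for these sites are computed only once.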
A3. Example Analytics System
[0100] FIG. 3 is a flowchart of devices for sequencing nucleic acid samples according to one embodiment. This illustrative flowchart includes devices such as a sequencer 320 and an analytics system 300. The sequencer 320 and the analytics system 300 may work in tandem to perform one or more steps in the sequencing and analytics processes.
[0101] In various embodiments, the sequencer 320 receives an enriched nucleic acid sample 310. As shown in FIG. 3, the sequencer 320 can include a graphical user interface 325 that enables user interactions with particular tasks (e.g., initiate sequencing or terminate sequencing) as well as one or more loading stations 330 for loading a sequencing cartridge including the enriched fragment samples and/or for loading necessary buffers for performing the sequencing assays. Therefore, once a user of the sequencer 320 has provided the necessary reagents and sequencing cartridge to the loading station 330 of the sequencer 320, the user can initiate sequencing by interacting with the graphical user interface 325 of the sequencer 320. Once initiated, the sequencer 320 performs the sequencing and outputs the sequence reads of the enriched fragments from the nucleic acid sample 310.
[0102] In some embodiments, the sequencer 320 is communicatively coupled with the analytics system 300. The analytics system 300 includes some number of computing devices used for processing the sequence reads for various applications such as assessing methylation status at one or more CpG sites, variant calling or quality control. The sequencer 320 may provide the sequence reads in a BAM file format to the analytics system 300. The analytics system 300 can be communicatively coupled to the sequencer 320 through a wireless, wired, or a combination of wireless and wired communication technologies. Generally, the analytics system 300 is configured with a processor and non-transitory computer-readable storage medium storing computer instructions that, when executed by the processor, cause the processor to process the sequence reads or to perform one or more steps of any of the methods or processes disclosed herein.
[0103] In various embodiments, for example when a paired-end sequencing process is used, a sequence read is comprised of a read pair denoted as R_1 and R_2. For example, the first read R_1 may be sequenced from a first end of a double-stranded DNA (dsDNA) molecule whereas the second read R_2 may be sequenced from the second end of the double-stranded DNA (dsDNA). Therefore, nucleotide base pairs of the first read R_1 and second read R_2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome. Alignment position information derived from the read pair R_1 and R_2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R_1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R_2). In other words, the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds. An output file having SAM (sequence alignment map) format or BAM (binary) format may be generated and output for further analysis.
B. Sample Labeling and Pooling for Group Stage
[0104] FIG. 4 illustrates a process in which unique molecule identifiers (UMIs) are ligated to each fragment in a sample, and then a common sample barcode (SB) is added to all fragments in the sample, which is unique among all samples being processed in a pool, prior to sequencing.
[0105] As shown in FIG. 4, in a first ligation (430) UMI 1 415 is ligated onto fragment 1 410, and UMI 2 425 is ligated onto fragment 2 420. UMI 1 415 and UMI 2 425 are substantially distinct. The diagonal hatched boxes represent nucleotides on the non-ligated end of the UMIs. The diagonal hatched boxes may be primers for binding enzymes to the fragments. The UMIs are ligated onto the 3’ end of the original fragments.
[0106] Amplification (440) occurs to amplify the original fragments ligated with UMIs. The resulting amplicons may comprise synthetically generated fragments and/or the original fragments. For example, fragment 1 410 is copied resulting in three amplicons, fragment 1A 412, fragment 1B 412, and fragment 1C 412. Each of the amplicons of fragment 1 410 includes UMI 1 415 that was ligated onto the original fragment 1 410. Fragment 2 420 is amplified resulting in two amplicons, fragment 2A 422 and fragment 2B 422, each with UMI 2 425.
[0107] At second ligation (450), a sample barcode (SB) is appended to the amplicons. In particular, all amplicons receive sample barcode (SB) 405 at the 5’ end, which is opposite the UMIs. The SB 405 may also comprise one or more nucleotides at the non-ligating end to protect the sample barcode.
[0108] Once each fragment in a sample is labeled with a sample-specific sample barcode (SB), two or more samples may be pooled to form a pooled sample for group testing.
C. Group Stage Classification
[0109] Disease classifiers can be trained and/or tuned to receive a feature vector for a pooled sample and determine whether the pooled sample contains an individual sample from a test subject that has cancer or, more specifically, a particular cancer type.
[0110] The cancer classifier comprises a plurality of classification parameters and a function representing a relation between the feature vector as input and the cancer prediction as output determined by the function operating on the input feature vector with the classification parameters. In one embodiment, the feature vectors input into the cancer classifier are based on a set of anomalous fragments determined from the pooled sample. The anomalous fragments may be determined via a process as described in Section A2 above.
C1. Training of Disease Classifiers
[0111] A cancer classifier can be trained by first obtaining a plurality of training samples each having a set of anomalous fragments and a label of a cancer type. The plurality of training samples includes any combination of samples from healthy individuals with a general label of “noncancer,” samples from subjects with a general label of “cancer” or a specific label (e.g., “breast cancer,” “lung cancer,” etc.). The training samples from subjects for one cancer type may be termed a cohort for that cancer type or a cancer type cohort.
[0112] The analytics system determines, for each training sample, a feature vector based on the set of anomalous fragments of the training sample. The analytics system calculates an anomaly score for each CpG site in an initial set of CpG sites. The initial set of CpG sites may be all CpG sites in the human genome or some portion thereof, which may be on the order of 10^4, 10^5, 10^6, 10^7, 10^8, etc. In one embodiment, the analytics system defines the anomaly score for the feature vector with a binary scoring based on whether there is an anomalous fragment in the set of anomalous fragments that encompasses the CpG site. In another embodiment, the analytics system defines the anomaly score based on a count of anomalous fragments overlapping the CpG site. In one example, the analytics system may use a trinary scoring assigning a first score for lack of presence of anomalous fragments, a second score for presence of a few anomalous fragments, and a third score for presence of more than a few anomalous fragments. For example, the analytics system counts 5 anomalous fragments in a sample that overlap the CpG site and calculates an anomaly score based on the count of 5.
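The three scoring schemes can be sketched as follows. The function name, the representation of a fragment as an inclusive (start, end) interval, and the cut-off of 3 fragments for "a few" are assumptions chosen for illustration; the text does not define those details.

```python
def anomaly_scores(cpg_positions, anomalous_fragments, mode="binary", few=3):
    """Score each CpG site by the anomalous fragments that overlap it.

    cpg_positions: genomic positions of the CpG sites in the initial set.
    anomalous_fragments: (start, end) intervals, both ends inclusive.
    mode: "binary", "count", or "trinary"; `few` is a hypothetical cut-off.
    """
    scores = []
    for site in cpg_positions:
        count = sum(1 for start, end in anomalous_fragments if start <= site <= end)
        if mode == "binary":
            scores.append(1 if count > 0 else 0)
        elif mode == "count":
            scores.append(count)
        elif mode == "trinary":
            scores.append(0 if count == 0 else 1 if count <= few else 2)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return scores


sites = [100, 200, 300]
fragments = [(90, 150), (95, 110), (100, 100), (80, 120), (99, 101), (290, 310)]
binary = anomaly_scores(sites, fragments, "binary")
counted = anomaly_scores(sites, fragments, "count")
trinary = anomaly_scores(sites, fragments, "trinary")
```

In this toy sample the first site is overlapped by 5 anomalous fragments, the second by none, and the third by one, so the three modes give different feature vectors for the same data.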
[0113] Once all anomaly scores are determined for a training sample, the analytics system determines the feature vector as a vector of elements including, for each element, one of the anomaly scores associated with one of the CpG sites in an initial set. The analytics system normalizes the anomaly scores of the feature vector based on a coverage of the sample. Here, coverage refers to a median or average sequencing depth over all CpG sites covered by the initial set of CpG sites used in the classifier, or based on the set of anomalous fragments for a given training sample.
[0114] As an example, reference is now made to FIG. 5 illustrating a matrix of training feature vectors 522. In this example, the analytics system has identified CpG sites [K] 526 for consideration in generating feature vectors for the cancer classifier. The analytics system selects training samples [N] 524. The analytics system determines a first anomaly score 528 for a first arbitrary CpG site [k1] to be used in the feature vector for a training sample [n1]. The analytics system checks each anomalous fragment in the set of anomalous fragments. If the analytics system identifies at least one anomalous fragment that includes the first CpG site, then the analytics system determines the first anomaly score 528 for the first CpG site as 1, as illustrated in FIG. 5. Considering a second arbitrary CpG site [k2], the analytics system similarly checks the set of anomalous fragments for at least one that includes the second CpG site [k2]. If the analytics system does not find any such anomalous fragment that includes the second CpG site, the analytics system determines a second anomaly score 529 for the second CpG site [k2] to be 0, as illustrated in FIG. 5. Once the analytics system determines all the anomaly scores for the initial set of CpG sites, the analytics system determines the feature vector for the first training sample [n1] including the anomaly scores with the feature vector including the first anomaly score 528 of 1 for the first CpG site [k1] and the second anomaly score 529 of 0 for the second CpG site [k2] and subsequent anomaly scores, thus forming a feature vector [1, 0, ..
[0115] The analytics system may further limit the CpG sites considered for use in the cancer classifier. The analytics system computes, for each CpG site in the initial set of CpG sites, an information gain based on the feature vectors of the training samples. Each training sample has a feature vector that may contain an anomaly score for all CpG sites in the initial set of CpG sites, which could include up to all CpG sites in the human genome. However, some CpG sites in the initial set of CpG sites may not be as informative as others in distinguishing between cancer types, or may be duplicative with other CpG sites.
[0116] In one embodiment, the analytics system computes an information gain for each cancer type and for each CpG site in the initial set to determine whether to include that CpG site in the classifier. The information gain is computed for training samples with a given cancer type compared to all other samples. For example, two random variables ‘anomalous fragment’ (‘AF’) and ‘cancer type’ (‘CT’) are used. In one embodiment, AF is a binary variable indicating whether there is an anomalous fragment overlapping a given CpG site in a given sample as determined for the anomaly score / feature vector above. CT is a random variable indicating whether the cancer is of a particular type. The analytics system computes the mutual information with respect to CT given AF. That is, how many bits of information about the cancer type are gained if it is known whether there is an anomalous fragment overlapping a particular CpG site. In practice, for a first cancer type, the analytics system computes pairwise mutual information gain against each other cancer type and sums the mutual information gain across all the other cancer types.
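A minimal sketch of the information gain for one CpG site is given below, using the standard empirical mutual information between two binary variables, measured in bits. The function names and the toy labels are hypothetical, and the plug-in frequency estimate is an assumption; the text does not state how the probabilities are estimated.

```python
from math import log2


def mutual_information(af, ct):
    """Empirical mutual information (bits) between two binary sequences."""
    n = len(af)
    mi = 0.0
    for a in (0, 1):
        for c in (0, 1):
            joint = sum(1 for x, y in zip(af, ct) if x == a and y == c) / n
            p_a = sum(1 for x in af if x == a) / n
            p_c = sum(1 for y in ct if y == c) / n
            if joint > 0:
                mi += joint * log2(joint / (p_a * p_c))
    return mi


def summed_pairwise_gain(af, labels, target):
    """Sum the pairwise gains of `target` against each other label."""
    total = 0.0
    for other in sorted(set(labels) - {target}):
        pairs = [
            (a, 1 if label == target else 0)
            for a, label in zip(af, labels)
            if label in (target, other)
        ]
        total += mutual_information([a for a, _ in pairs], [c for _, c in pairs])
    return total


# One CpG site: AF is 1 only in the two breast cancer samples.
af = [1, 1, 0, 0, 0, 0]
labels = ["breast", "breast", "lung", "lung", "non-cancer", "non-cancer"]
gain = summed_pairwise_gain(af, labels, "breast")
```

Because AF separates the breast cancer samples perfectly from each of the other two groups in this toy data, each pairwise comparison contributes 1 bit and the summed gain is 2 bits.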
[0117] For a given cancer type, the analytics system uses this information to rank CpG sites based on how cancer specific they are. This procedure is repeated for all cancer types under consideration. If a particular region is commonly anomalously methylated in training samples of a given cancer but not in training samples of other cancer types or in healthy training samples, then CpG sites overlapped by those anomalous fragments will tend to have high information gains for the given cancer type. The ranked CpG sites for each cancer type are greedily added (selected) to a selected set of CpG sites based on their rank for use in the cancer classifier.
[0118] In additional embodiments, the analytics system may consider other selection criteria for selecting informative CpG sites to be used in the cancer classifier. One selection criterion may be that the selected CpG sites are above a threshold separation from other selected CpG sites. For example, the selected CpG sites are to be over a threshold number of base pairs away from any other selected CpG site (e.g., 100 base pairs), such that CpG sites that are within the threshold separation are not both selected for consideration in the cancer classifier.
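Greedy selection with a minimum separation can be sketched as below for a single ranked list. The function name, the site identifiers, the cap on the number of selected sites, and the simplification that all sites lie on one chromosome are assumptions made for illustration.

```python
def select_sites(ranked_sites, positions, max_sites, min_separation=100):
    """Greedily pick sites in rank order, skipping any that sit too close.

    ranked_sites: site identifiers ordered from most to least informative.
    positions: maps each site identifier to its base-pair position
    (all sites are assumed to lie on the same chromosome).
    """
    selected = []
    for site in ranked_sites:
        if len(selected) >= max_sites:
            break
        far_enough = all(
            abs(positions[site] - positions[chosen]) > min_separation
            for chosen in selected
        )
        if far_enough:
            selected.append(site)
    return selected


positions = {"a": 1000, "b": 1050, "c": 1200, "d": 1290}
chosen = select_sites(["a", "b", "c", "d"], positions, max_sites=3)
```

In this toy ranking, site b is skipped because it lies 50 base pairs from a, and site d is skipped because it lies 90 base pairs from c, so only a and c are selected.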
[0119] In one embodiment, according to the selected set of CpG sites from the initial set, the analytics system may modify the feature vectors of the training samples as needed. For example, the analytics system may truncate feature vectors to remove anomaly scores corresponding to CpG sites not in the selected set of CpG sites.
[0120] With the feature vectors of the training samples, the analytics system may train the cancer classifier in any of a number of ways. In one embodiment, the analytics system trains a binary cancer classifier to distinguish between cancer and non-cancer based on the feature vectors of the training samples. In this manner, the analytics system uses training samples that include both noncancer samples from healthy individuals and cancer samples from subjects. Each training sample has one of the two labels “cancer” or “non-cancer.” In this embodiment, the classifier outputs a cancer prediction indicating the likelihood of the presence or absence of cancer.
[0121] In another embodiment, the analytics system trains a multiclass cancer classifier to distinguish between many cancer types (also referred to as tissue of origin (TOO) labels). Cancer types include one or more cancers and may include a non-cancer type (and may also include any additional other diseases or genetic disorders, etc.). To do so, the analytics system uses the cancer type cohorts and may or may not include a non-cancer type cohort. In this multi-cancer embodiment, the cancer classifier is trained to determine a cancer prediction (or, more specifically, a TOO prediction) that comprises a prediction value for each of the cancer types being classified. The prediction values may correspond to a likelihood that a given training sample (and during inference, a test sample) has each of the cancer types. In one implementation, the prediction values are scored between 0 and 100, wherein the sum of the prediction values equals 100. For example, the cancer classifier returns a cancer prediction including a prediction value for breast cancer, lung cancer, and non-cancer. For example, the classifier can return a cancer prediction that a test sample has a 65% likelihood of breast cancer, a 25% likelihood of lung cancer, and a 10% likelihood of non-cancer. The analytics system may further evaluate the prediction values to generate a prediction of a presence of one or more cancers in the sample, which may also be referred to as a TOO prediction indicating one or more TOO labels, e.g., a first TOO label with the highest prediction value, a second TOO label with the second highest prediction value, etc. Continuing with the example above and given the percentages, the system may determine that the sample has breast cancer given that breast cancer has the highest likelihood.
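One way to obtain prediction values that lie between 0 and 100 and sum to 100 is a softmax over per-type scores, rescaled to percentages. The sketch below assumes such a softmax; the disclosure does not state which normalization is used, and the function names are hypothetical.

```python
import math

def prediction_values(scores):
    """Map raw per-type scores to values in [0, 100] that sum to 100.

    Uses a numerically stable softmax (an assumption; any normalization
    that yields a distribution over the types would serve).
    """
    peak = max(scores.values())
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    norm = sum(exps.values())
    return {label: 100.0 * e / norm for label, e in exps.items()}

def ranked_too_labels(values):
    """Return TOO labels ordered from highest to lowest prediction value."""
    return sorted(values, key=values.get, reverse=True)
```

Applied to the example above (65 for breast cancer, 25 for lung cancer, 10 for non-cancer), `ranked_too_labels` lists breast cancer first and lung cancer second.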
[0122] In both embodiments, the analytics system trains the cancer classifier by inputting sets of training samples with their feature vectors into the cancer classifier and adjusting classification parameters so that a function of the classifier accurately relates the training feature vectors to their corresponding label. The analytics system may group the training samples into sets of one or more training samples for iterative batch training of the cancer classifier. After inputting all sets of training samples including their training feature vectors and adjusting the classification parameters, the cancer classifier is sufficiently trained to label test samples according to their feature vector within some margin of error. The analytics system may train the cancer classifier according to any one of a number of methods. As an example, the binary cancer classifier may be an L2-regularized logistic regression classifier that is trained using a log-loss function. As another example, the multi-cancer classifier may be a multinomial logistic regression. In practice, either type of cancer classifier may be trained using other techniques. These techniques are numerous and include kernel methods, random forest classifiers, mixture models, autoencoder models, and machine learning algorithms such as multilayer neural networks.
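As an illustration of the binary case, the following is a minimal sketch of an L2-regularized logistic regression trained by full-batch gradient descent on the log-loss. The hyperparameters (`l2`, `lr`, `epochs`) and function names are illustrative assumptions; a production system would more likely use an established library.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, l2=0.01, lr=0.5, epochs=500):
    """Fit weights w and bias b by gradient descent.

    Minimizes mean log-loss + (l2 / 2) * ||w||^2. X is a list of feature
    vectors (e.g., binary anomaly scores); y holds labels 1 (cancer) or
    0 (non-cancer). The bias is not regularized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Predicted probability of the 'cancer' label for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```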
C2. Tuning of Disease Classifier for Group Testing
[0123] In general, the cancer classifiers are trained with individual samples. To use the classifiers on pooled group samples, in some embodiments, further tuning can be conducted.
[0124] During use of the cancer classifier, the analytics system determines a test feature vector for use by the cancer classifier. The analytics system calculates an anomaly score for each CpG site in a plurality of CpG sites in use by the cancer classifier. For example, the cancer classifier receives as input feature vectors inclusive of anomaly scores for 1,000 selected CpG sites. The analytics system thus determines a test feature vector inclusive of anomaly scores for the 1,000 selected CpG sites based on the set of anomalous fragments. The analytics system calculates the anomaly scores in the same manner as for the training samples. In one embodiment, the analytics system defines the anomaly score as a binary score based on whether there is a hypermethylated or hypomethylated fragment in the set of anomalous fragments that encompasses the CpG site.
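The binary anomaly score can be sketched as follows, assuming that each selected CpG site is an integer coordinate and each anomalous fragment is an inclusive (start, end) interval on the same chromosome. The data layout and function name are hypothetical.

```python
def binary_feature_vector(selected_sites, anomalous_fragments):
    """Return one 0/1 anomaly score per selected CpG site.

    A site scores 1 if at least one anomalous (hypermethylated or
    hypomethylated) fragment spans it, and 0 otherwise.
    """
    return [
        1 if any(start <= site <= end for start, end in anomalous_fragments) else 0
        for site in selected_sites
    ]
```

The resulting list has one entry per selected site, in the same order as the sites used in training, so it can be passed directly to the classifier.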
[0125] The analytics system then inputs the test feature vector into the cancer classifier. The function of the cancer classifier then generates a cancer prediction probability based on the classification parameters trained in the process and the test feature vector. In additional embodiments, the cancer prediction has prediction probability values for each of the many cancer types. Therefore, the analytics system may determine that the test sample is most likely to be of one of the cancer types.
[0126] For instance, the cancer prediction may be 60% likelihood of non-cancer and 40% likelihood of cancer. For an individual sample, the analytics system can use 50% as a cutoff, and thus determine that the test sample is likely not to have cancer. For a pooled sample, however, the cutoff value may need to be adjusted. The adjustment can be helpful for one or more of the following reasons. First, when a plurality of individual samples are pooled, the proportion of a DNA fragment that contributes to one or more features of the classification can be diluted. The dilution can affect the prediction outcome.
[0127] Second, to reduce overall cost, in some embodiments, the sequencing depth for a pooled sample may be reduced as compared to that for an individual sample. Such a reduction of sequencing depth may be relative to a particular sample, i.e., reduced for an individual sample given the dilution. The reduction of sequencing depth, in some embodiments, can be absolute, i.e., less sequencing for a given amount of nucleotide fragments.
[0128] In some embodiments, the sequencing depth at the group stage is lower than that at the subsequent individual sample testing stage (as discussed below).
[0129] In some embodiments, the sequencing depth is calculated for each genomic locus in the entire sample (i.e., the pooled sample in the group stage, or the individual sample in the individual stage), regardless of which individual sample a particular nucleotide fragment is from. Such a sequencing depth, therefore, can be conveniently referred to as “pooled sequencing depth.”
[0130] In one embodiment, the pooled sequencing depth at the group stage is at least 1x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 2x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 3x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 4x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 5x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 10x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 15x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 20x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 30x less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 50x less than the sequencing depth at the individual stage.
[0131] In one embodiment, the pooled sequencing depth at the group stage is at least 5% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 10% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 15% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 20% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 25% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 30% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 40% less than the sequencing depth at the individual stage. In one embodiment, the pooled sequencing depth at the group stage is at least 50% less than the sequencing depth at the individual stage.
[0132] In some embodiments, the sequencing depth is calculated for each genomic locus from all nucleotide fragments originating from an individual sample (rather than from all nucleotide fragments at the same locus from all samples in the pooled sample). Such a sequencing depth can be conveniently referred to as “individual sequencing depth.” If a pooled sample includes N individual samples, each at similar amounts, then the individual sequencing depth is about 1/N of the pooled sequencing depth. The individual sequencing depth may be estimated by dividing the pooled sequencing depth by N. Alternatively, the individual sequencing depth can be measured directly since all nucleotide fragments from an individual sample are labeled with a unique sample barcode (SB).
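The two ways of obtaining the individual sequencing depth can be sketched as follows. The sketch assumes that, for a given locus, the sample barcode of every covering fragment is available as a list of strings; the function names and barcode labels are hypothetical.

```python
from collections import Counter

def estimated_individual_depth(pooled_depth, n_samples):
    """Estimate per-sample depth as pooled depth / N.

    Valid only under the stated assumption that each of the N samples
    contributes a similar amount of material.
    """
    return pooled_depth / n_samples

def measured_individual_depth(barcodes_at_locus):
    """Count fragments per sample barcode at one locus.

    barcodes_at_locus: the sample barcode of each fragment covering the
    locus. Returns a Counter mapping barcode -> depth for that sample.
    """
    return Counter(barcodes_at_locus)
```

For example, a pooled depth of 80 across 4 equally represented samples gives an estimated individual depth of 20, whereas the barcode count reports the depth actually observed for each sample.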
[0133] In one embodiment, the individual sequencing depth at the group stage is at least 1x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 2x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 3x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 4x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 5x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 10x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 15x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 20x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 30x less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 50x less than the sequencing depth at the individual stage.
[0134] In one embodiment, the individual sequencing depth at the group stage is at least 5% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 10% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 15% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 20% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 25% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 30% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 40% less than the sequencing depth at the individual stage. In one embodiment, the individual sequencing depth at the group stage is at least 50% less than the sequencing depth at the individual stage.
[0135] In one embodiment, the cutoff value determined on the basis of individual sample testing is lowered for the group stage testing. Lowering the cutoff value, in some embodiments, increases the sensitivity of cancer detection in a pooled sample. In some embodiments, the cutoff value may be decreased, as compared to the one for individual samples, by about 5%, 10%, 15%, 20%, 25%, 30%, 35% or 40%, without limitation.
[0136] In some embodiments, the adjusted cutoff value can be determined for a particular demographic population. For instance, if all samples in the group are from a particular geographic (e.g., downtown Detroit) area or a particular age/gender group (e.g., retired male autoworkers), then certain cancer statistics of the demographics can be used for adjusting the cutoff value.
[0137] In one example, the overall cancer rate of the demographic group is used for determining the cutoff value. For example, in downtown Detroit, the overall cancer rate may be 1,000 cancer cases per 100,000 people. The cutoff value for such a cancer rate can be calculated and applied to the group. Calculation of a suitable cutoff value for a particular cancer rate, for instance, can be done with one or more simulation experiments, as demonstrated in the Experimental Examples.
[0138] In another example, a suitable cutoff value is determined with actual samples (or actual data) taken from and representing the demographic population. For instance, for downtown Detroit, samples of a number of residents (e.g., 100, 500, or 1,000) may be used to run a simulation test to determine a suitable cutoff that optimizes sensitivity, specificity, and number of runs. In another example, provided that the sequencing data are already available, they can be used for a virtual simulation to determine a suitable cutoff value.
[0139] In additional embodiments, it can be helpful, but not necessary, at the group stage, to call the pooled sample as having a particular cancer type. For example, the classifier can return a cancer prediction that a group sample is 55% likelihood of breast cancer, 15% likelihood of lung cancer, and 20% likelihood of non-cancer. In this example the system may determine that the sample has breast cancer given that breast cancer has the highest likelihood.
D. Individual Stage Classification
[0140] At the group testing stage, a pooled group sample that includes no individual cancer sample is likely classified as non-cancer. In that case, no further testing is required, and the cancer detection for the entire group concludes, with a prediction output that all samples in the group have no cancer. In a case where the predicted probability of having cancer is higher than the cutoff value, a prediction output comes back to indicate that at least one individual sample in the group has cancer.
[0141] The one or more individual cancer samples in the pooled sample need to be identified. In one example, additional sample processing of each individual sample in the pooled sample may not be necessary. As noted earlier, each nucleotide fragment in a pooled sample is ligated to a unique sample barcode (SB). In the classification model, it is possible to identify nucleotide fragments (e.g., with hypermethylation or hypomethylation) that contributed the most to the positive cancer prediction. If those fragments are commonly labeled with a sample barcode for a particular sample, then it is likely that that particular sample is the cancer sample.
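The barcode-based attribution described above can be sketched as follows. This is an illustrative sketch that assumes each fragment comes with a per-fragment contribution score from the classification model; how such a score is derived is not specified here, and the `top_k` and `min_share` parameters are hypothetical.

```python
from collections import Counter

def likely_cancer_sample(fragments, top_k=100, min_share=0.5):
    """Return the sample barcode that dominates the top-contributing fragments.

    fragments: (sample_barcode, contribution) pairs, where contribution is
    an assumed per-fragment score of how strongly the fragment drove the
    positive prediction. Returns None when no single barcode accounts for
    at least min_share of the top_k fragments.
    """
    top = sorted(fragments, key=lambda f: f[1], reverse=True)[:top_k]
    if not top:
        return None
    barcode, count = Counter(sb for sb, _ in top).most_common(1)[0]
    return barcode if count / len(top) >= min_share else None
```

Returning None rather than a weak best guess reflects that, when the top fragments are spread across several barcodes, the pooled data alone may not single out one sample.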
[0142] In an alternative and preferred embodiment, however, each individual sample in the pooled sample is subjected to another round of analysis, which can be referred to as “individual stage classification.”
[0143] The individual stage classification is similar to the group stage classification in many aspects. For instance, both entail ligation of unique molecule identifiers (UMIs), fragment enrichment, methylation-specific modification and sequencing, and classification.
[0144] In some embodiments, there are two important differences between the group stage classification and the individual stage classification. The first is that the individual stage classification uses a higher cancer probability threshold (cutoff value) than the group stage classification. For instance, for a classification, a 40% cutoff value may be used at the group stage, and a 50% cutoff value may be used at the individual stage.
[0145] The second difference, in some embodiments, is in sequencing depth, which is discussed in more detail in the previous section. Overall, the group stage sequencing has a lower sequencing depth, at least with respect to each individual sample, than the individual stage, which contributes to added savings of time and expense.
[0146] As demonstrated in the accompanying experimental examples, at least because a final cancer prediction includes confirmation from each individual cancer sample, the instant technology retains high specificity (low false positive rates) as compared to the conventional technology.
E. Example Embodiments
[0147] Certain steps of some of the embodiments of the present technology are illustrated in FIG. 6. The figure shows a process 600 that includes two stages of cancer classification, a group stage 610 and an individual stage 620. In the group stage, a sample is obtained from each individual who desires a cancer diagnosis. The sample can be, without limitation, blood, plasma, serum, semen, milk, urine, saliva or cerebral spinal fluid. Each sample includes a plurality of nucleic acid (NA) fragments. In particular, the NA fragments are cfDNA fragments.
[0148] In a first, optional step in the group stage, the samples can be treated to modify unmethylated cytosines in the NA fragments (651), to enable methylation detection subsequently. For example, in one embodiment, the sample can be treated with bisulfite ion (e.g., using sodium bisulfite) to convert unmethylated cytosines (“C”) to uracils (“U”). In another embodiment, the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic conversion reaction, for example, using a cytidine deaminase, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).
[0149] Each NA fragment from a sample can be ligated to a sample-specific barcode (SB) (653). An example process for SB-labeling is illustrated in FIG. 4. Optionally, each NA fragment can be further ligated to a “unique molecule identifier,” or “UMI,” and/or enriched.
[0150] The SB-labeled NA fragments from each sample can then be pooled together to prepare a “pooled sample” (655). The pooled sample can then be subjected to sequencing (657). As noted above, without limitation, the sequencing may be Sanger sequencing, fragment analysis, or next-generation sequencing. Following sequencing, the sequenced fragments (sequence reads) can be aligned to a reference genome (e.g., human reference genome assembly hg19 or hg38). Such alignment can help determine the locus of each sequence read. For methylation detection, the alignment can also help determine the methylation status of each cytosine from the original, unmodified fragments.
[0151] A sequencing process, in particular deep sequencing, can reach a certain level of depth, as described above. Generally, a higher sequencing depth requires a longer time, may need more sample, and is more expensive. In one embodiment, the sequencing depth (Dp) for each sample within the pooled sample at this group stage can be lower than that in the subsequent individual stage (Di), or the total Dp/N < Di, where N represents the number of samples in the pooled sample, assuming that each sample has similar amounts of NA fragments. The relative values of Dp and Di are discussed in more detail above in Section C2. In some embodiments, Dp is at least 5%, 10%, 20%, 30%, 40%, or 50% lower than N x Di.
[0152] The sequence reads, along with the locus information and/or methylation status, can be fed into a classification model to make a prediction, e.g., to generate a cancer probability score (661). Training and application of cancer classification models are as discussed above. In one example, a cancer probability score is a percentage, such as 40% (or a pair of percentages, such as 60% likely non-cancer and 40% cancer).
[0153] To make a prediction, the cancer probability score obtained in step 661 can be compared to a predetermined cutoff value (Cp) (663). Again, as discussed in more detail in the disclosure, the cutoff value (Cp) at the group stage can be different from that at the subsequent individual stage (Ci). A conventional cancer classification cutoff value is set for individual patient samples. Further adjustment may be needed to apply it to pooled samples. In some embodiments, the adjustment takes as input a cancer incidence rate associated with the individuals. In some embodiments, the Cp is determined with actual samples from individuals representing the demographic of the samples being tested. Further details of these embodiments have been discussed above.
[0154] If the cancer probability score obtained in step 661 is lower than the pooled cutoff value (Cp), then the system can make a prediction that none of the samples in the pooled sample is from a cancer patient (630). On the other hand, if the cancer probability score obtained in step 661 is greater than the pooled cutoff value (Cp) (640), then the procedure moves to the individual stage (620) - each sample in the pooled sample is individually tested for its cancer status.
[0155] The individual stage is to a certain extent similar to the group stage, including an optional cytosine modification step (665), a sequencing step (667) to a sequencing depth of Di, an alignment step (669), a classification step to calculate a cancer probability score (671), and a prediction step with an individual cutoff value Ci (673). As noted above, Di and Ci, respectively, may be different from the Dp and Cp used in the group stage.
[0156] If the cancer probability score is greater than the individual cutoff value Ci, then the sample can be predicted as having cancer. In some embodiments, the classification makes a further prediction with respect to the type or subtype of the cancer.
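The decision flow of process 600 can be sketched as follows. This is a minimal sketch in which the two scoring steps are supplied as caller-provided functions; the default cutoffs of 0.40 (Cp) and 0.50 (Ci) follow the example values given earlier, and the function and parameter names are hypothetical. The treatment of a score exactly equal to a cutoff is an assumption, since the text addresses only the "lower than" and "greater than" cases.

```python
def two_stage_classification(sample_ids, score_pool, score_individual,
                             cp=0.40, ci=0.50):
    """Run the group stage, then the individual stage only if needed.

    score_pool(sample_ids) -> cancer probability score of the pooled sample.
    score_individual(sample_id) -> cancer probability score of one sample.
    Returns (calls, n_predictions): calls maps each sample id to True
    (predicted cancer) or False, and n_predictions counts the
    classifications performed.
    """
    # Group stage: a score not exceeding Cp clears the whole pool.
    # (Assumption: a score equal to Cp is treated as not exceeding it.)
    if score_pool(sample_ids) <= cp:
        return {sid: False for sid in sample_ids}, 1
    # Individual stage: every sample in the flagged pool is tested with Ci.
    calls = {sid: score_individual(sid) > ci for sid in sample_ids}
    return calls, 1 + len(sample_ids)
```

A cleared pool costs a single prediction for all of its samples, while a flagged pool costs one prediction plus one per sample, which is where the savings reported in the experimental examples come from.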
CLINICAL APPLICATIONS
[0157] In some embodiments, the methods, analytic systems and/or classifier of the present invention can be used to detect the presence of cancer, monitor cancer progression or recurrence, monitor therapeutic response or effectiveness, determine a presence or monitor minimum residual disease (MRD), or any combination thereof. For example, as described herein, a classifier can be used to generate a probability score (e.g., from 0 to 100) describing a likelihood that a test feature vector is from a subject with cancer. In some embodiments, a cancer detection is made as to whether or not the subject has cancer. In other embodiments, the cancer detection can be made at multiple different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy). In still other embodiments, the cancer detection can be used to make or influence a clinical decision (e.g., diagnosis of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, following the cancer detection, a physician can prescribe an appropriate treatment.
[0158] Examples of cancers that can be detected using the methods, systems and classifiers of the present invention include carcinoma, lymphoma, blastoma, sarcoma, and leukemia or lymphoid malignancies. More particular examples of such cancers include, but are not limited to, squamous cell cancer (e.g., epithelial squamous cell cancer), skin carcinoma, melanoma, lung cancer, including small-cell lung cancer, non-small cell lung cancer (“NSCLC”), adenocarcinoma of the lung and squamous carcinoma of the lung, cancer of the peritoneum, gastric or stomach cancer including gastrointestinal cancer, pancreatic cancer (e.g., pancreatic ductal adenocarcinoma), cervical cancer, ovarian cancer (e.g., high grade serous ovarian carcinoma), liver cancer (e.g., hepatocellular carcinoma (HCC)), hepatoma, hepatic carcinoma, bladder cancer (e.g., urothelial bladder cancer), testicular (germ cell tumor) cancer, breast cancer (e.g., HER2 positive, HER2 negative, and triple negative breast cancer), brain cancer (e.g., astrocytoma, glioma (e.g., glioblastoma)), colon cancer, rectal cancer, colorectal cancer, endometrial or uterine carcinoma, salivary gland carcinoma, kidney or renal cancer (e.g., renal cell carcinoma, nephroblastoma or Wilms’ tumor), prostate cancer, vulval cancer, thyroid cancer, anal carcinoma, penile carcinoma, head and neck cancer, esophageal carcinoma, and nasopharyngeal carcinoma (NPC). Additional examples of cancers include, without limitation, retinoblastoma, thecoma, arrhenoblastoma, hematological malignancies, including but not limited to non-Hodgkin's lymphoma (NHL), multiple myeloma and acute hematological malignancies, endometriosis, fibrosarcoma, choriocarcinoma, laryngeal carcinomas, Kaposi's sarcoma, Schwannoma, oligodendroglioma, neuroblastomas, rhabdomyosarcoma, osteogenic sarcoma, leiomyosarcoma, and urinary tract carcinomas.
[0159] In some embodiments, the cancer is one or more of anorectal cancer, bladder cancer, breast cancer, cervical cancer, colorectal cancer, esophageal cancer, gastric cancer, head & neck cancer, hepatobiliary cancer, leukemia, lung cancer, lymphoma, melanoma, multiple myeloma, ovarian cancer, pancreatic cancer, prostate cancer, renal cancer, thyroid cancer, uterine cancer, or any combination thereof.
[0160] In some embodiments, the cancer prediction can be assessed at multiple different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy). For example, the present invention includes methods that involve obtaining a first sample (e.g., a first plasma cfDNA sample) from a cancer patient at a first time point, determining a first cancer prediction therefrom (as described herein), obtaining a second test sample (e.g., a second plasma cfDNA sample) from the cancer patient at a second time point, and determining a second cancer prediction therefrom (as described herein).
[0161] In certain embodiments, the first time point is before a cancer treatment (e.g., before a resection surgery or a therapeutic intervention), and the second time point is after a cancer treatment (e.g., after a resection surgery or therapeutic intervention), and the classifier is utilized to monitor the effectiveness of the treatment. For example, if the second cancer prediction decreases compared to the first cancer prediction, then the treatment is considered to have been successful. However, if the second cancer prediction increases compared to the first cancer prediction, then the treatment is considered to have not been successful. In other embodiments, both the first and second time points are before a cancer treatment (e.g., before a resection surgery or a therapeutic intervention). In still other embodiments, both the first and the second time points are after a cancer treatment (e.g., after a resection surgery or a therapeutic intervention). In still other embodiments, cfDNA samples may be obtained from a cancer patient at a first and second time point and analyzed, e.g., to monitor cancer progression, to determine if a cancer is in remission (e.g., after treatment), to monitor or detect residual disease or recurrence of disease, or to monitor treatment (e.g., therapeutic) efficacy.
[0162] Those of skill in the art will readily appreciate that test samples can be obtained from a cancer patient over any desired set of time points and analyzed in accordance with the methods of the invention to monitor a cancer state in the patient. In some embodiments, the first and second time points are separated by an amount of time that ranges from about 15 minutes up to about 30 years, such as about 30 minutes, such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or about 24 hours, such as about 1, 2, 3, 4, 5, 10, 15, 20, 25 or about 30 days, or such as about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 months, or such as about 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5, 16, 16.5, 17, 17.5, 18, 18.5, 19, 19.5, 20, 20.5, 21, 21.5, 22, 22.5, 23, 23.5, 24, 24.5, 25, 25.5, 26, 26.5, 27, 27.5, 28, 28.5, 29, 29.5 or about 30 years. In other embodiments, test samples can be obtained from the patient at least once every 3 months, at least once every 6 months, at least once a year, at least once every 2 years, at least once every 3 years, at least once every 4 years, or at least once every 5 years.
[0163] In still another embodiment, the cancer prediction can be used to make or influence a clinical decision (e.g., diagnosis of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, if the cancer prediction (e.g., for cancer or for a particular cancer type) exceeds a threshold, a physician can prescribe an appropriate treatment (e.g., a resection surgery, radiation therapy, chemotherapy, and/or immunotherapy).
[0164] In some embodiments, the treatment is one or more cancer therapeutic agents selected from the group consisting of a chemotherapy agent, a targeted cancer therapy agent, a differentiating therapy agent, a hormone therapy agent, and an immunotherapy agent. For example, the treatment can be one or more chemotherapy agents selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, anti-tumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, platinum-based agents and any combination thereof. In some embodiments, the treatment is one or more targeted cancer therapy agents selected from the group consisting of signal transduction inhibitors (e.g., tyrosine kinase and growth factor receptor inhibitors), histone deacetylase (HDAC) inhibitors, retinoic receptor agonists, proteasome inhibitors, angiogenesis inhibitors, and monoclonal antibody conjugates. In some embodiments, the treatment is one or more differentiating therapy agents including retinoids, such as tretinoin, alitretinoin and bexarotene. In some embodiments, the treatment is one or more hormone therapy agents selected from the group consisting of anti-estrogens, aromatase inhibitors, progestins, estrogens, anti-androgens, and GnRH agonists or analogs. In one embodiment, the treatment is one or more immunotherapy agents selected from the group comprising monoclonal antibody therapies such as rituximab (RITUXAN) and alemtuzumab (CAMPATH), non-specific immunotherapies and adjuvants, such as BCG, interleukin-2 (IL-2), and interferon-alfa, immunomodulating drugs, for instance, thalidomide and lenalidomide (REVLIMID). It is within the capabilities of a skilled physician or oncologist to select an appropriate cancer therapeutic agent based on characteristics such as the type of tumor, cancer stage, previous exposure to cancer treatment or therapeutic agent, and other characteristics of the cancer.
EXPERIMENTAL EXAMPLES
[0165] Simulation experiments were conducted to evaluate the performance of group testing in terms of efficiency, sensitivity and specificity.
[0166] In a first experiment, the group size was set at 2, and the cancer rate was set at 1% (0.01). For 100 samples, 50 groups were generated, and only one of the 100 samples had cancer. In the group stage, if none of the groups were filtered out (FracFiltered = 0), then all 100 samples were individually tested. In this case, the total number of tests/predictions needed was 150 (NumPredictsPer100 = 50 group tests + 100 individual tests = 150). In this example, the sensitivity was 54.20% and the specificity was 99.37%. A probability of cancer given a positive result (PPV) was then calculated as shown below:
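The PPV expression itself is not reproduced in this text. As a minimal sketch, assuming the standard Bayes' rule definition of PPV (which may differ in form from the expression used in the original), the calculation from the sensitivity, specificity, and cancer rate quoted in paragraph [0166] could look like the following. The function name `positive_predictive_value` is a hypothetical helper introduced for illustration.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Bayes' rule: P(cancer | positive result).

    Assumption: the standard textbook definition of PPV; the source
    does not show its own expression.
    """
    # Fraction of the population that has cancer and tests positive.
    true_positive = sensitivity * prevalence
    # Fraction of the population that is cancer-free but tests positive.
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)


# Values quoted in paragraph [0166]: sensitivity 54.20%,
# specificity 99.37%, cancer rate 1%.
ppv = positive_predictive_value(sensitivity=0.5420,
                                specificity=0.9937,
                                prevalence=0.01)
print(f"PPV = {ppv:.4f}")
```

At a low cancer rate, the false-positive term in the denominator is weighted by the large cancer-free fraction of the population, which is why PPV is sensitive to even small losses of specificity.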
[0167] The simulation results are shown in Table 1.
Table 1. Simulation results - GroupSize = 2, CancerRate = 0.01
[0168] In all eight simulations, the specificity remained high. As the number of groups filtered out at the group stage increased, the total number of tests decreased, leading to significant savings of time and expense.
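The relationship between the filtered fraction and the test count can be sketched as follows. This is an illustration only, assuming that FracFiltered is the fraction of equal-sized groups excluded at the group stage and that every sample in a retained group is tested individually. The function name `predictions_per_100` is hypothetical; it reproduces the count of 150 given in paragraph [0166] for GroupSize = 2 and FracFiltered = 0.

```python
def predictions_per_100(group_size: int,
                        frac_filtered: float,
                        n_samples: int = 100) -> float:
    """Total predictions per n_samples for a two-stage group test.

    Assumptions (not stated in this form in the source): groups are of
    equal size, and frac_filtered is the fraction of groups (hence of
    samples) that are not passed on to individual testing.
    """
    group_tests = n_samples / group_size
    individual_tests = n_samples * (1.0 - frac_filtered)
    return group_tests + individual_tests


# GroupSize = 2, as in the first experiment.
for frac in (0.0, 0.25, 0.5, 0.75):
    total = predictions_per_100(group_size=2, frac_filtered=frac)
    print(f"FracFiltered = {frac:.2f} -> {total:.0f} predictions per 100 samples")
```

Under these assumptions, group testing needs fewer predictions than testing all samples individually only once FracFiltered exceeds 1/GroupSize, since the group stage itself costs 100/GroupSize predictions.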
[0169] Adjustment of the classification parameters was tested to see if the sensitivity could be improved. As shown in Table 2, higher sensitivity could indeed be achieved with such adjustment. Again, such adjustment would not significantly lower the specificity given that all cancer detection required confirmation at the individual sample level.
Table 2. Improvement of Sensitivity
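The reasoning in paragraph [0169] can be illustrated with a toy sketch of the two-stage decision. Everything here is a stated assumption rather than the classifier described in the specification: the sample scores are made-up numbers, the pooled score is stood in for by the mean of the individual scores, and `two_stage_calls` is a hypothetical helper. The point is only that a sample is called positive when both the pooled score and the individual score exceed their cutoffs, so lowering the pooled cutoff can add individual tests but cannot by itself add positive calls that fail the individual cutoff.

```python
from typing import Callable, List, Sequence


def two_stage_calls(groups: Sequence[Sequence[str]],
                    pooled_score: Callable[[Sequence[str]], float],
                    individual_score: Callable[[str], float],
                    pooled_cutoff: float,
                    individual_cutoff: float) -> List[str]:
    """Return the sample IDs called positive by a two-stage group test."""
    positives: List[str] = []
    for group in groups:
        # Group stage: groups at or below the pooled cutoff are filtered out.
        if pooled_score(group) <= pooled_cutoff:
            continue
        # Individual stage: every sample in a retained group is confirmed.
        for sample in group:
            if individual_score(sample) > individual_cutoff:
                positives.append(sample)
    return positives


# Made-up individual scores, for illustration only.
scores = {"s1": 0.05, "s2": 0.10, "s3": 0.02, "s4": 0.90, "s5": 0.30, "s6": 0.40}
groups = [["s1", "s2"], ["s3", "s4"], ["s5", "s6"]]


def individual(sample: str) -> float:
    return scores[sample]


def pooled(group: Sequence[str]) -> float:
    # Toy stand-in: the mean of the individual scores in the group.
    return sum(scores[s] for s in group) / len(group)


strict = two_stage_calls(groups, pooled, individual, 0.20, 0.50)
lenient = two_stage_calls(groups, pooled, individual, 0.05, 0.50)
print(strict, lenient)
```

In this toy data, the lower pooled cutoff lets one more group through to the individual stage, yet the set of positive calls is unchanged because the individual cutoff still applies to every sample.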
[0170] Similar simulations were conducted for different group sizes and cancer rates, and the simulation results are shown in Tables 3-5.
Table 3. Simulation results - GroupSize = 4, CancerRate = 0.01
Table 4. Simulation results - GroupSize = 8, CancerRate = 0.01
Table 5. Simulation results - GroupSize = 8, CancerRate = 0.001
[0171] As shown, all simulations achieved high specificity with good sensitivity while significantly reducing the number of tests/predictions, and hence yielding significant savings of time and expense.
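For orientation, the group sizes and cancer rates used in Tables 1 and 3-5 can be compared against the classical two-stage (Dorfman) group-testing expectation. This is a sketch under a strong assumption that is not made in the specification: an idealized, error-free group-stage classifier, so that a group is passed on exactly when it contains at least one cancer sample. The simulated values in the tables come from a real classifier and will differ. The function name `ideal_predictions_per_100` is hypothetical.

```python
def ideal_predictions_per_100(group_size: int,
                              cancer_rate: float,
                              n_samples: int = 100) -> float:
    """Expected predictions per n_samples with an error-free group stage.

    Assumption: samples are independent, and a group is retained for
    individual testing if and only if it contains at least one cancer
    sample (classical Dorfman two-stage testing).
    """
    group_tests = n_samples / group_size
    # Probability that a group contains at least one cancer sample.
    p_group_positive = 1.0 - (1.0 - cancer_rate) ** group_size
    individual_tests = n_samples * p_group_positive
    return group_tests + individual_tests


# Group sizes and cancer rates named in the captions of Tables 1 and 3-5.
for group_size, cancer_rate in [(2, 0.01), (4, 0.01), (8, 0.01), (8, 0.001)]:
    total = ideal_predictions_per_100(group_size, cancer_rate)
    print(f"GroupSize = {group_size}, CancerRate = {cancer_rate}: "
          f"{total:.2f} predictions per 100 samples")
```

The idealized count falls as the group size grows and as the cancer rate drops, which is consistent with the direction of the savings described above, though not a substitute for the simulated figures.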
ADDITIONAL CONSIDERATIONS
[0172] The foregoing detailed description of embodiments refers to the accompanying drawings, which illustrate specific embodiments of the present disclosure. Other embodiments having different structures and operations do not depart from the scope of the present disclosure. The term “the invention” or the like is used with reference to certain specific examples of the many alternative aspects or embodiments of the applicants’ invention set forth in this specification, and neither its use nor its absence is intended to limit the scope of the applicants’ invention or the scope of the claims.
[0173] Embodiments of the invention may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0174] Any of the steps, operations, or processes described herein as being performed by the analytics system may be performed or implemented with one or more hardware or software modules of the apparatus, alone or in combination with other computing devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.

Claims

WHAT IS CLAIMED IS:
1. A method for identifying a sample as from a subject having cancer, comprising:
(A) pooling nucleic acid (NA) fragments of a number (N) of samples to generate a pooled sample, wherein each sample is from a subject; sequencing the NA fragments in the pooled sample at a pooled sequencing depth Dp; aligning the sequenced NA fragments to a reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to a cancer classification model to obtain a pooled cancer probability score; comparing the pooled cancer probability score to a pooled cutoff value Cp; and when the pooled cancer probability score is greater than the Cp, subjecting each sample to the steps in (B),
(B) sequencing the NA fragments in the sample at an individual sequencing depth Di; aligning the sequenced NA fragments to the reference genome to obtain a location for each NA fragment; feeding the aligned sequences and the locations to the cancer classification model to obtain an individual cancer probability score; comparing the individual cancer probability score to an individual cutoff value Ci; and when the individual cancer probability score is greater than the Ci, identifying the sample as from a subject having cancer; wherein the Cp is lower than Ci.
2. The method of claim 1, wherein the Dp is lower than N x Di.
3. The method of claim 2, wherein the Dp is at least 10%, 20% or 30% lower than N x Di.
4. The method of any preceding claim, wherein the Cp is calculated with a cancer incidence rate associated with the demographic of the subjects.
5. The method of claim 4, wherein the demographic comprises one or more selected from the group consisting of geographic location, gender, age, race, medical history, and employment history.
6. The method of claim 4, wherein the Cp is calculated with samples of individuals in the demographic.
7. The method of any preceding claim, wherein each NA fragment in the pooled sample is ligated to a sample barcode (SB) that is unique to the sample from which the NA fragment is obtained.
8. The method of claim 7, further comprising, in step (A), identifying NA fragments contributing significantly to the pooled cancer probability score, and one or more sample barcodes ligated to the identified NA fragments.
9. The method of any preceding claim, further comprising, in each of steps (A) and (B), modifying unmethylated cytosine, prior to sequencing.
10. The method of claim 9, further comprising, in each of steps (A) and (B), identifying the methylation status of one or more of the NA fragments.
11. The method of claim 10, wherein the methylation status is conversion of a cytosine to a 5-methylcytosine (5-mC) or to a 5-hydroxymethylcytosine (5-hmC).
12. The method of claim 9, wherein the modifying unmethylated cytosine is done with bisulfite or enzymatic treatment.
13. The method of any preceding claim, wherein the sequencing in each of steps (A) and (B) is deep sequencing.
14. The method of any preceding claim, wherein the step (B) further comprises identifying a tissue origin of the cancer.
15. The method of any preceding claim, wherein each sample comprises blood, plasma, serum, semen, milk, urine, saliva or cerebral spinal fluid, acquired from a human subject.
16. The method of any preceding claim, wherein the NA fragments are cell-free DNA fragments.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202463623693P 2024-01-22 2024-01-22
US63/623,693 2024-01-22

Publications (1)

Publication Number Publication Date
WO2025160074A1 (en) 2025-07-31

Family

ID=94598639

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2025/012431 Pending WO2025160074A1 (en) 2024-01-22 2025-01-21 Disease classification with group testing

Country Status (1)

Country Link
WO (1) WO2025160074A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200365229A1 (en) * 2019-05-13 2020-11-19 Grail, Inc. Model-based featurization and classification
WO2021202970A1 (en) * 2020-04-02 2021-10-07 The Broad Institute, Inc. Sequencing-based population scale screening
US20230132951A1 (en) * 2016-10-24 2023-05-04 The Chinese University Of Hong Kong Methods and systems for tumor detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25704745

Country of ref document: EP

Kind code of ref document: A1