WO2020163410A1 - Detecting cancer, cancer tissue of origin, and/or a cancer cell type - Google Patents

Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Info

Publication number
WO2020163410A1
WO2020163410A1 PCT/US2020/016684 US2020016684W WO2020163410A1 WO 2020163410 A1 WO2020163410 A1 WO 2020163410A1 US 2020016684 W US2020016684 W US 2020016684W WO 2020163410 A1 WO2020163410 A1 WO 2020163410A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
list
genomic regions
target genomic
composition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2020/016684
Other languages
English (en)
Inventor
Oliver Claude VENN
Alexander P. FIELDS
Samuel S. Gross
Qinwen LIU
Jan Schellenberger
Joerg Bredno
John F. BEAUSANG
Seyedmehdi SHOJAEE
Onur Sakarya
M. Cyrus MAHER
Arash Jamshidi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Grail Inc
Original Assignee
Grail Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/US2020/015082 (WO2020154682A2)
Priority claimed from PCT/US2020/016673 (WO2020163403A1)
Priority to EP20752248.3A (EP3921444B1)
Priority to CN202080025351.1A (CN114026255B)
Priority to ES20752248T (ES2993312T1)
Priority to AU2020219853A (AU2020219853A1)
Application filed by Grail Inc
Priority to CA3129043A (CA3129043A1)
Priority to EP24204942.7A (EP4502178A3)
Publication of WO2020163410A1
Priority to IL285316A (IL285316A)
Priority to US17/393,625 (US20220098672A1)
Anticipated expiration
Current legal status: Ceased

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813 Hybridisation assays
    • C12Q1/6816 Hybridisation assays characterised by the detection means
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/16 Primer sets for multiplex assays

Definitions

  • DNA methylation plays an important role in regulating gene expression. Aberrant DNA methylation has been implicated in many disease processes, including cancer. DNA methylation profiling using methylation sequencing (e.g., whole genome bisulfite sequencing (WGBS)) is increasingly recognized as a valuable diagnostic tool for detection, diagnosis, and/or monitoring of cancer. For example, specific patterns of differentially methylated regions may be useful as molecular markers for various diseases.
  • WGBS: whole genome bisulfite sequencing
  • WGBS is not ideally suitable for a product assay. The reason is that the vast majority of the genome is either not differentially methylated in cancer, or the local CpG density is too low to provide a robust signal. Only a few percent of the genome is likely to be useful in classification.
  • Determining differentially methylated regions in a disease group only holds weight in comparison with a group of control subjects; if the control group is small, the determination loses confidence.
  • Methylation status can vary, which can be difficult to account for when determining whether regions are differentially methylated in a disease group.
  • Methylation of a cytosine at a CpG site is strongly correlated with methylation at a subsequent CpG site. Encapsulating this dependency is a challenge in itself.
  • compositions comprising a plurality of different bait oligonucleotides, wherein the plurality of different bait oligonucleotides are configured to collectively hybridize to DNA molecules derived from at least 100 target genomic regions and wherein each genomic region of the at least 100 target genomic regions is differentially methylated in at least one cancer type relative to another cancer type or relative to a non-cancer type.
  • the at least 100 target genomic regions comprise at least one, at least 5, at least 10, at least 20, at least 50, or at least 100 target genomic regions that are differentially methylated in at least a first cancer type relative to a second cancer type and relative to a non-cancer type.
  • the at least 100 target genomic regions comprise at least one target genomic region that is differentially methylated in the first cancer type relative to two or more, three or more, four or more, five or more, ten or more, twelve or more, or fifteen or more other cancer types. In some embodiments, the at least 100 target genomic regions comprise, for all possible pairs between the one cancer type and at least 10, at least 12, at least 15 or at least 18 other cancer types or the non-cancer type, at least one target genomic region that is differentially methylated between the pair of cancer types.
  • the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 1-49. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 1-49. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% or at least 40% of the target genomic regions of any one of Lists 1-15. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% or at least 40% of the target genomic regions of Lists 1-15.
  • the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 16-32. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 16-32. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 33-49. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 33-49.
  • compositions comprising a plurality of different bait oligonucleotides configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 1-49.
  • the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 1-49.
  • the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% or at least 40% of the target genomic regions of any one of Lists 1-15.
  • the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% or at least 40% of the target genomic regions of Lists 1-15. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 16-32. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 16-32. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 33-49. In some embodiments, the plurality of bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of Lists 33-49.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 1.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 1.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 2.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 2.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 3.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 3.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 4.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 4.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 5.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 5.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 6.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 6.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 7.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 7.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 8.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 8.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 9.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 9.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 10. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 10.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 11. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 11.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 12. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 12.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 13. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 13.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 14. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 14.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 15. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 15.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 16. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 16.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 17. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 17.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 18. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 18.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 19. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 19.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 20.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 20.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 21. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 21.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 22. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 22.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 23. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 23.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 24.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 24.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 25.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 25.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 26. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 26.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 27. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 27.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 28. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 28.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 29. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 29.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 30.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 30.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 31.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 31.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 32.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 32.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 33.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 33.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 34. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 34.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 35. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 35.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 36.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 36.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 37.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 37.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 38. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 38.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 39. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 39.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 40.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 40.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 41.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 41.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 42.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 42.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 43.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 43.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 44.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 44.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 45. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 45.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 46. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 46.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 47. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 47.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 48. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 48.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of List 49. In some embodiments, the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of List 49.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions from any two or more, three or more, four or more, or five or more of Lists 16-32.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions from any two or more, three or more, four or more, or five or more, six or more, seven or more, eight or more, nine or more, or ten or more of Lists 16-32.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions from any two or more, three or more, four or more, or five or more of Lists 33-49.
  • the DNA molecules are derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions from any two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, or ten or more of Lists 33-49.
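The percentage thresholds above can be read as a coverage check: the share of a list's target genomic regions that at least one bait can capture. The following is a minimal sketch under stated assumptions, not the applicants' method. It assumes regions and bait targets are given as hypothetical `(chromosome, start, end)` tuples with half-open coordinates, and it treats any positional overlap as "configured to hybridize", which is a simplification.

```python
from typing import Iterable, List, Tuple

# Hypothetical region format: (chromosome, start, end), half-open coordinates.
Region = Tuple[str, int, int]


def covered_fraction(list_regions: Iterable[Region],
                     bait_targets: Iterable[Region]) -> float:
    """Fraction of list regions overlapped by at least one bait target."""
    regions: List[Region] = list(list_regions)
    baits: List[Region] = list(bait_targets)
    if not regions:
        return 0.0
    hits = 0
    for chrom, start, end in regions:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(b_chrom == chrom and b_start < end and start < b_end
               for b_chrom, b_start, b_end in baits):
            hits += 1
    return hits / len(regions)


def meets_threshold(fraction: float, minimum: float = 0.20) -> bool:
    """True when the covered fraction reaches the stated minimum (e.g. 20%)."""
    return fraction >= minimum
```

For example, a bait set overlapping 2 of the 5 regions in a list gives a fraction of 0.4, which satisfies the "at least 20%" and "at least 40%" thresholds but not "at least 50%".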
  • the total size of the target genomic regions is less than 1100 kb, less than 750 kb, less than 270 kb, less than 200 kb, less than 150 kb, less than 100 kb, or less than 50 kb. In some embodiments, the total number of target genomic regions is less than 1700, less than 1300, less than 900, less than 700 or less than 400.
  • the total size of the targeted genomic regions is less than 5,000 kb, less than 2,500 kb, less than 2,000 kb, less than 1,500 kb, less than 1,000 kb, less than 750 kb, or less than 500 kb. In some embodiments, the total number of targeted genomic regions is less than 20,000, less than 18,000, less than 16,000, less than 14,000, less than 12,000, less than 10,000, less than 8,000, less than 6,000, less than 4,000, or less than 2,000.
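The size limits above amount to simple arithmetic over region coordinates. As an illustrative sketch only, the function below sums region lengths in kilobases after merging overlapping regions so that shared bases are not counted twice. The `(chromosome, start, end)` half-open tuple format and the merging step are assumptions made for this example; the source does not say how total size is computed.

```python
from typing import Iterable, List, Tuple

# Hypothetical region format: (chromosome, start, end), half-open coordinates.
Region = Tuple[str, int, int]


def panel_size_kb(regions: Iterable[Region]) -> float:
    """Total size in kb of the union of the given regions."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or adjacent on the same chromosome: extend the last interval.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    total_bp = sum(end - start for _, start, end in merged)
    return total_bp / 1000.0


def within_limits(regions: List[Region], max_kb: float, max_count: int) -> bool:
    """Check both the total-size limit and the region-count limit (strictly less than)."""
    return panel_size_kb(regions) < max_kb and len(regions) < max_count
```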
  • the DNA molecules are converted cfDNA fragments.
  • the target genomic regions are hypermethylated regions, hypomethylated regions, or binary regions that can be either hypermethylated or hypomethylated, as indicated in the sequence listing.
  • the bait oligonucleotides are configured to hybridize to hypermethylated converted DNA molecules, hypomethylated converted DNA molecules, or both hypermethylated and hypomethylated converted DNA molecules derived from each targeted genomic region, as indicated in the sequence listing.
  • the bait oligonucleotides are each conjugated to an affinity moiety.
  • the affinity moiety is biotin.
  • the bait oligonucleotides are each conjugated to a solid surface.
  • the solid surface is a microarray or chip.
  • the bait oligonucleotides each have a length of 45 to 300 nucleotide bases, 75-200 nucleotide bases, 100-150 nucleotide bases, or about 120 nucleotide bases.
  • the bait oligonucleotides comprise a plurality of sets of two or more bait oligonucleotides, wherein each bait oligonucleotide within a set of bait oligonucleotides is configured to bind to the same converted target genomic region or configured to bind to a nucleic acid molecule derived from the target genomic region.
  • each set of bait oligonucleotides comprises 1 or more pairs of a first bait oligonucleotide and a second bait oligonucleotide, wherein each bait oligonucleotide comprises a 5’ end and a 3’ end, wherein a sequence of at least X nucleotide bases at the 3’ end of the first bait oligonucleotide is identical to a sequence of X nucleotide bases at the 5’ end of the second bait oligonucleotide, and wherein X is at least 25, 30, 35, 40, 45, 50, 60, 70, 75 or 100.
  • the first bait oligonucleotide comprises a sequence of at least 31, 40, 50 or 60 nucleotide bases that does not overlap a sequence of the second bait oligonucleotide.
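The overlap relationship between a first and a second bait can be illustrated with a simple tiling scheme. This is a sketch under assumptions: it tiles a (converted) target sequence with fixed-length baits so that the last X bases of each bait equal the first X bases of the next. The function name `tile_baits` and the defaults of 120 bases with X = 60 are illustrative choices taken from the ranges stated above, not a prescribed design.

```python
from typing import List


def tile_baits(target: str, bait_len: int = 120, overlap: int = 60) -> List[str]:
    """Tile a target sequence with baits that overlap their neighbor by `overlap` bases.

    Returns an empty list when the target is shorter than one bait.
    """
    if not 0 < overlap < bait_len:
        raise ValueError("overlap must be positive and smaller than bait_len")
    step = bait_len - overlap  # bases of each bait not shared with the next one
    return [target[i:i + bait_len]
            for i in range(0, len(target) - bait_len + 1, step)]


def is_overlapping_pair(first: str, second: str, x: int) -> bool:
    """True when the 3' end of `first` and the 5' end of `second` share X bases."""
    return len(first) >= x and len(second) >= x and first[-x:] == second[:x]
```

With 120-base baits and X = 60, each bait in a pair also has 60 bases that do not overlap its partner, which satisfies the non-overlapping minimum of 31 bases stated above.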
  • the composition further comprises converted cfDNA from a test subject.
  • the cfDNA from the test subject is converted by a process comprising treatment with bisulfite or a cytosine deaminase.
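Bisulfite treatment converts unmethylated cytosines to uracil (read as thymine after amplification), while methylated cytosines are protected. A bait for a hypermethylated target therefore differs in sequence from a bait for the hypomethylated version of the same region. The sketch below models this in silico under simplifying assumptions: only the given strand is converted, methylation is considered only at CpG sites, and every CpG in a fragment is either methylated or unmethylated. The function name is hypothetical.

```python
def bisulfite_convert(seq: str, cpgs_methylated: bool) -> str:
    """Return the converted sequence of one strand.

    Every C outside a CpG context becomes T. A C inside a CpG stays C only
    when `cpgs_methylated` is True (hypermethylated fragment); otherwise it
    also becomes T (hypomethylated fragment).
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        if base == "C":
            in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
            converted.append("C" if (in_cpg and cpgs_methylated) else "T")
        else:
            converted.append(base)
    return "".join(converted)
```

For the input `ACGTCCGA`, the hypermethylated conversion keeps both CpG cytosines and yields `ACGTTCGA`, while the hypomethylated conversion yields `ATGTTTGA`.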
  • the trained classifier is a mixture model classifier. In some embodiments, the classifier was trained on converted DNA sequences derived from at least 1000, at least 2000, or at least 4000 target genomic regions selected from any one of Lists 1-49.
  • the trained classifier determines the presence or absence of cancer or a cancer type by: (i) generating a set of features for the sample, wherein each feature in the set of features comprises a numerical value; (ii) inputting the set of features into the classifier, wherein the classifier comprises a multinomial classifier; (iii) based on the set of features, determining, at the classifier, a set of probability scores, wherein the set of probability scores comprises one probability score per cancer type class and per non-cancer type class; and (iv) thresholding the set of probability scores based on one or more values determined during training of the classifier to determine a final cancer classification of the sample.
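Steps (i) through (iv) can be sketched as a small pipeline. The code below is illustrative only: it uses a single multinomial logistic regression (softmax over linear scores) with made-up weights, and it interprets the thresholding step as comparing the total cancer probability against one cutoff. The weights, biases, class names, and the specific thresholding rule are all assumptions; the source states only that thresholds are determined during training.

```python
import math
from typing import Dict, List, Tuple

NON_CANCER = "non-cancer"  # hypothetical label for the non-cancer class


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classify(features: List[float],
             weights: Dict[str, List[float]],
             biases: Dict[str, float],
             cancer_threshold: float) -> Tuple[str, Dict[str, float]]:
    """Score one sample and return (label, per-class probabilities)."""
    # (ii)-(iii): one linear score per class, then one probability per class.
    scores = {label: biases[label] + sum(w * x for w, x in zip(ws, features))
              for label, ws in weights.items()}
    probs = softmax(scores)
    # (iv): assumed rule, call cancer only if total cancer probability is high enough.
    if 1.0 - probs[NON_CANCER] < cancer_threshold:
        return NON_CANCER, probs
    cancer_labels = [label for label in probs if label != NON_CANCER]
    return max(cancer_labels, key=lambda label: probs[label]), probs
```

Here `features` plays the role of the numerical feature set of step (i); with binarized features each entry would be 0 or 1.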
  • the set of features comprises a set of binarized features.
  • the numerical value comprises a single binary value.
  • the multinomial classifier comprises a multinomial logistic regression ensemble trained to predict a source tissue for the cancer.
  • the method further comprises determining the final cancer classification based on a top-two probability score differential relative to a minimum value, wherein the minimum value corresponds to a predefined percentage of training cancer samples that had been assigned the correct cancer type as their highest score during training of the classifier.
  • (i) in accordance with a determination that the top-two probability score differential exceeds the minimum value, assigning a cancer label corresponding to the highest probability score determined by the classifier as the final cancer classification; and (ii) in accordance with a determination that the top-two probability score differential does not exceed the minimum value, assigning an indeterminate cancer label as the final cancer classification.
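The top-two differential rule can be expressed in a few lines. This sketch assumes the per-class probability scores are available as a dictionary and that the minimum value has already been fixed during training; how that value is derived from the training samples is not modeled here. The label string `"indeterminate"` is a placeholder.

```python
from typing import Dict


def final_cancer_label(probs: Dict[str, float], min_differential: float) -> str:
    """Assign the top-scoring label only if it beats the runner-up by enough."""
    if len(probs) < 2:
        raise ValueError("need at least two classes to compute a differential")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, second_score) = ranked[0], ranked[1]
    if top_score - second_score > min_differential:
        return top_label
    return "indeterminate"
```

For example, scores of 0.6, 0.3 and 0.1 with a minimum value of 0.2 give a differential of 0.3 and the top label is assigned, whereas scores of 0.45, 0.40 and 0.15 give a differential of 0.05 and the result is indeterminate.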
  • the type of cancer is selected from the group consisting of anorectal cancer, bladder cancer, bladder and urothelial cancer, breast cancer, cervical cancer, colorectal cancer, head and neck cancer, hepatobiliary cancer, liver and bile duct cancer, lung cancer, melanoma, ovarian cancer, pancreatic cancer, pancreatic and gall bladder cancer, prostate cancer, renal cancer, sarcoma, thyroid cancer, upper GI cancer, and uterine cancer.
  • the capture cfDNA fragments are converted cfDNA fragments.
  • cancer assay panels comprising: at least 5 pairs of probes, wherein each pair of the at least 5 pairs comprises two probes configured to overlap each other by an overlapping sequence, wherein the overlapping sequence comprises a sequence of at least 30 nucleotides, wherein the at least 30-nucleotide sequence is configured to hybridize to a converted cfDNA molecule corresponding to, or derived from, one or more genomic regions, wherein each of the genomic regions comprises at least five methylation sites, wherein the at least five methylation sites have an abnormal methylation pattern in first cancerous samples, and wherein each probe of the at least 5 pairs of probes comprises a non-overlapping sequence of at least 31 nucleotides.
  • the cancer assay panels comprise at least 10, at least 20, at least 30, at least 50, at least 100, at least 200, or at least 500 pairs of probes.
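One way to obtain overlapping probe pairs of this kind is to tile probes at half-probe steps so that every base of a target region is covered by two probes. The sketch below assumes a 120-nucleotide probe and a step of half the probe length; the function names and coordinates are hypothetical and serve only to illustrate the geometry.

```python
def tiled_probe_starts(region_start, region_end, probe_len=120):
    """Start coordinates of probes covering [region_start, region_end) twice.

    Assumes an even probe length and a step of half the probe length, so
    adjacent probes overlap by probe_len // 2 nucleotides.
    """
    if probe_len % 2 != 0:
        raise ValueError("probe_len must be even for this simple scheme")
    step = probe_len // 2
    # Begin half a probe upstream so the first target base is covered twice.
    return list(range(region_start - step, region_end, step))


def coverage(starts, probe_len, position):
    """Number of probes that cover a given coordinate."""
    return sum(1 for s in starts if s <= position < s + probe_len)


# A small 100-base target region (hypothetical coordinates).
starts = tiled_probe_starts(1000, 1100)
depths = [coverage(starts, 120, p) for p in range(1000, 1100)]
```

With these assumed values, adjacent probes share 60 nucleotides and each has 60 nucleotides outside the overlap, which satisfies the stated minimums of 30 overlapping and 31 non-overlapping nucleotides.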
  • the genomic regions are selected from a List, and the list is List 1 and the first cancerous samples are samples from subjects having bladder cancer, the list is List 2 and the first cancerous samples are samples from subjects having breast cancer, the list is List 3 and the first cancerous samples are samples from subjects having cervical cancer, the list is List 4 and the first cancerous samples are samples from subjects having colorectal cancer, the list is List 5 and the first cancerous samples are samples from subjects having head and neck cancer, the list is List 6 and the first cancerous samples are samples from subjects having hepatobiliary cancer, the list is List 7 and the first cancerous samples are samples from subjects having lung cancer, the list is List 8 and the first cancerous samples are samples from subjects having melanoma, the list is List 9 and the first cancerous samples are samples from subjects having ovarian cancer, the list is List 10 and the first cancerous samples are samples from subjects having pancreatic cancer, the list is List 11 and the first cancerous samples are samples from subjects having prostate cancer, the list is List 12 and the first
  • the genomic regions are selected from a List
  • the list is List 16 or List 33 and the first cancerous samples are samples from subjects having anorectal cancer
  • the list is List 17 or List 34 and the first cancerous samples are samples from subjects having bladder or urothelial cancer
  • the list is List 18 or List 35 and the first cancerous samples are samples from subjects having breast cancer
  • the list is List 19 or List 36 and the first cancerous samples are samples from subjects having cervical cancer
  • the list is List 20 or List 37 and the first cancerous samples are samples from subjects having colorectal cancer
  • the list is List 21 or List 38 and the first cancerous samples are samples from subjects having head or neck cancer
  • the list is List 22 or List 39 and the first cancerous samples are samples from subjects having liver or bile duct cancer
  • the list is List 23 or List 40 and the first cancerous samples are samples from subjects having lung cancer
  • the list is List 24 or List 41 and the first cancerous samples are samples from subjects having melanoma
  • the list is List 25 or List 42 and the first cancerous
  • the genomic regions comprise at least 20%, 30%, 40%, 50%,
  • the genomic regions comprise at least 30, 53, 103, 159, 160, 200, 250, 300, 400, 500, 600, 800, or 1,000 genomic regions in the List.
  • the converted cfDNA molecules comprise cfDNA molecules treated to convert unmethylated C (cytosine) to U (uracil).
  • each of the at least 5 pairs of probes is conjugated to a non-nucleotide affinity moiety.
  • the non-nucleotide affinity moiety is a biotin moiety.
  • the abnormal methylation pattern has at least a threshold p-value rarity in the first cancerous samples.
  • each of the probes is designed to have sequence homology or sequence complementarity with less than 20 off-target genomic regions. In some embodiments, the less than 20 off-target genomic regions are identified using a k-mer seeding strategy. In some embodiments, the less than 20 off-target genomic regions are identified using a k-mer seeding strategy combined with local alignment at seed locations. In some embodiments, each of the probes comprises at least 61, 75, 100, 120, or 121 nucleotides. In some embodiments, each of the probes comprises less than 300, 250, 200, 160 or 159 nucleotides.
  • each of the probes comprises 100-159 or 100-160 nucleotides. In some embodiments, each of the probes comprises less than 20, 15, 10, 8, or 6 methylation sites. In some embodiments, at least 80, 85, 90, 92, 95, or 98% of the at least five methylation sites are either methylated or unmethylated in the cancerous samples. In some embodiments, at least 3%, 5%, 10%, 15%, or 20% of the probes comprise no G (Guanine). In some embodiments, each of the probes comprises multiple binding sites to the methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the multiple binding sites comprise exclusively either CpG or CpA. In some embodiments, each of the probes is configured to have sequence homology or sequence complementarity with less than 15, 10 or 8 off-target genomic regions.
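The seeding step of a k-mer seeding strategy can be illustrated with a small index lookup. This is a simplified sketch under stated assumptions: the function names, the toy sequences, and the seed length of 6 are hypothetical; reverse complements are not considered; and the local alignment that would follow at each seed location is omitted.

```python
from collections import defaultdict


def build_kmer_index(regions, k):
    """Map each k-mer to the names of the regions that contain it."""
    index = defaultdict(set)
    for name, seq in regions.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_off_targets(probe, index, k, on_target):
    """Regions sharing at least one k-mer seed with the probe.

    The intended (on-target) region is excluded. A fuller pipeline would
    confirm each candidate by local alignment at the seed location.
    """
    hits = set()
    for i in range(len(probe) - k + 1):
        hits |= index.get(probe[i:i + k], set())
    hits.discard(on_target)
    return hits


# Toy sequences, for illustration only.
regions = {
    "target": "ACGTACGTTTGA",
    "r2": "GGGACGTACCC",
    "r3": "TTTTTTTTTT",
}
index = build_kmer_index(regions, k=6)
off_targets = candidate_off_targets("ACGTACGT", index, k=6, on_target="target")
```

A probe would then be kept or rejected by comparing the number of confirmed off-target regions with the chosen limit, such as fewer than 20.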
  • the cancer assay panel comprises at least 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1,200, 1,400, 1,600, 1,800,
  • the at least 5 pairs of probes together comprise at least 10,000, 20,000, 30,000, 40,000, 50,000, 60,000,
  • TOO cancer tissue of origin
  • the method further comprises the step of: determining a health condition by evaluating the set of sequence reads, wherein the health condition is (a) a presence or absence of cancer; (b) a stage of cancer; (c) a presence or absence of a cancer tissue of origin (TOO); (d) a presence or absence of a cancer cell type; or (e) a presence or absence of at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 different types of cancer.
  • the sample comprising a plurality of cfDNA molecules was obtained from a human subject.
  • the plurality of genomic regions are selected from List 1 and the detection of cancer comprises a detection of bladder cancer;
  • the plurality of genomic regions are selected from List 2 and the detection of cancer comprises a detection of breast cancer;
  • the plurality of genomic regions are selected from List 3 and the detection of cancer comprises a detection of cervical cancer;
  • the plurality of genomic regions are selected from List 4 and the detection of cancer comprises a detection of colorectal cancer;
  • the plurality of genomic regions are selected from List 5 and the detection of cancer comprises a detection of head and neck cancer;
  • the plurality of genomic regions are selected from List 6 and the detection of cancer comprises a detection of hepatobiliary cancer;
  • the plurality of genomic regions are selected from List 7 and the detection of cancer comprises a detection of lung cancer;
  • the plurality of genomic regions are selected from List 8 and the detection of cancer comprises a detection of melanoma;
  • the plurality of genomic regions are selected from List 9 and the
  • the plurality of genomic regions are selected from List 16 or List 33 and the detection of cancer comprises a detection of anorectal cancer; the plurality of genomic regions are selected from List 17 or List 34 and the detection of cancer comprises a detection of bladder or urothelial cancer; the plurality of genomic regions are selected from List 18 or List 35 and the detection of cancer comprises a detection of breast cancer; the plurality of genomic regions are selected from List 19 or List 36 and the detection of cancer comprises a detection of cervical cancer; the plurality of genomic regions are selected from List 20 or List 37 and the detection of cancer comprises a detection of colorectal cancer; the plurality of genomic regions are selected from List 21 or List 38 and the detection of cancer comprises a detection of head and neck cancer; the plurality of genomic regions are selected from List 22 or List 39 and the detection of cancer comprises a detection of liver or bile duct cancer; the plurality of genomic regions are selected from List 23 or List 40 and the detection of cancer comprises a detection of lung cancer; the plurality of genomic regions are selected from List 23 or List 40 and the detection of
  • gastrointestinal tract cancer or the plurality of genomic regions are selected from List 32 or List 49 and the detection of cancer comprises a detection of uterine cancer.
  • the plurality of genomic regions comprises at least 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 100% of the genomic regions of the List. In some embodiments, the plurality of genomic regions comprises at least 30, 50, 100, 150, 200, 250, or 300 of the genomic regions of the List. In some embodiments, the plurality of genomic regions comprises less than 90%, 80%, 70%, 60%, 50%, 40%, 30% or 20% of the genomic regions of the List. In some embodiments, the plurality of genomic regions comprises less than 25000, 20000, 15000, 10000, 7500, 5000, or 2500 of the genomic regions of the List. In some embodiments, the plurality of genomic regions comprises less than 1000, 500, 400, 300, 200, or 100 of the genomic regions of the List.
  • cancer assay panels comprising a plurality of probes, wherein each of the plurality of probes is configured to hybridize to a converted cfDNA molecule corresponding to one or more of a plurality of genomic regions selected from one or more of Lists 1 to 15.
  • the converted cfDNA molecules comprise cfDNA molecules treated to convert unmethylated cytosines to uracils.
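The effect of this conversion on a fragment can be modeled in a few lines. The sketch below is a simplified model under assumptions: the function name and the example fragment are hypothetical, methylation is supplied as a set of zero-based positions, and conversion is treated as complete.

```python
def convert_fragment(sequence, methylated_positions):
    """Model conversion of unmethylated cytosines to uracils.

    sequence: DNA string using A, C, G, T.
    methylated_positions: zero-based indices of methylated cytosines,
    which are left unchanged.
    """
    converted = []
    for i, base in enumerate(sequence):
        if base == "C" and i not in methylated_positions:
            converted.append("U")  # unmethylated C becomes U
        else:
            converted.append(base)  # methylated C and other bases are kept
    return "".join(converted)


# Hypothetical fragment with methylated cytosines at positions 1 and 5.
converted = convert_fragment("ACGTCCGA", {1, 5})
```

In this example the cytosine at position 4 is the only one converted, so the methylation state of each site can be read back from whether a C or a U (read as T after amplification) appears at that position.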
  • the plurality of probes are configured to hybridize to nucleic acid molecules
  • the plurality of probes are configured to hybridize to nucleic acid molecules corresponding to, or derived from at least 30, 50, 100, 159, 171, 200, 250, 300, 400, 500, 600, 800, or 1,000 of the genomic regions of a List and the List is one or more of Lists 1 to 15.
  • at least 3%, 5%, 10%, 15%, or 20% of the probes comprise no G (Guanine).
  • each of the probes comprises multiple binding sites to methylation sites of the converted cfDNA molecule, wherein at least 80, 85, 90, 92, 95, or 98% of the multiple binding sites comprise exclusively either CpG or CpA.
  • each of the probes is conjugated to a non-nucleotide affinity moiety.
  • the non-nucleotide affinity moiety is a biotin moiety.
  • the likelihood of a false positive determination of a presence or absence of cancer is less than 1% and the likelihood of an accurate determination of a presence or absence of cancer is at least 40%.
  • the cancer is a stage I cancer, the likelihood of a false positive determination of a presence or absence of cancer is less than 1%, and the likelihood of an accurate
  • Described herein, in certain embodiments, are methods of detecting a cancer type comprising: (i) capturing cfDNA fragments from a subject with a composition comprising a plurality of different oligonucleotide baits, (ii) sequencing the captured cfDNA fragments, and (iii) applying a trained classifier to the cfDNA sequences to determine a cancer type; wherein the oligonucleotide baits are configured to hybridize to cfDNA fragments derived from a plurality of target genomic regions, wherein the plurality of target genomic regions is differentially methylated in one or more cancer types relative to a different cancer type or a non-cancer type, wherein the likelihood of a false-positive determination of cancer is less than 1%, and wherein the likelihood of an accurate assignment of a cancer type is at least 75%,
  • the cancer type is selected from uterine cancer, upper GI squamous cancer, all other upper GI cancers, thyroid cancer, sarcoma, urothelial renal cancer, all other renal cancers, prostate cancer, pancreatic cancer, ovarian cancer, neuroendocrine cancer, multiple myeloma, melanoma, lymphoma, small cell lung cancer, lung adenocarcinoma, all other lung cancers, leukemia, hepatobiliary carcinoma, hepatobiliary biliary, head and neck cancer, colorectal cancer, cervical cancer, breast cancer, bladder cancer, and anorectal cancer.
  • the cancer type is selected from anal cancer, bladder cancer, colorectal cancer, esophageal cancer, head and neck cancer, liver/bile-duct cancer, lung cancer, lymphoma, ovarian cancer, pancreatic cancer, plasma cell neoplasm, and stomach cancer.
  • the cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid neoplasm, renal cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head & neck cancer, colorectal cancer, liver cancer, bile duct cancer, pancreatic cancer, gallbladder cancer, upper GI cancer, multiple myeloma, lymphoid neoplasm, and lung cancer.
  • the cancer type is a stage I cancer type, and the likelihood of an accurate assignment is at least 70% or at least 75%. In some embodiments, the cancer type is a stage II cancer type, and the likelihood of an accurate assignment is at least 85%.
  • the cancer type is anorectal cancer
  • the target genomic regions are selected from Lists 16 or 33
  • the accuracy of detecting anorectal cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II anorectal cancer
  • the target genomic regions are selected from Lists 16 or 33
  • the accuracy of detecting stage I or stage II anorectal cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is bladder & urothelial cancer
  • the target genomic regions are selected from Lists 1, 17 or 34
  • the accuracy of detecting bladder & urothelial cancer among samples with detected cancer is at least 80% or 90%.
  • the cancer type is stage I or stage II bladder & urothelial cancer
  • the target genomic regions are selected from Lists 1, 17 or 34
  • the accuracy of detecting stage I or stage II bladder & urothelial cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is breast cancer
  • the target genomic regions are selected from Lists 2, 18 or 35
  • the accuracy of detecting breast cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II breast cancer
  • the target genomic regions are selected from Lists 2, 18 or 35
  • the accuracy of detecting stage I or stage II breast cancer among samples with detected cancer is at least 75% or 84%.
  • the cancer type is cervical cancer
  • the target genomic regions are selected from Lists 3, 19 or 36
  • the accuracy of detecting cervical cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II cervical cancer
  • the target genomic regions are selected from Lists 3, 19 or 36
  • the accuracy of detecting stage I or stage II cervical cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is colorectal cancer
  • the target genomic regions are selected from Lists 4, 20 or 37
  • the accuracy of detecting colorectal cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II colorectal cancer
  • the target genomic regions are selected from Lists 4, 20 or 37
  • the accuracy of detecting stage I or stage II colorectal cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is head & neck cancer
  • the target genomic regions are selected from Lists 5, 21 or 38
  • the accuracy of detecting head & neck cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II head & neck cancer
  • the target genomic regions are selected from Lists 5, 21 or 38
  • the accuracy of detecting stage I or stage II head & neck cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is liver & bile duct cancer
  • the target genomic regions are selected from Lists 6, 22, or 39
  • the accuracy of detecting liver & bile duct cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II liver & bile duct cancer
  • the target genomic regions are selected from Lists 6, 22, or 39
  • the accuracy of detecting stage I or stage II liver & bile duct cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is lung cancer
  • the target genomic regions are selected from Lists 7, 23 or 40
  • the accuracy of detecting lung cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II lung cancer
  • the target genomic regions are selected from Lists 7, 23 or 40
  • the accuracy of detecting stage I or stage II lung cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is melanoma
  • the target genomic regions are selected from Lists 8, 24 or 41
  • the accuracy of detecting melanoma among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II melanoma
  • the target genomic regions are selected from Lists 8, 24 or 41
  • the accuracy of detecting stage I or stage II melanoma among samples with detected cancer is at least 75% or 84%.
  • the cancer type is ovarian cancer
  • the target genomic regions are selected from Lists 9, 25 or 42
  • the accuracy of detecting ovarian cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II ovarian cancer
  • the target genomic regions are selected from Lists 9, 25 or 42
  • the accuracy of detecting stage I or stage II ovarian cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is pancreas & gallbladder cancer
  • the target genomic regions are selected from Lists 10, 26 or 43
  • the accuracy of detecting pancreas & gallbladder cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II pancreas & gallbladder cancer
  • the target genomic regions are selected from Lists 10, 26 or 43
  • the accuracy of detecting stage I or stage II pancreas & gallbladder cancer among samples with detected cancer is at least 75%, 81% or 83%.
  • the cancer type is prostate cancer, the target genomic regions are selected from Lists 11, 27 or 44, and the accuracy of detecting prostate cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II prostate cancer, the target genomic regions are selected from Lists 11, 27 or 44, and the accuracy of detecting stage I or stage II prostate cancer among samples with detected cancer is at least 75% or 83%.
  • the cancer type is renal cancer, the target genomic regions are selected from Lists 12, 28 or 45, and the accuracy of detecting renal cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II renal cancer, the target genomic regions are selected from Lists 12, 28 or 45, and the accuracy of detecting stage I or stage II renal cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is sarcoma
  • the target genomic regions are selected from Lists 29 or 46
  • the accuracy of detecting sarcoma among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II sarcoma
  • the target genomic regions are selected from Lists 29 or 46
  • the accuracy of detecting stage I or stage II sarcoma among samples with detected cancer is at least 75% or 83%.
  • the cancer type is thyroid cancer
  • the target genomic regions are selected from Lists 13, 30 or 47
  • the accuracy of detecting thyroid cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II thyroid cancer
  • the target genomic regions are selected from Lists 13, 30 or 47
  • the accuracy of detecting stage I or stage II thyroid cancer among samples with detected cancer is at least 75% or 87%.
  • the cancer type is upper gastrointestinal tract cancer
  • the target genomic regions are selected from Lists 14, 31 or 48
  • the accuracy of detecting upper gastrointestinal tract cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II upper gastrointestinal tract cancer
  • the target genomic regions are selected from Lists 14, 31 or 48
  • the accuracy of detecting stage I or stage II upper gastrointestinal tract cancer among samples with detected cancer is at least 75% or 83%.
  • the cancer type is uterine cancer
  • the target genomic regions are selected from Lists 15, 32 or 49
  • the accuracy of detecting uterine cancer among samples with detected cancer is at least 80% or 88%.
  • the cancer type is stage I or stage II uterine cancer
  • the target genomic regions are selected from Lists 16 or 33
  • the accuracy of detecting stage I or stage II uterine cancer among samples with detected cancer is at least 75% or 85%.
  • the cancer type is anorectal cancer
  • the target genomic regions are selected from Lists 16 or 33
  • the sensitivity for anorectal cancer is at least 65% or 75%.
  • the cancer type is stage I or stage II anorectal cancer
  • the target genomic regions are selected from Lists 16 or 33
  • the sensitivity for stage I or stage II anorectal cancer is at least 65% or 55%.
  • the cancer type is bladder & urothelial cancer
  • the target genomic regions are selected from Lists 1, 17 or 34
  • the sensitivity for bladder & urothelial cancer is at least 50% or 40%.
  • the cancer type is stage I or stage II bladder & urothelial cancer
  • the target genomic regions are selected from Lists 1, 17 or 34
  • the accuracy of detecting stage I or stage II bladder & urothelial cancer is at least 40% or 50%.
  • the cancer type is breast cancer
  • the target genomic regions are selected from Lists 2, 18 or 35
  • the sensitivity for breast cancer is at least 20% or 25%.
  • the cancer type is stage I or stage II breast cancer
  • the target genomic regions are selected from Lists 2, 18 or 35
  • the sensitivity for stage I or stage II breast cancer is at least 15% or 18%.
  • the cancer type is cervical cancer
  • the target genomic regions are selected from Lists 3, 19 or 36
  • the sensitivity for cervical cancer is at least 25% or 35%.
  • the cancer type is stage I or stage II cervical cancer
  • the target genomic regions are selected from Lists 3, 19 or 36
  • the sensitivity for stage I or stage II cervical cancer is at least 17% or 22%.
  • the cancer type is colorectal cancer
  • the target genomic regions are selected from Lists 4, 20 or 37
  • the sensitivity for colorectal cancer is at least 55% or 65%.
  • the cancer type is stage I or stage II colorectal cancer
  • the target genomic regions are selected from Lists 4, 20 or 37
  • the sensitivity for stage I or stage II colorectal cancer is at least 25%, 29% or 34%.
  • the cancer type is head & neck cancer
  • the target genomic regions are selected from Lists 5, 21 or 38
  • the sensitivity for head & neck cancer is at least 70% or 80%.
  • the cancer type is stage I or stage II head & neck cancer
  • the target genomic regions are selected from Lists 5, 21 or 38
  • the sensitivity for stage I or stage II head & neck cancer is at least 70% or 79%.
  • the cancer type is liver & bile duct cancer
  • the target genomic regions are selected from Lists 6, 22, or 39
  • the sensitivity for liver & bile duct cancer is at least 75% or 85%.
  • the cancer type is stage I or stage II liver & bile duct cancer
  • the target genomic regions are selected from Lists 6, 22, or 39
  • the sensitivity for stage I or stage II liver & bile duct cancer is at least 65% or 75%.
  • the cancer type is lung cancer, the target genomic regions are selected from Lists 7, 23 or 40, and the sensitivity for lung cancer is at least 55% or 62%.
  • the cancer type is stage I or stage II lung cancer, the target genomic regions are selected from Lists 7, 23 or 40, and the sensitivity for stage I or stage II lung cancer is at least 20% or 25%.
  • the cancer type is melanoma, the target genomic regions are selected from Lists 8, 24 or 41, and the sensitivity for melanoma is at least 40% or 30%.
  • the cancer type is ovarian cancer
  • the target genomic regions are selected from Lists 9, 25 or 42
  • the sensitivity for ovarian cancer is at least 70% or 80%.
  • the cancer type is pancreas & gallbladder cancer
  • the target genomic regions are selected from Lists 10, 26 or 43
  • the sensitivity for pancreas & gallbladder cancer is at least 60%, 70% or 74%.
  • the cancer type is stage I or stage II pancreas & gallbladder cancer
  • the target genomic regions are selected from Lists 10, 26 or 43
  • the sensitivity for stage I or stage II pancreas & gallbladder cancer is at least 40% or 50%.
  • the cancer type is sarcoma
  • the target genomic regions are selected from Lists 29 or 46
  • the sensitivity for sarcoma is at least 40% or 50%.
  • the cancer type is upper gastrointestinal tract cancer
  • the target genomic regions are selected from Lists 14, 31 or 48
  • the sensitivity for upper gastrointestinal tract cancer is at least 70% or 60%.
  • the cancer type is stage I or stage II upper gastrointestinal tract cancer
  • the target genomic regions are selected from Lists 14, 31 or 48
  • the sensitivity for stage I or stage II upper gastrointestinal tract cancer is at least 35% or 45%.
  • the composition comprising oligonucleotide baits is the composition of any one of the compositions described herein or any one of the cancer assay panels described herein.
  • the plurality of genomic regions comprises no more than 1700, 1300, 900, 700 or 400 genomic regions.
  • the total size of the plurality of genomic regions is less than 4 MB, less than 2 MB, less than 1100 kb, less than 750 kb, less than 270 kb, less than 200 kb, less than 150 kb, less than 100 kb, or less than 50 kb.
  • the subject has an elevated risk of one or more cancer types.
  • the subject manifests symptoms associated with one or more cancer types.
  • the subject has not been diagnosed with a cancer.
  • the classifier was trained on converted DNA sequences derived from at least 100 subjects with a first cancer type, at least 100 subjects with a second cancer type, and at least 100 subjects with no cancer.
  • the first cancer type is ovarian cancer. In some embodiments, the first cancer type is colorectal cancer.
  • the first cancer type is selected from thyroid cancer, melanoma, sarcoma, myeloid neoplasm, renal cancer, prostate cancer, breast cancer, uterine cancer, ovarian cancer, bladder cancer, urothelial cancer, cervical cancer, anorectal cancer, head & neck cancer, colorectal cancer, liver cancer, pancreatic cancer, gallbladder cancer, esophageal cancer, stomach cancer, multiple myeloma, lymphoid neoplasm, lung cancer, or leukemia.
  • the classifier was trained on converted DNA sequences derived from at least 1000, at least 2000, or at least 4000 target genomic regions selected from any one of Lists 1-49.
  • the trained classifier determines the presence or absence of cancer or a cancer type by: (i) generating a set of features for the sample, wherein each feature in the set of features comprises a numerical value; (ii) inputting the set of features into the classifier, wherein the classifier comprises a multinomial classifier; (iii) based on the set of features, determining, at the classifier, a set of probability scores, wherein the set of probability scores comprises one probability score per cancer type class and per non-cancer type class; and (iv) thresholding the set of probability scores based on one or more values determined during training of the classifier to determine a final cancer classification of the sample.
  • the set of features comprises a set of binarized features.
  • the numerical value comprises a single binary value.
  • the multinomial classifier comprises a multinomial logistic regression ensemble trained to predict a source tissue for the cancer.
  • the method further comprises determining the final cancer classification based on a top-two probability score differential relative to a minimum value, wherein the minimum value corresponds to a predefined percentage of training cancer samples that had been assigned the correct cancer type as their highest score during training of the classifier.
  • the anti-cancer agent is a chemotherapeutic agent selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, anti-tumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, and platinum-based agents.
  • FIG. 1A illustrates a 2x tiled probe design, with three probes targeting a small target region, where each base in a target region (boxed in the dotted rectangle) is covered by at least two probes, according to an embodiment.
  • FIG. 1B illustrates a 2x tiled probe design, with more than three probes targeting a larger target region, where each base in a target region (boxed in the dotted rectangle) is covered by at least two probes, according to an embodiment.
  • FIG. 1C illustrates probe design targeting hypomethylated and/or hypermethylated fragments in genomic regions, according to an embodiment.
  • FIG. 2 illustrates a process of generating a cancer assay panel, according to an embodiment.
  • FIG. 3A is a flowchart describing a process of creating a data structure for a control group, according to an embodiment.
  • FIG. 3B is a flowchart describing an additional step of validating the data structure for the control group of FIG. 3A, according to an embodiment.
  • FIG. 4 is a flowchart describing a process for selecting genomic regions for designing probes for a cancer assay panel, according to an embodiment.
  • FIG. 5 is an illustration of an example p-value score calculation, according to an embodiment.
  • FIG. 6A is a flowchart describing a process of training a classifier based on hypomethylated and hypermethylated fragments indicative of cancer, according to an embodiment.
  • FIG. 6B is a flowchart describing a process of identifying fragments indicative of cancer determined by probabilistic models, according to an embodiment.
  • FIG. 7A is a flowchart describing a process of sequencing a fragment of cell-free (cf) DNA, according to an embodiment.
  • FIG. 7B is an illustration of the process of FIG. 7A of sequencing a fragment of cell-free (cf) DNA to obtain a methylation state vector, according to an embodiment.
  • FIG. 8A illustrates extent of bisulfite conversion (upper panel) and mean
  • FIG. 8B illustrates concentration of cfDNA per sample across varying stages of cancer.
  • FIG. 9 is a graph of the amounts of DNA fragments binding to probes depending on the sizes of overlaps between the DNA fragments and the probes.
  • FIG. 10A illustrates a flowchart of devices for sequencing nucleic acid samples according to one embodiment.
  • FIG. 10B illustrates an analytic system that analyzes methylation status of cfDNA according to one embodiment.
  • FIG. 11 is a color-coded graph presenting numbers of genomic regions selected for differentiating each target TOO (x-axis) from a contrast TOO (y-axis).
  • FIG. 12 provides data for verifying selected genomic regions using cfDNA and WBC gDNA. Fractions (y-axis) classifying each TOO (x-axis) correctly are provided.
  • FIG. 13 is a receiver operating characteristic (ROC) curve comparing the true positive rate and false positive rate of cancer detection by a trained classifier utilizing methylation status information from the target genomic regions of List 23 (optimized for lung cancer).
  • any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment.
  • the appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, thereby providing a framework for various possibilities of described embodiments to function together.
  • the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof are intended to cover a non-exclusive inclusion.
  • a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
  • “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
  • ranges and amounts can be expressed as “about” a particular value or range. About also includes the exact amount. Hence “about 5 pg” means “about 5 pg” and also “5 pg.” Generally, the term “about” includes an amount that would be expected to be within experimental error. In some embodiments, “about” refers to the number or value recited, plus or minus 20%, 10%, or 5% of the number or value. Additionally, ranges recited herein are understood to be shorthand for all of the values within the range, inclusive of the recited endpoints. For example, a range of 1 to 50 is understood to include any number, combination of numbers, or sub-range from the group consisting of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
  • methylation refers to a process by which a methyl group is added to a DNA molecule.
  • a hydrogen atom on the pyrimidine ring of a cytosine base can be converted to a methyl group, forming 5-methylcytosine.
  • the term also refers to a process by which a hydroxymethyl group is added to a DNA molecule, for example by oxidation of a methyl group on the pyrimidine ring of a cytosine base.
  • methylation and hydroxymethylation tend to occur at dinucleotides of cytosine and guanine, referred to herein as “CpG sites.”
  • methylation can also refer to the methylation status of a CpG site.
  • a CpG site with a 5-methylcytosine moiety is methylated.
  • a CpG site with a hydrogen atom on the pyrimidine ring of the cytosine base is unmethylated.
  • the wet laboratory assay used to detect methylation may vary from those described herein as is well known in the art.
  • methylation site refers to a region of a DNA molecule where a methyl group can be added. CpG sites are the most common methylation sites, but methylation sites are not limited to CpG sites.
  • DNA methylation may occur in cytosines in CHG and CHH, where H is adenine, cytosine or thymine. Cytosine methylation in the form of 5-hydroxymethylcytosine, and features thereof, may also be assessed using the methods and procedures disclosed herein (see, e.g., WO 2010/037001 and WO 2011/127136, which are incorporated herein by reference).
  • CpG site refers to a region of a DNA molecule where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases along its 5' to 3' direction. “CpG” is a shorthand for 5'-C-phosphate-G-3', that is, cytosine and guanine separated by only one phosphate group. Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosine.
  • CpG detection site refers to a region in a probe that is configured to hybridize to a CpG site of a target DNA molecule.
  • the CpG site on the target DNA molecule can comprise cytosine and guanine separated by one phosphate group, where cytosine is methylated or unmethylated.
  • the CpG site on the target DNA molecule can comprise uracil and guanine separated by one phosphate group, where the uracil is generated by the conversion of unmethylated cytosine.
  • UpG is a shorthand for 5'-U-phosphate-G-3', that is, uracil and guanine separated by only one phosphate group. UpG can be generated by a bisulfite treatment of a DNA that converts unmethylated cytosines to uracils. Cytosines can be converted to uracils by other methods known in the art, such as chemical modification, synthesis, or enzymatic conversion.
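  • The conversion described above can be illustrated with a minimal sketch. The function name `bisulfite_convert`, the 0-based position convention, and the example fragment are assumptions made for illustration only; they are not part of the disclosure.

```python
from typing import Iterable


def bisulfite_convert(seq: str, methylated: Iterable[int]) -> str:
    """Return the converted strand.

    Unmethylated cytosines become uracil (U); cytosines whose 0-based
    positions are listed in `methylated` are protected and stay C.
    """
    protected = set(methylated)
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in protected:
            out.append("U")  # unmethylated C is converted
        else:
            out.append(base)  # methylated C and all other bases are kept
    return "".join(out)


# Hypothetical fragment with CpG sites starting at positions 1 and 5.
fragment = "ACGTACGT"
# Only the cytosine at position 1 is methylated, so the second CpG becomes UpG.
converted = bisulfite_convert(fragment, methylated=[1])
```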
  • the terms “hypomethylated” and “hypermethylated” refer to a methylation status of a DNA molecule containing multiple CpG sites (e.g., more than 3, 4, 5, 6, 7, 8, 9, 10, etc.) where a high percentage of the CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage within the range of 50%-100%) are unmethylated or methylated, respectively.
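  • As a rough sketch of this definition, a fragment can be labeled from its per-site states. The function name `classify_fragment` and its defaults (at least 5 CpG sites, i.e. more than 4, and more than 80% agreement) are illustrative assumptions picked from the example values above.

```python
from typing import Optional, Sequence


def classify_fragment(
    states: Sequence[str], min_sites: int = 5, threshold: float = 0.80
) -> Optional[str]:
    """Label a fragment from its CpG states ('M' methylated, 'U' unmethylated).

    Returns 'hypermethylated', 'hypomethylated', or None when the fragment
    has too few CpG sites or no state exceeds the threshold.
    """
    if len(states) < min_sites:
        return None
    frac_methylated = sum(s == "M" for s in states) / len(states)
    if frac_methylated > threshold:
        return "hypermethylated"
    if 1.0 - frac_methylated > threshold:
        return "hypomethylated"
    return None


label = classify_fragment("UUUUUM")  # 5 of 6 sites unmethylated
```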
  • methylation state vector or “methylation status vector” as used herein refers to a vector comprising multiple elements, where each element indicates the methylation status of a methylation site in a DNA molecule comprising multiple methylation sites, in the order they appear from 5' to 3' in the DNA molecule. For example, $\langle M_x, M_{x+1}, M_{x+2} \rangle$, $\langle M_x, M_{x+1}, U_{x+2} \rangle$, ..., $\langle U_x, U_{x+1}, U_{x+2} \rangle$ can be methylation vectors for DNA molecules comprising three methylation sites, where M represents a methylated methylation site and U represents an unmethylated methylation site.
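  • The set of possible methylation state vectors for a fragment with a given number of sites can be enumerated directly. This is a minimal sketch; the helper name `all_state_vectors` is hypothetical.

```python
from itertools import product


def all_state_vectors(n_sites: int):
    """Enumerate every methylation state vector over n_sites sites.

    Each vector is a tuple of 'M' (methylated) or 'U' (unmethylated),
    ordered 5' to 3'. There are 2 ** n_sites such vectors.
    """
    return [tuple(v) for v in product("MU", repeat=n_sites)]


vectors = all_state_vectors(3)  # 8 vectors for three methylation sites
```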
  • the term “abnormal methylation pattern” or “anomalous methylation pattern” as used herein refers to the methylation pattern of a DNA molecule or a methylation state vector that is expected to be found in a sample less frequently than a threshold value.
  • the expectedness of finding a specific methylation state vector in a healthy control group comprising healthy individuals is represented by a p-value.
  • a low p-value score generally corresponds to a methylation state vector which is relatively unexpected in comparison to other methylation state vectors within samples from healthy individuals.
  • a high p-value score generally corresponds to a methylation state vector which is relatively more expected in comparison to other methylation state vectors found in samples from healthy individuals in the healthy control group.
  • a methylation state vector having a p-value lower than a threshold value (e.g., 0.1, 0.01, 0.001, 0.0001, etc.) is considered to have an abnormal methylation pattern.
  • Various methods known in the art can be used to calculate a p-value or expectedness of a methylation pattern or a methylation state vector. Exemplary methods provided herein involve use of a Markov chain probability that assumes methylation statuses of CpG sites to be dependent on methylation statuses of neighboring CpG sites.
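  • A Markov chain calculation of this kind might be sketched as follows. This is a simplified illustration, not the disclosed implementation: it assumes a first-order chain whose transition probabilities are the same at every site, and it assumes the p-value is the total probability of all same-length vectors that are no more likely than the observed one. The function names and all parameter values are hypothetical.

```python
from itertools import product


def markov_probability(vector, p_first_m, p_m_given_prev):
    """Probability of a state vector under a first-order Markov chain.

    p_first_m      -- P(first site is 'M')
    p_m_given_prev -- {'M': P(M | previous M), 'U': P(M | previous U)}
    """
    prob = p_first_m if vector[0] == "M" else 1.0 - p_first_m
    for prev, cur in zip(vector, vector[1:]):
        p_m = p_m_given_prev[prev]
        prob *= p_m if cur == "M" else 1.0 - p_m
    return prob


def p_value(vector, p_first_m, p_m_given_prev):
    """Sum the probabilities of all vectors no more likely than `vector`."""
    observed = markov_probability(vector, p_first_m, p_m_given_prev)
    total = 0.0
    for other in product("MU", repeat=len(vector)):
        p = markov_probability(other, p_first_m, p_m_given_prev)
        if p <= observed + 1e-12:  # small tolerance for float ties
            total += p
    return total


# Hypothetical parameters, as if estimated from healthy control samples.
params = (0.9, {"M": 0.9, "U": 0.2})
common = p_value("MMM", *params)  # the most expected vector
rare = p_value("MUM", *params)    # an unexpected vector
```

  • With these assumed parameters, the rare vector receives a much lower p-value than the common one, so it would fall below a stricter threshold first.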
  • Alternate methods provided herein calculate the expectedness of observing a specific methylation state vector in healthy individuals by utilizing a mixture model including multiple mixture components, each being an independent-sites model where methylation at each CpG site is assumed to be independent of methylation statuses at other CpG sites.
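  • The mixture of independent-sites models can be sketched in a few lines. The function name `mixture_probability`, the two-component setup, and the numeric values are assumptions for illustration; how the components are fitted is not shown.

```python
def mixture_probability(vector, weights, site_probs):
    """Probability of a state vector under a mixture of independent-sites models.

    weights       -- mixture weights, one per component, summing to 1
    site_probs[k] -- for component k, P(site i is 'M') for each site i
    """
    total = 0.0
    for weight, probs in zip(weights, site_probs):
        component = 1.0
        for state, p_m in zip(vector, probs):
            # Sites are independent within a component, so probabilities multiply.
            component *= p_m if state == "M" else 1.0 - p_m
        total += weight * component
    return total


# Hypothetical mixture: one mostly methylated and one mostly unmethylated component.
weights = [0.5, 0.5]
site_probs = [[0.9, 0.9], [0.1, 0.1]]
p_mm = mixture_probability("MM", weights, site_probs)
p_mu = mixture_probability("MU", weights, site_probs)
```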
  • cancerous sample refers to a sample comprising genomic DNAs from an individual diagnosed with cancer.
  • the genomic DNAs can be, but are not limited to, cfDNA fragments or chromosomal DNAs from a subject with cancer.
  • the genomic DNAs can be sequenced (or otherwise detected) and their methylation status can be assessed by methods known in the art, for example, bisulfite sequencing.
  • genomic sequences are obtained from a public database (e.g., The Cancer Genome Atlas (TCGA)) or experimentally obtained by sequencing a genome of an individual diagnosed with cancer.
  • cancerous sample can refer to genomic DNAs or cfDNA fragments having the genomic sequences.
  • cancerous samples, as a plural, refers to samples comprising genomic DNAs from multiple individuals, each individual diagnosed with cancer.
  • cancerous samples from more than 100, 300, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 40,000, 50,000, or more individuals diagnosed with cancer are used.
  • non-cancerous sample or “healthy sample” as used herein refers to a sample comprising genomic DNAs from an individual not diagnosed with cancer.
  • the genomic DNAs can be, but are not limited to, cfDNA fragments or chromosomal DNAs from a subject without cancer.
  • the genomic DNAs can be sequenced (or otherwise detected) and their methylation status can be assessed by methods known in the art, for example, bisulfite sequencing.
  • genomic sequences are obtained from a public database (e.g., The Cancer Genome Atlas (TCGA)) or experimentally obtained by sequencing a genome of an individual without cancer.
  • non-cancerous sample can refer to genomic DNAs or cfDNA fragments having the genomic sequences.
  • non-cancerous samples refers to samples comprising genomic DNAs from multiple individuals, each individual being without cancer.
  • healthy samples from more than 100, 300, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 40,000, 50,000, or more individuals without cancer are used.
  • the term “training sample” as used herein refers to a sample used to train a classifier described herein and/or to select one or more genomic regions for cancer detection or detecting a cancer tissue of origin or cancer cell-type.
  • the training samples can comprise genomic DNAs or a modification thereof, from one or more healthy subjects and from one or more subjects having a disease condition (e.g., cancer, a specific type of cancer, a specific stage of cancer, etc.).
  • the genomic DNAs can be, but are not limited to, cfDNA fragments or chromosomal DNAs.
  • the genomic DNAs can be sequenced (or otherwise detected) and their methylation status can be assessed by methods known in the art, for example, bisulfite sequencing.
  • TCGA refers to The Cancer Genome Atlas.
  • test sample refers to a sample from a subject, whose health condition was, has been or will be tested using a classifier and/or an assay panel described herein.
  • the test sample can comprise genomic DNAs or a modification thereof.
  • the genomic DNAs can be, but are not limited to, cfDNA fragments or chromosomal DNAs.
  • target genomic region refers to a region in a genome selected for analysis in test samples.
  • An assay panel is generated with probes designed to hybridize to (and optionally pull down) nucleic acid fragments derived from the target genomic region or a fragment thereof.
  • a nucleic acid fragment derived from the target genomic region refers to a nucleic acid fragment generated by degradation, cleavage, bisulfite conversion, or other processing of the DNA from the target genomic region.
  • sequence listing includes the following information: (1) the chromosome on which the region is located, along with the start and stop position of the genomic region, (2) whether the region is hypo- or hypermethylated in cancer (or “binary” if both the hypomethylated and hypermethylated states are informative).
  • the chromosome numbers and the start and stop positions are provided relative to a known human reference genome, hg19.
  • the sequence of the human reference genome, hg19, is available from Genome Reference
  • Probes can be designed to hybridize to one or both sequences.
  • probes hybridize to converted sequences resulting from, for example, treatment with sodium bisulfite.
  • off-target genomic region refers to a region in a genome which has not been selected for analysis in test samples, but has sufficient homology to a target genomic region to potentially be bound and pulled down by a probe designed to target the target genomic region.
  • an off-target genomic region is a genomic region that aligns to a probe along at least 45 bp with at least a 90% match rate.
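  • The off-target criterion above (alignment along at least 45 bp with at least a 90% match rate) can be approximated with an ungapped sliding-window comparison. This is only a sketch under that simplifying assumption; a production pipeline would likely use a dedicated aligner that also handles gaps. The function name `is_off_target` is hypothetical.

```python
def is_off_target(probe: str, region: str, min_len: int = 45,
                  min_identity: float = 0.90) -> bool:
    """Return True if any ungapped min_len window of the probe matches a
    window of the region with at least min_identity identical bases."""
    needed = min_identity * min_len
    for i in range(len(probe) - min_len + 1):
        window = probe[i:i + min_len]
        for j in range(len(region) - min_len + 1):
            candidate = region[j:j + min_len]
            matches = sum(a == b for a, b in zip(window, candidate))
            if matches >= needed:
                return True
    return False


# Hypothetical 60-nt probe and a region that contains 50 nt of it.
probe = "ACGT" * 15
homologous_region = "TTTT" + probe[:50] + "GGGG"
hit = is_off_target(probe, homologous_region)
```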
  • the terms “converted DNA molecules,” “converted cfDNA molecules,” and “modified fragment obtained from processing of the cfDNA molecules” refer to DNA molecules obtained by processing DNA or cfDNA molecules in a sample for the purpose of differentiating a methylated nucleotide and an unmethylated nucleotide in the DNA or cfDNA molecules.
  • the sample can be treated with bisulfite ion (e.g., using sodium bisulfite), as is well-known in the art, to convert unmethylated cytosines (“C”) to uracils (“U”).
  • the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic conversion reaction, for example, using a cytidine deaminase (such as APOBEC).
  • converted DNA molecules or cfDNA molecules include additional uracils which are not present in the original cfDNA sample. Replication by DNA polymerase of a DNA strand comprising a uracil results in addition of an adenine to the nascent complementary strand instead of the guanine normally added as the complement to a cytosine or methylcytosine.
  • cfDNA refers to nucleic acid fragments that circulate in an individual’s body (e.g., bloodstream) and originate from one or more healthy cells and/or from one or more cancerous cells. Additionally, cfDNA may come from other sources such as viruses, fetuses, etc.
  • circulating tumor DNA or “ctDNA” refers to nucleic acid fragments that originate from tumor cells, which may be released into an individual’s bloodstream as a result of biological processes such as apoptosis or necrosis of dying cells or actively released by viable tumor cells.
  • fragment can refer to a fragment of a nucleic acid molecule.
  • a fragment can refer to a cfDNA molecule in a blood or plasma sample, or a cfDNA molecule that has been extracted from a blood or plasma sample.
  • An amplification product of a cfDNA molecule may also be referred to as a “fragment.”
  • the term “fragment” refers to a sequence read, or set of sequence reads, that have been processed for subsequent analysis (e.g., for use in machine-learning-based classification), as described herein.
  • raw sequence reads can be aligned to a reference genome and matching paired end sequence reads assembled into a longer fragment for subsequent analysis.
  • the term “individual” refers to a human individual.
  • the term “healthy individual” refers to an individual presumed not to have a cancer or disease.
  • subject refers to an individual whose DNA is being analyzed.
  • a subject may be a test subject whose DNA is to be evaluated using a targeted panel as described herein to evaluate whether the person has cancer or another disease.
  • a subject may also be part of a control group known not to have cancer or another disease.
  • a subject may also be part of a cancer or other disease group known to have cancer or another disease. Control and cancer/disease groups may be used to assist in designing or validating the targeted panel.
  • sequence reads refers to nucleotide sequences read from a sample. Sequence reads can be obtained through various methods provided herein or as known in the art.
  • sequencing depth refers to the count of the number of times a given target nucleic acid within a sample has been sequenced (e.g., the count of sequence reads at a given target region). Increasing sequencing depth can reduce the amounts of nucleic acids required to assess a disease state (e.g., cancer or cancer tissue of origin).
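  • Counting reads at a target region, as in the example definition of sequencing depth, can be sketched as below. The half-open `(start, end)` interval convention and the function name `depth_at_region` are assumptions; reads are taken to be already aligned to the same chromosome as the region.

```python
def depth_at_region(reads, region_start: int, region_end: int) -> int:
    """Count aligned reads that overlap the half-open region [start, end).

    `reads` is an iterable of (start, end) half-open alignment intervals.
    """
    return sum(
        1 for start, end in reads
        if start < region_end and end > region_start
    )


# Hypothetical alignments on one chromosome.
reads = [(0, 100), (50, 150), (200, 300)]
depth = depth_at_region(reads, 90, 110)  # overlapped by the first two reads
```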
  • tissue of origin refers to the organ, organ group, body region or cell type that a cancer arises or originates from.
  • the identification of a tissue of origin or cancer cell type typically allows for identification of the most appropriate next steps in the care continuum of cancer to further diagnose, stage and decide on treatment.
  • transition generally refers to changes in base composition from one purine to another purine, or from one pyrimidine to another pyrimidine. For instance, the following changes are transitions: C->U, U->C, G->A, A->G, C->T, and T->C.
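  • The listed transitions can be encoded as a small lookup. The set below contains exactly the six changes named above; the function name `is_transition` is a hypothetical helper.

```python
# The six base changes listed as transitions in the definition above.
TRANSITIONS = {
    ("C", "U"), ("U", "C"),
    ("G", "A"), ("A", "G"),
    ("C", "T"), ("T", "C"),
}


def is_transition(ref: str, alt: str) -> bool:
    """Return True if the change ref -> alt is one of the listed transitions."""
    return (ref.upper(), alt.upper()) in TRANSITIONS


bisulfite_change = is_transition("C", "U")  # unmethylated C converted to U
```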
  • a panel or bait set generally refers to all of the probes delivered with a specified panel or bait set.
  • a panel or bait set may include both (1) probes having features specified herein (e.g., probes for binding to cell-free DNA fragments
  • probes of a panel generally refers to all probes delivered with the panel or bait set, including such probes that do not contain the specified feature(s).
  • the present description provides a cancer assay panel comprising a plurality of probes or a plurality of probe pairs.
  • the assay panels described herein can alternatively be referred to as bait sets or as compositions comprising bait oligonucleotides.
  • the probes are specifically designed to target one or more nucleic acid molecules corresponding to, or derived from genomic regions differentially methylated between cancer and non-cancer samples, between different cancer tissue of origin (TOO) types, between different cancer cell types, or between samples of different stages of cancer, as identified by methods provided herein.
  • TOO refers to a cancer tissue of origin.
  • probes target genomic regions (or nucleic acid molecules derived therefrom) having methylation patterns specific to a cancer type, e.g., (1) bladder cancer, (2) breast cancer, (3) cervical cancer, (4) colorectal cancer, (5) head and neck cancer, (6) hepatobiliary cancer, (7) lung cancer, (8) melanoma, (9) ovarian cancer, (10) pancreatic cancer, (11) prostate cancer, (12) renal cancer, (13) thyroid cancer, (14) upper gastrointestinal cancer, or (15) uterine cancer.
  • the panel includes probes targeting genomic regions specific to a single cancer type.
  • the panel includes probes specific to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 or more cancer types.
  • the target genomic regions are selected to maximize classification accuracy, subject to a size budget (which is determined by sequencing budget and desired depth of sequencing).
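  • The text states the objective (maximize classification accuracy subject to a size budget) without fixing an algorithm here. One simple stand-in is a greedy heuristic that ranks regions by an informativeness score per base pair. The greedy rule, the scores, and the function name `select_regions` are all assumptions for illustration and are not presented as the disclosed selection method.

```python
def select_regions(regions, budget_bp: int):
    """Greedily pick regions by score per base pair until the budget is used.

    `regions` is a list of (name, size_bp, score) tuples, where `score` is a
    hypothetical measure of how informative the region is for classification.
    Returns the chosen names and the total panel size in bp.
    """
    chosen, used = [], 0
    ranked = sorted(regions, key=lambda r: r[2] / r[1], reverse=True)
    for name, size_bp, _score in ranked:
        if used + size_bp <= budget_bp:
            chosen.append(name)
            used += size_bp
    return chosen, used


# Hypothetical candidate regions: (name, size in bp, informativeness score).
candidates = [("r1", 100, 5.0), ("r2", 200, 4.0), ("r3", 50, 1.5)]
panel, panel_size = select_regions(candidates, budget_bp=160)
```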
  • an analytics system may collect samples corresponding to various outcomes under consideration, e.g., samples known to have cancer, samples considered to be healthy, samples from a known tissue of origin, etc.
  • the sources of the cfDNA and/or ctDNA used to select target genomic regions can vary depending on the purpose of the assay. For example, different sources may be desirable for an assay intended to diagnose cancer generally, a specific type of cancer, a cancer stage, or a tissue of origin.
  • These samples may be processed using one or more methods known in the art to determine the methylation status of CpG sites (e.g., with whole-genome bisulfite sequencing (WGBS)), or the information may be obtained from a public database (e.g., TCGA).
  • the analytics system may be any generic computing system with a computer processor and a computer-readable storage medium with instructions for executing the computer processor to perform any or all operations described in this present disclosure.
  • the cancer assay panel design and utility are generally described in FIG. 2.
  • an analytics system collects samples corresponding to various outcomes under consideration, e.g., samples known to have cancer, samples considered to be healthy, samples from a known TOO, etc. These samples may be processed with whole-genome bisulfite sequencing (WGBS) or obtained from a public database (e.g., TCGA).
  • the analytics system may be any generic computing system with a computer processor and a computer-readable storage medium with instructions for executing the computer processor to perform any or all operations described in this present disclosure. With the samples, the analytics system determines methylation statuses at CpG sites for each fragment in the sample.
  • the analytics system may then select target genomic regions for inclusion in a cancer assay panel based on methylation patterns of nucleic acid fragments.
  • One approach considers pairwise distinguishability between pairs of outcomes (e.g., one cancer type vs. a second cancer type) for selection of targeted regions.
  • Another approach considers distinguishability for target genomic regions when considering each outcome against the remaining outcomes (e.g., one cancer type vs. all other cancer types). From the selected target genomic regions with high distinguishability power, the analytics system may design probes to target nucleic acid fragments inclusive of, or derived from, the selected genomic regions.
  • the analytics system may generate variable sizes of the cancer assay panel, e.g., where a small sized cancer assay panel includes probes targeting the most informative genomic region, a medium sized cancer assay panel includes probes from the small sized cancer assay panel and additional probes targeting a second tier of informative genomic regions, and a large sized cancer assay panel includes probes from the small sized and the medium sized cancer assay panels and even more probes targeting a third tier of informative genomic regions.
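  • The nested small, medium, and large panels can be sketched as prefixes of a ranked region list, so that each larger panel contains the smaller ones. The function name `build_tiered_panels` and the tier sizes are hypothetical.

```python
def build_tiered_panels(ranked_regions, small_n: int, medium_n: int):
    """Split a most-informative-first region list into nested panels.

    The small panel is the top tier, the medium panel adds a second tier,
    and the large panel adds everything that remains as a third tier.
    """
    return {
        "small": list(ranked_regions[:small_n]),
        "medium": list(ranked_regions[:medium_n]),
        "large": list(ranked_regions),
    }


# Hypothetical regions, already ranked from most to least informative.
ranked = ["regA", "regB", "regC", "regD", "regE", "regF"]
panels = build_tiered_panels(ranked, small_n=2, medium_n=4)
```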
  • the analytics system may train classifiers with various classification techniques to predict a sample’s likelihood of having a particular outcome or state, e.g., cancer, specific cancer type, other disorder, other disease, etc.
  • an analytics system may collect information on the methylation status of CpG sites of nucleic acid fragments from samples corresponding to various outcomes under consideration, e.g., samples known to have cancer, samples considered to be healthy, samples from a known TOO, etc. These samples may be processed (e.g., with whole-genome bisulfite sequencing (WGBS)) to determine the methylation status of CpG sites, or the information may be obtained from TCGA.
  • the cancer assay panel comprises at least 500 pairs of probes, wherein each pair of the at least 500 pairs comprises two probes configured to overlap each other by an overlapping sequence, wherein the overlapping sequence comprises at least 30 nucleotides, and wherein each probe is configured to hybridize to a converted DNA (e.g., a cfDNA) molecule corresponding to one or more genomic regions.
  • each of the genomic regions comprises at least five methylation sites, and wherein the at least five methylation sites have an abnormal methylation pattern in cancerous samples or a different methylation pattern between samples of a different TOO.
  • each pair of probes comprises a first probe and a second probe, wherein the second probe differs from the first probe.
  • the second probe can overlap with the first probe by an overlapping sequence that is at least 30, at least 40, at least 50, or at least 60 nucleotides in length.
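  • The probe-pair overlap requirement can be checked from probe coordinates. This sketch assumes each probe is described by a half-open `(start, end)` interval on the same target sequence; the 120-nucleotide probe length in the example and the function names are hypothetical.

```python
def overlap_length(probe_a, probe_b) -> int:
    """Number of positions shared by two half-open (start, end) intervals."""
    return max(0, min(probe_a[1], probe_b[1]) - max(probe_a[0], probe_b[0]))


def is_valid_pair(probe_a, probe_b, min_overlap: int = 30) -> bool:
    """A pair needs two different probes overlapping by at least min_overlap."""
    return probe_a != probe_b and overlap_length(probe_a, probe_b) >= min_overlap


# Two hypothetical 120-nt probes offset by 60 nt share a 60-nt overlap.
first_probe, second_probe = (0, 120), (60, 180)
pair_ok = is_valid_pair(first_probe, second_probe)
```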
  • the target genomic regions can be selected from any one of Lists 1-49 (TABLE 1).
  • the cancer assay panel comprises a plurality of probes, wherein each of the plurality of probes is configured to hybridize to a converted cfDNA molecule corresponding to one or more of the genomic regions in any one of Lists 1-49 or any combination of lists thereof.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 20% of the target genomic regions of any one of Lists 1-49.
  • the plurality of different bait oligonucleotides are configured to hybridize to DNA molecules derived from at least 30%, 40%, 50%, 60%, 70%, or 80% of the target genomic regions of any one of Lists 1-49.
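  • Whether a bait set reaches a given percentage of a list's target genomic regions can be estimated from coordinates. In this sketch a target region counts as targeted when at least one bait interval overlaps it, which is a simplifying assumption about what hybridizing to molecules derived from a region means; the function name `fraction_targeted` and the coordinates are hypothetical.

```python
def fraction_targeted(target_regions, bait_regions) -> float:
    """Fraction of target regions overlapped by at least one bait.

    Both arguments are lists of (chrom, start, end) half-open intervals.
    """
    def overlaps(a, b) -> bool:
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    if not target_regions:
        return 0.0
    covered = sum(
        1 for target in target_regions
        if any(overlaps(target, bait) for bait in bait_regions)
    )
    return covered / len(target_regions)


# Hypothetical list of five target regions and a single bait.
targets = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200),
           ("chr3", 0, 50), ("chr3", 100, 150)]
baits = [("chr1", 150, 270)]
meets_20_percent = fraction_targeted(targets, baits) >= 0.20
```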
  • the target genomic regions can be selected from List 1.
  • a method for detecting bladder cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 1.
  • the target genomic regions can be selected from List 2.
  • a method for detecting breast cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 2.
  • the target genomic regions can be selected from List 3.
  • a method for detecting cervical cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 3.
  • the target genomic regions can be selected from List 4.
  • a method for detecting colorectal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 4.
  • the target genomic regions can be selected from List 5.
  • a method for detecting head and neck cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 5.
  • the target genomic regions can be selected from List 6.
  • a method for detecting hepatobiliary cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 6.
  • the target genomic regions can be selected from List 7.
  • a method for detecting lung cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 7.
  • the target genomic regions can be selected from List 8.
  • a method for detecting melanoma comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 8.
  • the target genomic regions can be selected from List 9.
  • a method for detecting ovarian cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 9.
  • the target genomic regions can be selected from List 10.
  • a method for detecting pancreatic cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 10.
  • the target genomic regions can be selected from List 11.
  • a method for detecting prostate cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 11.
  • the target genomic regions can be selected from List 12.
  • a method for detecting renal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 12.
  • the target genomic regions can be selected from List 13.
  • a method for detecting thyroid cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 13.
  • the target genomic regions can be selected from List 14.
  • a method for detecting upper gastrointestinal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 14.
  • the target genomic regions can be selected from List 15.
  • a method for detecting uterine cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 15.
  • the target genomic regions can be selected from List 16.
  • a method for detecting anorectal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 16.
  • the target genomic regions can be selected from List 17.
  • a method for detecting bladder and urothelial cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 17.
  • the target genomic regions can be selected from List 18.
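  • a method for detecting breast cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 18.
  • the target genomic regions can be selected from List 19.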
  • a method for detecting cervical cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 19.
  • the target genomic regions can be selected from List 20.
  • a method for detecting colorectal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 20.
  • the target genomic regions can be selected from List 21.
  • a method for detecting head and neck cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 21.
  • the target genomic regions can be selected from List 22.
  • a method for detecting liver and bile duct cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 22.
  • the target genomic regions can be selected from List 23.
  • a method for detecting lung cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 23.
  • the target genomic regions can be selected from List 24.
  • a method for detecting melanoma comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 24.
  • the target genomic regions can be selected from List 25.
  • a method for detecting ovarian cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 25.
  • the target genomic regions can be selected from List 26.
  • a method for detecting pancreatic and gallbladder cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 26.
  • the target genomic regions can be selected from List 27.
  • a method for detecting prostate cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 27.
  • the target genomic regions can be selected from List 28.
  • a method for detecting renal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 28.
  • the target genomic regions can be selected from List 29.
  • a method for detecting sarcoma comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 29.
  • the target genomic regions can be selected from List 30.
  • a method for detecting thyroid cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 30.
  • the target genomic regions can be selected from List 31.
  • a method for detecting upper gastrointestinal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 31.
  • the target genomic regions can be selected from List 32.
  • a method for detecting uterine cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 32.
  • the target genomic regions can be selected from List 33.
  • a method for detecting anorectal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 33.
  • the target genomic regions can be selected from List 34.
  • a method for detecting bladder and urothelial cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 34.
  • the target genomic regions can be selected from List 35.
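  • a method for detecting breast cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 35.
  • the target genomic regions can be selected from List 36.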
  • a method for detecting cervical cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 36.
  • the target genomic regions can be selected from List 37.
  • a method for detecting colorectal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 37.
  • the target genomic regions can be selected from List 38.
  • a method for detecting head and neck cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 38.
  • the target genomic regions can be selected from List 39.
  • a method for detecting liver and bile duct cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 39.
  • the target genomic regions can be selected from List 40.
  • a method for detecting lung cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 40.
  • the target genomic regions can be selected from List 41.
  • a method for detecting melanoma comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 41.
  • the target genomic regions can be selected from List 42.
  • a method for detecting ovarian cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 42.
  • the target genomic regions can be selected from List 43.
  • a method for detecting pancreatic and gallbladder cancers comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 43.
  • the target genomic regions can be selected from List 44.
  • a method for detecting prostate cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 44.
  • the target genomic regions can be selected from List 45.
  • a method for detecting renal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 45.
  • the target genomic regions can be selected from List 46.
  • a method for detecting sarcoma comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 46.
  • the target genomic regions can be selected from List 47.
  • a method for detecting thyroid cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 47.
  • the target genomic regions can be selected from List 48.
  • a method for detecting upper gastrointestinal cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 48.
  • the target genomic regions can be selected from List 49.
  • a method for detecting uterine cancer comprises evaluating the methylation status for sequencing reads derived from the target genomic regions of List 49.
  • because the probes are configured to hybridize to a converted DNA or cfDNA molecule corresponding to, or derived from, one or more genomic regions, the probes can have a sequence different from the targeted genomic region.
  • DNA containing an unmethylated CpG site will be converted to include UpG instead of CpG because unmethylated cytosines are converted to uracils by a conversion reaction (e.g., bisulfite treatment).
  • a probe is configured to hybridize to a sequence including UpG instead of a naturally existing unmethylated CpG.
  • a complementary site in the probe to the unmethylated site can comprise CpA instead of CpG, and some probes targeting a hypomethylated site where all methylation sites are unmethylated can have no guanine (G) bases.
  • at least 3%, 5%, 10%, 15%, or 20% of the probes comprise no CpG sequences.
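The conversion and probe-complementarity rules described above can be sketched in a few lines. This is a minimal illustration, not the disclosed design procedure: it assumes a fragment whose CpG sites are either all methylated or all unmethylated, that every non-CpG cytosine is unmethylated, that uracil is written as T (as it reads after amplification), and that the probe is the reverse complement of the converted strand. The function names `convert` and `probe_for` are hypothetical.

```python
def convert(seq: str, cpg_methylated: bool) -> str:
    """Simulate conversion of one strand: unmethylated C -> U (written as T).

    Assumption: all CpG cytosines share one methylation state and all
    non-CpG cytosines are unmethylated.
    """
    out = []
    for i, base in enumerate(seq):
        is_cpg = base == "C" and i + 1 < len(seq) and seq[i + 1] == "G"
        if base == "C" and not (is_cpg and cpg_methylated):
            out.append("T")  # unmethylated cytosine is converted
        else:
            out.append(base)  # methylated CpG cytosine (or A/G/T) is kept
    return "".join(out)


COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def probe_for(converted: str) -> str:
    """Return the reverse complement of a converted strand."""
    return "".join(COMPLEMENT[b] for b in reversed(converted))


target = "ATCGGACGTTCA"  # toy target with two CpG sites
hypo_probe = probe_for(convert(target, cpg_methylated=False))
hyper_probe = probe_for(convert(target, cpg_methylated=True))
```

In this toy case the probe for the fully unmethylated fragment pairs CpA with each converted site and contains no G at all, while the probe for the fully methylated fragment retains its CpG dinucleotides.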
  • the cancer assay panel can be used to detect the presence or absence of cancer generally and/or provide a cancer classification such as cancer type, stage of cancer such as I, II, III, or IV, or provide the TOO where the cancer is believed to originate.
  • the panel may include probes targeting genomic regions differentially methylated between general cancerous (pan-cancer) samples and non-cancerous samples, or only in cancerous samples with a specific cancer type (e.g., lung cancer-specific targets).
  • a cancer assay panel is designed to include differentially methylated genomic regions based on bisulfite sequencing data generated from the cfDNA from cancer and non-cancer individuals.
  • Each of the probes is designed to target one or more target genomic regions.
  • the target genomic regions are selected based on several criteria designed to increase selective enriching of informative cfDNA fragments while decreasing noise and non-specific bindings.
  • a panel can include probes that can selectively bind and optionally enrich cfDNA fragments that are differentially methylated in cancerous samples.
  • sequence from the enriched fragments can provide information relevant to detection of cancer.
  • the probes are designed to target genomic regions that are determined to have an abnormal methylation pattern in cancer samples, or in samples from certain tissue types or cell types.
  • probes are designed to target genomic regions determined to be hypermethylated or hypomethylated in certain cancers, or cancer tissues of origin, to provide additional selectivity and specificity of the detection.
  • a panel comprises probes targeting hypomethylated fragments.
  • a panel comprises probes targeting hypermethylated fragments.
  • a panel comprises both a first set of probes targeting hypermethylated fragments and a second set of probes targeting hypomethylated fragments. (FIG.
  • the ratio between the first set of probes targeting hypermethylated fragments and the second set of probes targeting hypomethylated fragments ranges between 0.4 and 2, between 0.5 and 1.8, between 0.5 and 1.6, between 1.4 and 1.6, between 1.2 and 1.4, between 1 and 1.2, between 0.8 and 1, between 0.6 and 0.8 or between 0.4 and 0.6.
  • genomic regions i.e., genomic regions giving rise to differentially methylated DNA molecules or anomalously methylated DNA molecules between cancer and non-cancer samples, between different cancer tissue of origin (TOO) types, between different cancer cell types, or between samples from different stages of cancer
  • genomic regions can be selected when the genomic regions give rise to anomalously methylated DNA molecules in cancer samples or samples with known cancer tissue of origin (TOO) types.
  • a Markov model trained on a set of non-cancerous samples can be used to identify genomic regions that give rise to anomalously methylated DNA molecules (i.e., DNA molecules having a methylation pattern below a p-value threshold).
  • Each of the probes can target a genomic region comprising at least 30bp, 35bp, 40bp, 45bp, 50bp, 60bp, 70bp, 80bp, 90bp, 100bp or more.
  • the genomic regions can be selected to have less than 30, 25, 20, 15, 12, 10, 8, or 6 methylation sites.
  • the genomic regions can be selected when at least 80, 85, 90, 92, 95, or 98% of the at least five methylation (e.g., CpG) sites within the region are either methylated or unmethylated in non-cancerous or cancerous samples, or in cancer samples from a tissue of origin (TOO).
  • Genomic regions may be further filtered to select only those that are likely to be informative based on their methylation patterns, for example, CpG sites that are differentially methylated between cancerous and non-cancerous samples (e.g., abnormally methylated or unmethylated in cancer versus non-cancer), between cancerous samples of a TOO and cancerous samples of a different TOO, or CpG sites that are differentially methylated only in cancerous samples of a specific TOO. For the selection, calculation can be performed with respect to each CpG or a plurality of CpG sites.
  • a first count is determined that is the number of cancer-containing samples (cancer count) that include a fragment overlapping that CpG site.
  • a second count is determined that is the number of total samples containing fragments overlapping that CpG site (total).
  • Genomic regions can be selected based on criteria positively correlated to the number of cancer-containing samples (cancer count) that include a fragment indicative of cancer overlapping that CpG site, and inversely correlated with the number of total samples containing fragments indicative of cancer overlapping that CpG site (total).
  • the number of non-cancerous samples ($n_{\text{non-cancer}}$) and the number of cancerous samples ($n_{\text{cancer}}$) having a fragment overlapping a CpG site are counted. Then the probability that a sample is cancer is estimated, for example as $(n_{\text{cancer}} + 1) / (n_{\text{cancer}} + n_{\text{non-cancer}} + 2)$.
  • CpG sites scored by this metric are ranked and greedily added to a panel until the panel size budget is exhausted.
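The scoring and greedy selection described above can be illustrated with a short sketch. The estimate is the one given in the text; everything else is an assumption made for illustration: the budget is expressed as a number of CpG sites, and the site names and counts are invented.

```python
def cancer_probability(n_cancer: int, n_non_cancer: int) -> float:
    """Smoothed estimate (n_cancer + 1) / (n_cancer + n_non_cancer + 2)."""
    return (n_cancer + 1) / (n_cancer + n_non_cancer + 2)


def select_sites(counts: dict, budget: int) -> list:
    """Rank CpG sites by the estimate and keep sites until the budget is used.

    Assumption: the panel size budget is a simple count of CpG sites.
    """
    ranked = sorted(counts, key=lambda s: cancer_probability(*counts[s]), reverse=True)
    return ranked[:budget]


# Hypothetical counts: site -> (n_cancer, n_non_cancer)
counts = {"cpg_a": (9, 0), "cpg_b": (5, 5), "cpg_c": (3, 0), "cpg_d": (0, 4)}
panel = select_sites(counts, budget=2)
```

The added 1 and 2 keep the estimate away from 0 and 1 when few samples cover a site, so a site seen in three cancer samples and no others (0.8) ranks below one seen in nine (about 0.91).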
  • the process of selecting genomic regions indicative of cancer is further detailed herein.
  • the selection process can vary depending on whether the assay is intended to be a pan-cancer assay or a single-cancer assay, or depending on what kind of flexibility is desired when picking which CpG sites contribute to the panel.
  • a panel for detecting a specific cancer type can be designed using a similar process.
  • the information gain is computed to determine whether to include a probe targeting that CpG site.
  • the information gain may be computed for samples with a given cancer type of a TOO compared to all other samples. For example, consider two random variables,
  • CT is a binary random variable indicating whether the cancer is of a particular type (e.g., lung cancer or cancer other than lung).
  • CpG sites are ranked by this information gain metric, and then greedily added to a panel until the size budget for that cancer type is exhausted.
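As a rough illustration of an information gain metric, the sketch below computes the mutual information between two binary variables observed across samples. Only CT is defined in the text above; the second variable, here called `af`, is an assumption: it is taken to indicate whether a sample has an anomalous fragment overlapping the CpG site. The formula used, H(CT) - H(CT | AF), is the standard definition and may differ from the exact computation in the disclosure.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(af, ct) -> float:
    """H(CT) - H(CT | AF) for paired per-sample observations.

    af: 1 if the sample has an anomalous fragment at the site (assumed variable)
    ct: 1 if the sample is of the cancer type of interest
    """
    n = len(ct)
    conditional = 0.0
    for value in set(af):
        subset = [c for a, c in zip(af, ct) if a == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(ct) - conditional


# Hypothetical observations for four samples at two CpG sites
informative = information_gain([1, 1, 0, 0], [1, 1, 0, 0])
uninformative = information_gain([1, 0, 1, 0], [1, 1, 0, 0])
```

A site whose anomalous fragments track the cancer type perfectly scores one bit, and a site that is independent of the cancer type scores zero, so ranking by this value favors the first kind.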
  • Probes can be filtered to reduce non-specific binding (or off-target binding) to nucleic acids derived from non-targeted genomic regions. For example, probes can be filtered to select only those probes having less than a set threshold of off-target binding events.
  • probes can be aligned to a reference genome (e.g., a human reference genome) to select probes that align to less than a set threshold of regions across the genome.
  • probes can be selected that align to less than 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9 or 8 off-target regions across the reference genome.
  • filtration is performed to remove genomic regions when the sequence of the target genomic regions appears more than 5, 10, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 or 35 times in a genome.
  • Further filtration can be performed to select target genomic regions when a probe sequence, or a set of probe sequences, that is 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% homologous to the target genomic regions appears less than 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9 or 8 times in a reference genome, or to remove target genomic regions when the probe sequence, or a set of probe sequences designed to enrich for the targeted genomic region, that is 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% homologous to the target genomic regions appears more than 5, 10, 15, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34 or 35 times in a reference genome. This excludes repetitive probes that can pull down off-target fragments, which are not desired and can impact assay efficiency.
  • a fragment-probe overlap of at least 45 bp was demonstrated to be effective for achieving a non-negligible amount of pulldown (though as one of skill in the art would appreciate this number can vary) as provided in Example 1.
  • more than a 10% mismatch rate between the probe and fragment sequences in the region of overlap is sufficient to greatly disrupt binding, and thus pulldown efficiency. Therefore, sequences that can align to the probe along at least 45 bp with at least a 90% match rate can be candidates for off-target pulldown.
  • the number of such regions is scored. The best probes have a score of 1, meaning they match in only one place (the intended target region). Probes with an intermediate score (say, less than 5 or 10) may in some instances be accepted, and in some instances any probes above a particular score are discarded. Other cutoff values can be used for specific samples.
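The scoring rule above (count the places where at least 45 bp of the probe aligns with at least a 90% match) can be sketched with a brute-force, ungapped comparison. This is only a toy under stated assumptions: a real workflow would use a sequence aligner against a full reference genome, gaps are ignored, and the function name `off_target_score` and the way nearby hits are merged into one region are hypothetical choices.

```python
def off_target_score(probe: str, genome: str, min_overlap: int = 45,
                     min_identity: float = 0.9) -> int:
    """Count genome regions matched by some ungapped window of the probe.

    A window of `min_overlap` bases is a hit when the fraction of matching
    bases is at least `min_identity`. Hits whose start positions are less
    than `min_overlap` apart are merged into a single region.
    """
    hits = []
    for g in range(len(genome) - min_overlap + 1):
        for p in range(len(probe) - min_overlap + 1):
            matches = sum(genome[g + k] == probe[p + k] for k in range(min_overlap))
            if matches / min_overlap >= min_identity:
                hits.append(g)
                break  # one matching probe window is enough for this position
    regions, last = 0, None
    for g in hits:  # hits are already in increasing order
        if last is None or g - last >= min_overlap:
            regions += 1
        last = g
    return regions


# Toy sequences with a shortened window so the example stays readable
probe = "ACGTTGCAAG"
unique_genome = "T" * 12 + probe + "T" * 12
repeat_genome = "T" * 12 + probe + "G" * 12 + "ACGATGCAAG" + "T" * 12
unique_score = off_target_score(probe, unique_genome, min_overlap=10)
repeat_score = off_target_score(probe, repeat_genome, min_overlap=10)
```

The second toy genome contains a copy of the target with one mismatch in ten bases, which still meets the 90% threshold, so the probe's score rises from 1 to 2.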
  • the hybridized probe-DNA fragment intermediates are pulled down (or isolated), and the targeted DNA is amplified and its methylation status is determined by, for example, sequencing or hybridization to a microarray, etc.
  • the sequence read provides information relevant for detection of cancer.
  • a panel is designed to include a plurality of probes that can capture fragments that can together provide information relevant to detection of cancer.
  • a panel includes at least 5, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1,200, 1,400, 1,600, 1,800, 2,000, 2,200, 2,400, 2,600, 2,800, 3,000, 3,200, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, or
  • a panel includes at least 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1,200, 1,400, 1,600, 1,800, 2,000, 2,200, 2,400, 2,600, 2,800, 3,000, 3,200, 4,000, 4,500, 5,000, 5,500, 6,000, 6,500, 7,000, 7,500, 8,000, 8,500, 9,000, 10,000, 15,000, or 20,000 probes.
  • the plurality of probes together can comprise at least 10,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000, 100,000, 120,000, 140,000, 160,000, 180,000, 200,000, 240,000, 260,000, 280,000, 300,000, 320,000, 400,000, 450,000, 500,000, 550,000, 600,000, 650,000, 700,000, 750,000, 800,000, 850,000, 900,000, 1 million, 1.5million,
  • the selected target genomic regions can be located in various positions in a genome, including but not limited to exons, introns, intergenic regions, and other parts.
  • probes targeting non-human genomic regions such as those targeting viral genomic regions, can be added.
  • primers may be used to specifically amplify targets/biomarkers of interest (e.g., by PCR), thereby enriching the sample for desired targets/biomarkers (optionally without hybridization capture).
  • forward and reverse primers can be prepared for each genomic region of interest and used to amplify fragments that correspond to or are derived from the desired genomic region.
  • additional or alternative methods are used for enrichment (e.g., non-targeted enrichment) such as reduced representation bisulfite sequencing, methylation restriction enzyme sequencing, methylation DNA immunoprecipitation sequencing, methyl-CpG-binding domain protein sequencing, methyl DNA capture sequencing, or microdroplet PCR.
  • the cancer assay panel provided herein is a panel including a set of hybridization probes (also referred to herein as “probes”) designed to, during enrichment, target and pull down nucleic acid fragments of interest for the assay.
  • the probes are designed to hybridize and enrich DNA or cfDNA molecules from cancerous samples that have been treated to convert unmethylated cytosines (C) to uracils (U).
  • the probes are designed to hybridize and enrich DNA or cfDNA molecules from cancerous samples of a TOO that have been treated to convert unmethylated cytosines (C) to uracils (U).
  • the probes can be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA.
  • the target strand can be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand.
  • a cancer assay panel may include sets of two probes, one probe targeting the positive strand and the other probe targeting the negative strand of a target genomic region.
  • probes or probe sets can target either the “positive” or forward strand or its reverse complement (the “negative” strand). Additionally, in some embodiments, the probes or probe sets are designed to enrich DNA molecules or fragments that have been treated to convert unmethylated cytosines (C) to uracils (U).
  • the probes or probe sets are designed to enrich DNA molecules corresponding to, or derived from the targeted regions after conversion
  • the probe’s sequence can be designed to enrich DNA molecules or fragments where unmethylated C’s have been converted to U’s (by utilizing A’s in place of G’s at sites that are unmethylated cytosines in DNA molecules or fragments corresponding to, or derived from, the targeted region).
  • probes are designed to bind to, or hybridize to, DNA molecules or fragments from genomic regions known to contain cancer-specific methylation patterns (e.g., hypermethylated or hypomethylated DNA molecules), thereby enriching (or detecting) cancer-specific DNA molecules or fragments.
  • Targeting genomic regions, or cancer-specific methylation patterns, can be advantageous, allowing one to specifically enrich for DNA molecules or fragments identified as informative for cancer or cancer TOO, and thus lowering detection needs and costs (e.g., lowering sequencing costs).
  • two probe sequences can be designed per target genomic region (one for each DNA strand).
  • probes are designed to enrich for all DNA molecules or fragments corresponding to, or derived from, a targeted region (i.e., regardless of strand or methylation status).
  • the cancer methylation status is not highly methylated or unmethylated, or because the probes are designed to target small mutations or other variations rather than methylation changes, with these other variations similarly indicative of the presence or absence of a cancer or the presence or absence of a cancer of one or more TOOs. In that case, all four possible probe sequences can be included per target genomic region.
  • the probes can range in length from 10s, 100s, 200s, or 300s of base pairs.
  • the probes can comprise at least 50, 75, 100, or 120 nucleotides.
  • the probes can comprise less than 300, 250, 200, or 150 nucleotides.
  • the probes comprise 100-150 nucleotides.
  • the probes comprise 120 nucleotides.
  • the probes are designed in a “2x tiled” fashion to cover overlapping portions of a target region. Each probe optionally overlaps in coverage at least partially with another probe in the library.
  • the panel contains multiple pairs of probes, with each probe in a pair overlapping the other by at least 25, 30, 35, 40, 45, 50, 60, 70, 75 or 100 nucleotides.
  • the overlapping sequence can be designed to be complementary to a target genomic region (or cfDNA derived therefrom) or to be complementary to a sequence with homology to a target region or cfDNA.
  • at least two probes are complementary to the same sequence within a target genomic region, and a nucleotide fragment corresponding to or derived from the target genomic region can be bound and pulled down by at least one of the probes.
  • Other levels of tiling are possible, such as 3x tiling, 4x tiling, etc., wherein each nucleotide in a target region can bind to more than two probes.
  • each base in a target genomic region is overlapped by exactly two probes, as illustrated in FIG. 1A.
  • a single pair of probes is enough to pull down a genomic region if the overlap between the two probes is longer than the target genomic region and extends beyond both ends of the target genomic region.
  • even relatively small target regions may be targeted with three probes (see FIG. 1A).
  • a probe set comprising three or more probes is optionally used to capture a larger genomic region (see FIG. 1B).
  • subsets of probes will collectively extend across an entire genomic region (e.g., may be complementary to non-converted or converted fragments from the genomic region).
  • a tiled probe set optionally comprises probes that collectively include at least two probes that overlap every nucleotide in the genomic region. This is done to ensure that cfDNAs comprising a small portion of a target genomic region at one end will have a substantial overlap extending into the adjacent non-targeted genomic region with at least one probe, to provide for efficient capture.
  • a 100 bp cfDNA fragment comprising a 30 nt target genomic region can be guaranteed to have at least 65 bp overlap with at least one of the overlapping probes.
  • Other levels of tiling are possible.
  • probes can be designed to expand a 30 bp target region by at least 70 bp, 65 bp, 60 bp, 55 bp, or 50 bp.
  • the probes can be designed to extend past the ends of the target region on either side.
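One way to picture the tiling described above is a sketch that lays out probe intervals so that every base of a target is overlapped by exactly two probes (or by `tiling` probes in general). The layout rule is an assumption made for illustration: probes are offset by the probe length divided by the tiling level, and the probe set is centered on the target so that it extends past both ends. It uses the minimum number of probes, so a 30 bp target gets a single pair rather than three. The function name `tile_probes` and the coordinates are hypothetical.

```python
import math


def tile_probes(start: int, end: int, probe_len: int = 120, tiling: int = 2):
    """Return half-open (start, end) probe intervals tiling the target [start, end).

    Assumes probe_len is divisible by tiling. Adjacent probes are offset by
    probe_len // tiling, so each target base is covered by exactly `tiling`
    probes, and the set is centered to extend past both target ends.
    """
    step = probe_len // tiling
    n_probes = math.ceil((end - start) / step) + tiling - 1
    covered = (n_probes - tiling + 1) * step  # span covered at full depth
    first = (start + end) // 2 - covered // 2 - (probe_len - step)
    return [(first + i * step, first + i * step + probe_len) for i in range(n_probes)]


small = tile_probes(1000, 1030)   # 30 bp target
large = tile_probes(5000, 5200)   # 200 bp target


def depth(probes, x):
    """Number of probes whose interval contains position x."""
    return sum(s <= x < e for s, e in probes)
```

For the 30 bp target the two 120 bp probes overlap each other by 60 bp, and that overlap extends 15 bp beyond each end of the target, consistent with the single-pair case described above. The 200 bp target needs five probes.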
  • the probes are designed to analyze methylation status of target genomic regions (e.g., of the human or another organism) that are suspected to correlate with the presence or absence of cancer generally, presence or absence of certain types of cancers, cancer stage, or presence or absence of other types of diseases.
  • the probes are designed to effectively hybridize to and optionally pull down cfDNA fragments containing a target genomic region.
  • the probes are designed to cover overlapping portions of a target region, so that the probes are “tiled” in coverage, with each probe overlapping at least partially with another probe in the library.
  • the panel contains multiple pairs of probes, where each pair comprises at least two probes overlapping each other by an overlapping sequence of at least 25, 30, 35, 40, 45, 50, 60, 70, 75 or 100 nucleotides.
  • the overlapping sequence can be designed to be complementary to a target genomic region (or a converted version thereof), thus a nucleotide fragment derived from or containing the target genomic region can be bound and optionally pulled down by at least one of the probes.
  • the smallest target genomic region is 30bp.
  • the new target region of 30bp can be centered on a specific CpG site of interest. Then, it is checked whether each edge of this new target is close enough to other targets such that they can be merged. This is based on a “merge distance” parameter which can be 200bp by default but can be tuned. This allows close but distinct target regions to be enriched with overlapping probes.
  • the new target can be merged with nothing (increasing the number of panel targets by one), merged with just one target either to the left or the right (not changing the number of panel targets), or merged with existing targets both to the left and right (reducing the number of panel targets by one).
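The three merge outcomes described above can be reproduced with a small interval routine. This is a sketch under stated assumptions: targets are half-open (start, end) intervals on one chromosome, existing targets are already pairwise farther apart than the merge distance, and "close enough" means the gap between the nearest edges is at most the merge distance. The function name `add_target` and the coordinates are hypothetical.

```python
def add_target(targets, cpg_pos, size=30, merge_distance=200):
    """Add a `size`-bp target centered on `cpg_pos` and merge nearby targets.

    targets: sorted list of disjoint (start, end) intervals, assumed to be
    pairwise more than `merge_distance` apart already.
    """
    new_start, new_end = cpg_pos - size // 2, cpg_pos + size // 2
    kept = []
    for start, end in targets:
        # Gap between nearest edges is at most merge_distance (or they overlap)
        if start - new_end <= merge_distance and new_start - end <= merge_distance:
            new_start, new_end = min(start, new_start), max(end, new_end)
        else:
            kept.append((start, end))
    return sorted(kept + [(new_start, new_end)])


existing = [(1000, 1300), (1700, 1730)]
isolated = add_target(existing, 5000)   # merged with nothing
one_side = add_target(existing, 1450)   # merged with the left target only
both_sides = add_target(existing, 1500)  # merged with left and right targets
```

Starting from two targets, the three calls leave three, two, and one target respectively, matching the change in panel target count of plus one, zero, and minus one.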
  • target genomic regions for detecting cancer and/or a TOO are provided.
  • the targeted genomic regions can be used to design and
  • Methylation status of DNA or cfDNA molecules corresponding to, or derived from, the target genomic regions can be screened using the cancer assay panel.
  • Alternative methods, for example WGBS or other methods known in the art, can also be implemented to detect methylation status of DNA molecules or fragments corresponding to, or derived from, the target genomic regions.
  • FIG. 7A is a flowchart of a process 100 for processing a nucleic acid sample and generating methylation state vectors for DNA fragments, according to one embodiment. While the present disclosure pays particular attention to sequencing based approaches for detecting nucleic acids and determining methylation status, the disclosure is broad enough to encompass other methods for determining methylation status of nucleic acid sequences (such as
  • the method includes, but is not limited to, the following steps.
  • any step of the method may comprise a quantitation sub-step for quality control or other laboratory assay procedures known to one skilled in the art.
  • a nucleic acid sample (DNA or RNA) is extracted from a subject.
  • DNA and RNA may be used interchangeably unless otherwise indicated.
  • the sample may be any subset of the human genome, including the whole genome.
  • the sample may include blood, plasma, serum, urine, fecal matter, saliva, other types of bodily fluids, or any combination thereof.
  • methods for drawing a blood sample e.g., syringe or finger prick
  • the extracted sample may comprise cfDNA and/or ctDNA.
  • the human body may naturally clear out cfDNA and other cellular debris. If a subject has a cancer or disease, cfDNA and/or ctDNA in an extracted sample may be present at a detectable level for detecting the cancer or disease.
  • the cfDNA fragments are treated to convert unmethylated cytosines to uracils.
  • the method uses a bisulfite treatment of the DNA which converts the unmethylated cytosines to uracils without converting the methylated cytosines.
  • a commercial kit such as the EZ DNA Methylation™-Gold, EZ DNA Methylation™-Direct, or EZ DNA Methylation™-Lightning kit (available from Zymo Research Corp (Irvine, CA)) is used for the bisulfite conversion.
  • the conversion of unmethylated cytosines to uracils is accomplished using an enzymatic reaction.
  • the conversion can use a commercially available kit for conversion of unmethylated cytosines to uracils, such as APOBEC-Seq (NEBiolabs, Ipswich, MA).
  • a sequencing library is prepared.
  • a ssDNA adapter is added to the 3'-OH end of a bisulfite-converted ssDNA molecule using a ssDNA ligation reaction.
  • the ssDNA ligation reaction uses CircLigase II (Epicentre) to ligate the ssDNA adapter to the 3'-OH end of a bisulfite-converted ssDNA molecule, wherein the 5'-end of the adapter is phosphorylated and the bisulfite-converted ssDNA has been dephosphorylated (i.e., the 3' end has a hydroxyl group).
  • the ssDNA ligation reaction uses Thermostable 5' AppDNA/RNA ligase (available from New England BioLabs (Ipswich, MA)) to ligate the ssDNA adapter to the 3'-OH end of a bisulfite-converted ssDNA molecule.
  • the first UMI adapter is adenylated at the 5'-end and blocked at the 3'-end.
  • the ssDNA ligation reaction uses a T4 RNA ligase (available from New England BioLabs) to ligate the ssDNA adapter to the 3'-OH end of a bisulfite-converted ssDNA molecule.
  • a second strand DNA is synthesized in an extension reaction.
  • an extension primer that hybridizes to a primer sequence included in the ssDNA adapter is used in a primer extension reaction to form a double-stranded bisulfite-converted DNA molecule.
  • the extension reaction uses an enzyme that is able to read through uracil residues in the bisulfite-converted template strand.
  • a dsDNA adapter is added to the double-stranded bisulfite-converted DNA molecule.
  • the double-stranded bisulfite-converted DNA is amplified to add sequencing adapters. For example, PCR amplification using a forward primer that includes a P5 sequence and a reverse primer that includes a P7 sequence is used to add P5 and P7 sequences to the bisulfite-converted DNA.
  • the unique molecular identifiers (UMIs) are short nucleic acid sequences (e.g., 4-10 base pairs) that are added to ends of DNA fragments during adapter ligation.
  • UMIs are degenerate base pairs that serve as a unique tag that can be used to identify sequence reads originating from a specific DNA fragment.
  • the UMIs are replicated along with the attached DNA fragment, which provides a way to identify sequence reads that came from the same original fragment in downstream analysis.
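The use of UMIs to trace reads back to one original fragment can be sketched as a simple grouping step. This is an illustrative assumption rather than the disclosed pipeline: reads are represented as dictionaries with hypothetical `umi`, `start`, and `seq` fields, and the alignment start is added to the grouping key because short UMIs can repeat across unrelated fragments.

```python
from collections import defaultdict


def group_by_umi(reads):
    """Group reads that share a UMI and an alignment start position.

    Each group is treated as the set of copies of one original fragment.
    The (umi, start) key is an assumed convention for this sketch.
    """
    families = defaultdict(list)
    for read in reads:
        families[(read["umi"], read["start"])].append(read["seq"])
    return dict(families)


# Hypothetical reads: two PCR copies of one fragment plus two other fragments
reads = [
    {"umi": "ACGTAC", "start": 1000, "seq": "TTGACG"},
    {"umi": "ACGTAC", "start": 1000, "seq": "TTGACG"},
    {"umi": "ACGTAC", "start": 4200, "seq": "GGATCA"},
    {"umi": "TTAGGC", "start": 1000, "seq": "TTGATG"},
]
families = group_by_umi(reads)
```

Downstream analysis can then collapse each family to a single consensus read so that amplification copies are not counted as independent fragments.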
  • targeted DNA sequences may be enriched from the library. This is used, for example, where a targeted panel assay is being performed on the samples.
  • hybridization probes (also referred to herein as “probes”) are used to target, and pull down, nucleic acid fragments informative for the presence or absence of cancer (or disease), cancer status, or a cancer classification (e.g., cancer type or tissue of origin).
  • the probes may be designed to anneal (or hybridize) to a target (complementary) strand of DNA or RNA.
  • the target strand may be the “positive” strand (e.g., the strand transcribed into mRNA, and subsequently translated into a protein) or the complementary “negative” strand.
  • the probes may range in length from 10s, 100s, or 1000s of base pairs. Moreover, the probes may cover overlapping portions of a target region.
  • the hybridized nucleic acid fragments are captured and may also be amplified using PCR (enrichment 125).
  • the target sequences can be enriched to obtain enriched sequences that can be subsequently sequenced.
  • any known method in the art can be used to isolate, and enrich for, probe-hybridized target nucleic acids.
  • a biotin moiety can be added to the 5'-end of the probes (i.e., biotinylated) to facilitate isolation of target nucleic acids hybridized to probes using a streptavidin-coated surface (e.g., streptavidin-coated beads).
  • sequence reads are generated from the enriched DNA sequences.
  • Sequence data may be acquired from the enriched DNA sequences by known means in the art.
  • the method may include next generation sequencing (NGS) techniques including synthesis technology (Illumina) and pyrosequencing (454 Life Sciences).
  • methylation-aware sequencing see e.g., WO 2014/043763
  • a DNA microarray e.g., with labeled probes adhered or conjugated to a solid surface or DNA array chip, etc.
  • methylation state vectors are generated from the sequence reads.
  • a sequence read is aligned to a reference genome.
  • the reference genome helps provide the context as to what position in a human genome the fragment cfDNA originates from.
  • the sequence read is aligned such that the three CpG sites correlate to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for convenience of description).
  • a methylation state vector may be generated for the fragment cfDNA.
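A methylation state vector for one aligned read can be sketched as follows, using the example of three CpG sites numbered 23, 24, and 25. The details are assumptions for illustration: the read is a converted forward-strand read aligned without gaps at a known offset, a retained C at a reference CpG is called methylated ("M"), a T is called unmethylated ("U"), and anything else is marked indeterminate ("I"). The `cpg_index` mapping from reference position to site identifier is hypothetical.

```python
def methylation_state_vector(reference, read, offset, cpg_index):
    """Return [(cpg_site_id, state)] for each reference CpG covered by the read.

    state is "M" (C retained), "U" (C converted, read as T) or "I" (other
    base; an assumed label for ambiguous calls).
    """
    vector = []
    for k, base in enumerate(read):
        pos = offset + k
        if reference[pos:pos + 2] != "CG":
            continue  # not the cytosine of a reference CpG site
        state = "M" if base == "C" else "U" if base == "T" else "I"
        vector.append((cpg_index[pos], state))
    return vector


reference = "AACGTTCGACGA"         # CpG cytosines at positions 2, 6 and 9
cpg_index = {2: 23, 6: 24, 9: 25}  # arbitrary site identifiers
read = "AACGTTTGACGA"              # converted read; site 24 was unmethylated
vector = methylation_state_vector(reference, read, offset=0, cpg_index=cpg_index)
```

Here the read yields the vector M, U, M over sites 23 to 25, which is the per-fragment record the later steps operate on.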
  • FIG. 3A is a flowchart describing a process 300 of generating a data structure for a healthy control group, according to an embodiment.
  • the analytics system obtains information related to methylation status of a plurality of CpG sites on sequence reads derived from a plurality of DNA molecules or fragments from a plurality of healthy subjects.
  • the method provided herein for creating a healthy control group data structure can be performed similarly for subjects with cancer, subjects with cancer of a TOO, subjects with a known cancer type, or subjects with another known disease state.
  • a methylation state vector is generated for each DNA molecule or fragment, for example via the process 100.
  • the analytics system subdivides 310 the methylation state vector into strings of CpG sites.
  • the analytics system subdivides 310 the methylation state vector such that the resulting strings are all less than a given length.
  • a methylation state vector of length 11 subdivided into strings of length less than or equal to 3 would result in 9 strings of length 3, 10 strings of length 2, and 11 strings of length 1.
  • a methylation state vector of length 7 being subdivided into strings of length less than or equal to 4 would result in 4 strings of length 4, 5 strings of length 3, 6 strings of length 2, and 7 strings of length 1.
  • the methylation state vector may be converted into a single string containing all of the CpG sites of the vector.
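The subdivision counts quoted above can be checked with a short sketch that enumerates every contiguous string up to a maximum length. The function name `subdivide` and the choice to return each string together with its starting index are hypothetical.

```python
from collections import Counter


def subdivide(vector, max_len):
    """Return (start_index, string) for every contiguous string of length <= max_len."""
    return [(i, vector[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(vector) - n + 1)]


# Count the strings of each length for the two examples in the text
lengths_11 = Counter(len(s) for _, s in subdivide("MMUMUUMMMUM", 3))
lengths_7 = Counter(len(s) for _, s in subdivide("MUMMUUM", 4))
```

A vector of length L contributes L - n + 1 strings of length n, which gives 9, 10, and 11 strings for the length-11 example and 4, 5, 6, and 7 strings for the length-7 example.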
  • the analytics system tallies 320 the strings by counting, for each possible CpG site and possibility of methylation states in the vector, the number of strings present in the control group having the specified CpG site as the first CpG site in the string and having that possibility of methylation states. For example, at a given CpG site and considering string lengths of 3, there are $2^3$ or 8 possible string configurations. At that given CpG site, for each of the 8 possible string configurations, the analytics system tallies 320 how many occurrences of each methylation state vector possibility come up in the control group.
  • this may involve tallying the following quantities: $\langle M_x, M_{x+1}, M_{x+2} \rangle$, $\langle M_x, M_{x+1}, U_{x+2} \rangle$, . . ., $\langle U_x, U_{x+1}, U_{x+2} \rangle$ for each starting CpG site $x$ in the reference genome.
  • the analytics system creates 330 the data structure storing the tallied counts for each starting CpG site and string possibility.
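The tallying step and the resulting data structure can be sketched as a counter keyed by starting CpG site and methylation string. This is a minimal illustration, assuming each control-group fragment is given as the identifier of its first CpG site plus a string of "M" and "U" states over consecutive sites; the function name `tally` and the input layout are hypothetical.

```python
from collections import Counter


def tally(vectors, max_len=3):
    """Count (starting CpG site, methylation string) pairs over all fragments.

    vectors: iterable of (first_cpg_site, states), e.g. (23, "MMU").
    Strings of every length from 1 to max_len are counted.
    """
    counts = Counter()
    for first, states in vectors:
        for n in range(1, max_len + 1):
            for i in range(len(states) - n + 1):
                counts[(first + i, states[i:i + n])] += 1
    return counts


# Hypothetical control-group fragments
control = [(23, "MMU"), (23, "MMM"), (24, "MU")]
counts = tally(control)
```

A lookup such as `counts[(23, "MM")]` then returns how many control fragments showed that pattern starting at site 23, which is the kind of count a downstream probability model would draw on.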
  • a statistical consideration to limiting the maximum string length is to avoid overfitting downstream models that use the string counts. If long strings of CpG sites do not, biologically, have a strong effect on the outcome (e.g., predictions of anomalousness that are predictive of the presence of cancer), calculating probabilities based on large strings of CpG sites can be problematic as it requires a significant amount of data that may not be available, and thus would be too sparse for a model to perform appropriately. For example, calculating a probability of anomalousness/cancer conditioned on the prior 100 CpG sites would require counts of strings in the data structure of length 100, ideally some matching exactly the prior 100 methylation states. If only sparse counts of strings of length 100 are available, there will be insufficient data to determine whether a given string of length 100 in a test sample is anomalous or not.
  Validation of data structure
  • the analytics system may seek to validate 340 the data structure and/or any downstream models making use of the data structure.
  • One type of validation checks consistency within the control group’s data structure. For example, if there are any outlier subjects, samples, and/or fragments within a control group, then the analytics system may perform various calculations to determine whether to exclude any fragments from one of those categories.
  • the healthy control group may contain a sample that is undiagnosed but cancerous such that the sample contains anomalously methylated fragments. This first type of validation ensures that potential cancerous samples are removed from the healthy control group so as to not affect the control group’s purity.
  • a second type of validation checks the probabilistic model used to calculate p-values with the counts from the data structure itself (i.e., from the healthy control group).
  • a process for p-value calculation is described below in conjunction with FIG. 5.
  • Once the analytics system generates a p-value for the methylation state vectors in the validation group, the analytics system builds a cumulative density function (CDF) with the p-values. With the CDF, the analytics system may perform various calculations to validate the control group’s data structure.
  • a third type of validation uses a healthy set of validation samples separate from those used to build the data structure, which tests if the data structure is properly built and the model works. An example process for carrying out this type of validation is described below in conjunction with FIG. 3B.
  • the third type of validation can quantify how well the healthy control group generalizes the distribution of healthy samples. If the third type of validation fails, then the healthy control group does not generalize well to the healthy distribution.
  • FIG. 3B is a flowchart describing the additional step 340 of validating the data structure for the control group of FIG. 3A, according to an embodiment.
  • the analytics system performs the third type of validation test as described above, which utilizes a validation group with a supposedly similar composition of subjects, samples, and/or fragments as the control group. For example, if the analytics system selected healthy subjects without cancer for the control group, then the analytics system also uses healthy subjects without cancer in the validation group.
  • the analytics system takes the validation group and generates 100 a set of methylation state vectors as described in FIG. 3A.
  • the analytics system performs a p-value calculation for each methylation state vector from the validation group.
  • the p-value calculation process will be further described in conjunction with FIGS. 4-5.
  • the analytics system calculates a probability from the control group’s data structure. Once the probabilities are calculated for the possibilities of methylation state vectors, the analytics system calculates 350 a p-value score for that methylation state vector based on the calculated probabilities.
  • the p-value score represents an expectedness of finding that specific methylation state vector and other possible methylation state vectors having even lower probabilities in the control group.
  • a low p-value score thereby generally corresponds to a methylation state vector which is relatively unexpected in comparison to other methylation state vectors within the control group, whereas a high p-value score generally corresponds to a methylation state vector which is relatively more expected in comparison to other methylation state vectors found in the control group.
  • the analytics system builds 360 a cumulative density function (CDF) with the p-value scores from the validation group.
  • the analytics system validates 370 consistency of the CDF as described above in the third type of validation test.
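Building the CDF from validation-group p-value scores can be sketched as below. The text does not specify which consistency check is applied to the CDF, so the comparison against a uniform distribution (`max_deviation_from_uniform`) is only one plausible check, assumed here for illustration; both function names are hypothetical.

```python
from bisect import bisect_right

def empirical_cdf(p_values):
    """Return F, where F(x) is the fraction of p-value scores <= x."""
    ordered = sorted(p_values)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return cdf

def max_deviation_from_uniform(p_values):
    """Largest gap between the empirical CDF and the line F(x) = x.

    Assumed check: a large gap may hint that the control group does not
    generalize to the validation group.
    """
    ordered = sorted(p_values)
    n = len(ordered)
    return max(
        max(abs((i + 1) / n - p), abs(i / n - p))
        for i, p in enumerate(ordered)
    )

cdf = empirical_cdf([0.1, 0.2, 0.5, 0.9])
fraction_at_or_below = cdf(0.2)
deviation = max_deviation_from_uniform([0.25, 0.5, 0.75, 1.0])
```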
  • Anomalously methylated fragments having abnormal methylation patterns in cancer patient samples, subjects with cancer of a TOO, subjects with a known cancer type, or subjects with another known disease state, are selected as target genomic regions, according to an embodiment as outlined in FIG. 4. An exemplary process of selecting anomalously methylated fragments 440 is visually illustrated in FIG. 5 and is further described below, following the description of FIG. 4.
  • the analytics system generates 100 methylation state vectors from cfDNA fragments of the sample.
  • the analytics system handles each methylation state vector as follows.
  • the analytics system enumerates 410 all possibilities of methylation state vectors having the same starting CpG site and same length (i.e., set of CpG sites) in the methylation state vector.
  • as each methylation state may be methylated or unmethylated, there are only two possible states at each CpG site, and thus the count of distinct possibilities of methylation state vectors depends on a power of 2, such that a methylation state vector of length n would be associated with 2^n possibilities of methylation state vectors.
  • the analytics system calculates 420 the probability of observing each possibility of methylation state vector for the identified starting CpG site / methylation state vector length by accessing the healthy control group data structure.
  • calculating the probability of observing a given possibility uses a Markov chain probability to model the joint probability calculation which will be described in greater detail with respect to FIG. 5 below.
  • calculation methods other than Markov chain probabilities are used to determine the probability of observing each possibility of methylation state vector.
  • the analytics system calculates 430 a p-value score for the methylation state vector using the calculated probabilities for each possibility. In one embodiment, this includes identifying the calculated probability corresponding to the possibility that matches the methylation state vector in question. Specifically, this is the possibility having the same set of CpG sites, or similarly the same starting CpG site and length as the methylation state vector. The analytics system sums the calculated probabilities of any possibilities having probabilities less than or equal to the identified probability to generate the p-value score.
  • This p-value represents the probability of observing the methylation state vector of the fragment or other methylation state vectors even less probable in the healthy control group.
  • a low p-value score thereby generally corresponds to a methylation state vector which is rare in a healthy subject, and which causes the fragment to be labeled abnormally methylated, relative to the healthy control group.
  • a high p-value score generally relates to a methylation state vector that is expected to be present, in a relative sense, in a healthy subject. If the healthy control group is a non-cancerous group, for example, a low p-value indicates that the fragment is abnormally methylated relative to the non-cancer group, and therefore possibly indicative of the presence of cancer in the test subject.
  • the analytics system calculates p-value scores for each of a plurality of methylation state vectors, each representing a cfDNA fragment in the test sample.
  • the analytics system may filter 440 the set of methylation state vectors based on their p-value scores. In one embodiment, filtering is performed by comparing the p-value scores against a threshold and keeping only those fragments below the threshold. This threshold p-value score could be on the order of 0.1, 0.01, 0.001, 0.0001, or similar.
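Steps 410 to 440 can be sketched as follows. The sketch assumes a caller-supplied `prob_of` function standing in for the lookup against the healthy control group data structure; the toy model `toy_prob` (each site independently methylated with probability 0.9) is purely hypothetical and ignores the starting CpG site.

```python
from itertools import product

def p_value_score(observed, prob_of):
    """Sum the probabilities of all same-length possibilities that are
    no more probable than the observed methylation state vector."""
    p_obs = prob_of(observed)
    total = 0.0
    for states in product("MU", repeat=len(observed)):  # 2**n possibilities
        p = prob_of("".join(states))
        if p <= p_obs:
            total += p
    return total

def filter_anomalous(vectors, prob_of, threshold=0.001):
    """Keep only vectors whose p-value score falls below the threshold."""
    return [v for v in vectors if p_value_score(v, prob_of) < threshold]

def toy_prob(states):
    # Hypothetical stand-in for the control-group model.
    return 0.9 ** states.count("M") * 0.1 ** states.count("U")

score_muu = p_value_score("MUU", toy_prob)   # MUU, UMU, UUM and UUU qualify
kept = filter_anomalous(["MMM", "MUU", "UUU"], toy_prob, threshold=0.01)
```

In this toy setting only the fully unmethylated vector survives the 0.01 threshold, because it is the single least probable possibility.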
  • FIG. 5 is an illustration 500 of an example p-value score calculation, according to an embodiment.
  • the analytics system takes that test methylation state vector 505 and enumerates 410 possibilities of methylation state vectors.
  • the test methylation state vector 505 is <M23, M24, M25, U26>.
  • as the length of the test methylation state vector 505 is 4, there are 2^4 possibilities of methylation state vectors encompassing CpG sites 23 - 26.
  • the number of possibilities of methylation state vectors is 2^n, where n is the length of the test methylation state vector or alternatively the length of the sliding window (described further below).
  • the analytics system calculates 420 probabilities 515 for the enumerated possibilities of methylation state vectors.
  • as methylation is conditionally dependent on the methylation status of nearby CpG sites, one way to calculate the probability of observing a given methylation state vector possibility is to use a Markov chain model.
  • a methylation state vector such as <S_1, S_2, . . . , S_n>, where S denotes the methylation state whether methylated (denoted as M), unmethylated (denoted as U), or indeterminate (denoted as I), has a joint probability that can be expanded using the chain rule of probabilities as: P(<S_1, S_2, . . . , S_n>) = P(S_n | S_1, . . . , S_{n-1}) * P(S_{n-1} | S_1, . . . , S_{n-2}) * . . . * P(S_2 | S_1) * P(S_1).
  • the analytics system selects a Markov chain order k which corresponds to how many prior CpG sites in the vector (or window) to consider in the conditional probability calculation, such that the conditional probability is modeled as P(S_n | S_1, . . . , S_{n-1}) ≈ P(S_n | S_{n-k-2}, . . . , S_{n-1}).
  • the analytics system accesses the control group’s data structure, specifically the counts of various strings of CpG sites and states.
  • to calculate each Markov-modeled probability P(M_n | S_{n-k-2}, . . . , S_{n-1}), the analytics system takes a ratio of the stored count of the number of strings from the data structure matching <S_{n-k-2}, . . . , S_{n-1}, M_n> divided by the sum of the stored counts of the number of strings from the data structure matching <S_{n-k-2}, . . . , S_{n-1}, M_n> and <S_{n-k-2}, . . . , S_{n-1}, U_n>.
  • the calculation may additionally implement a smoothing of the counts by applying a prior distribution.
  • the prior distribution is a uniform prior as in Laplace smoothing.
  • a constant is added to the numerator and another constant (e.g., twice the constant in the numerator) is added to the denominator of the above equation.
  • an algorithmic technique such as Kneser-Ney smoothing is used.
  • the analytics system calculates 430 a p-value score 525 that sums the probabilities that are less than or equal to the probability of the possibility of methylation state vector matching the test methylation state vector 505.
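The Markov chain calculation with count smoothing can be sketched as below. Several points are assumptions for illustration: counts are keyed by `(first CpG site, string)`, the context is simply the previous `k` sites, indeterminate states are not handled, and the smoothing adds `alpha` to the numerator and `2 * alpha` to the denominator in the spirit of the uniform prior described above. The toy counts are invented.

```python
from itertools import product

def markov_probability(start, states, counts, k=2, alpha=1.0):
    """Joint probability of `states` beginning at CpG site `start`.

    Each factor is P(state | up to k previous states), estimated as a
    smoothed ratio of stored string counts.
    """
    prob = 1.0
    for i, state in enumerate(states):
        ctx_start = max(0, i - k)
        context = states[ctx_start:i]
        site = start + ctx_start  # first CpG site of the context string
        n_m = counts.get((site, context + "M"), 0)
        n_u = counts.get((site, context + "U"), 0)
        n_state = n_m if state == "M" else n_u
        prob *= (n_state + alpha) / (n_m + n_u + 2 * alpha)
    return prob

# Invented control-group counts around CpG site 23.
toy_counts = {
    (23, "M"): 8, (23, "U"): 2,
    (23, "MM"): 6, (23, "MU"): 2,
    (23, "UM"): 1, (23, "UU"): 1,
}
p_mm = markov_probability(23, "MM", toy_counts, k=1)
total = sum(
    markov_probability(23, "".join(s), toy_counts, k=1)
    for s in product("MU", repeat=2)
)
```

Because every conditional factor is normalized over the two states, the probabilities of all enumerated possibilities sum to one, which is what the p-value summation relies on.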
  • the computational burden of calculating probabilities and/or p-value scores may be further reduced by caching at least some calculations.
  • the analytics system may cache in transitory or persistent memory calculations of probabilities for possibilities of methylation state vectors (or windows thereof). If other fragments have the same CpG sites, caching the possibility probabilities allows for efficient calculation of p-value scores without needing to re-calculate the underlying possibility probabilities.
  • the analytics system may calculate p-value scores for each of the possibilities of methylation state vectors associated with a set of CpG sites from a vector (or window thereof). The analytics system may cache the p-value scores for use in determining the p-value scores of other fragments including the same CpG sites.
  • the p-value scores of possibilities of methylation state vectors having the same CpG sites may be used to determine the p-value score of a different one of the possibilities from the same set of CpG sites.
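One way to realize this caching is to memoize a table of p-value scores per set of CpG sites, as in the sketch below. The use of `functools.lru_cache`, the names `make_cached_scorer` and `toy_prob`, and the toy probability model are assumptions for the example.

```python
from functools import lru_cache
from itertools import product

def make_cached_scorer(prob_of):
    """Build a scorer that computes each (start, length) table only once."""

    @lru_cache(maxsize=None)
    def p_value_table(start, length):
        # Probability of every possibility at this set of CpG sites.
        probs = {
            "".join(s): prob_of(start, "".join(s))
            for s in product("MU", repeat=length)
        }
        values = list(probs.values())
        # p-value score: sum of probabilities no larger than the entry's own.
        return {s: sum(q for q in values if q <= p) for s, p in probs.items()}

    def score(start, states):
        return p_value_table(start, len(states))[states]

    return score, p_value_table

calls = []

def toy_prob(start, states):
    # Hypothetical model; `calls` records how often it is evaluated.
    calls.append((start, states))
    return 0.9 ** states.count("M") * 0.1 ** states.count("U")

score, table = make_cached_scorer(toy_prob)
first = score(5, "MU")    # computes the table for sites 5 and 6
second = score(5, "UU")   # served from the cached table
```

The second lookup reuses the cached table, so the underlying probabilities are evaluated only once per set of CpG sites.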
  • the analytics system uses 435 a sliding window to determine possibilities of methylation state vectors and calculate p-values. Rather than enumerating possibilities and calculating p-values for entire methylation state vectors, the analytics system enumerates possibilities and calculates p-values for only a window of sequential CpG sites, where the window is shorter in length (of CpG sites) than at least some fragments (otherwise, the window would serve no purpose).
  • the window length may be static, user determined, dynamic, or otherwise selected.
  • In calculating p-values for a methylation state vector larger than the window, the window identifies the sequential set of CpG sites from the vector within the window starting from the first CpG site in the vector.
  • the analytics system calculates a p-value score for the window including the first CpG site.
  • the analytics system then “slides” the window to the second CpG site in the vector, and calculates another p-value score for the second window.
  • for a window of length l and a methylation state vector of length m, each methylation state vector will generate m-l+1 p-value scores.
  • the analytics system aggregates the p-value scores for the methylation state vectors to generate an overall p-value score.
  • Each of the 50 calculations enumerates 2^5 (32) possibilities of methylation state vectors, which in total results in 50×2^5 (1.6×10^3) probability calculations. This results in a vast reduction of calculations to be performed, with no meaningful hit to the accurate identification of anomalous fragments.
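The sliding-window bookkeeping can be sketched as follows. The vector length of 54 is a hypothetical value chosen so that a window of 5 yields the 50 calculations mentioned above, and taking the minimum as the aggregate is an assumption, since the text only says the scores are aggregated. `score_window` is a placeholder for any per-window p-value routine.

```python
def sliding_window_scores(states, window, score_window):
    """Score every window of `window` consecutive CpG sites.

    `score_window(offset, window_states)` is a caller-supplied function.
    Vectors no longer than the window are scored as a whole.
    """
    if len(states) <= window:
        return [score_window(0, states)]
    return [
        score_window(i, states[i:i + window])
        for i in range(len(states) - window + 1)
    ]

# Placeholder scorer that returns a constant; a real one would consult
# the control-group data structure.
scores = sliding_window_scores("M" * 54, 5, lambda offset, w: 1.0)

n_windows = len(scores)                      # m - l + 1 = 54 - 5 + 1
n_probability_calcs = n_windows * 2 ** 5     # possibilities enumerated overall
overall = min(scores)                        # assumed aggregation rule
```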
  • This additional step can also be applied when validating 340 the control group with the validation group’s methylation state vectors.
Identifying fragments indicative of cancer
  • the analytics system identifies 450 DNA fragments indicative of cancer from the filtered set of anomalously methylated fragments.
  • the analytics system may identify DNA fragments that are deemed hypomethylated or hypermethylated as fragments indicative of cancer from the filtered set of anomalously methylated fragments.
  • Hypermethylated and hypomethylated fragments can be defined as fragments of a certain length of CpG sites (e.g., more than 3, 4, 5, 6, 7, 8, 9, 10, etc.) with, respectively, a high percentage of methylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage within the range of 50%-100%) or a high percentage of unmethylated CpG sites (e.g., more than 80%, 85%, 90%, or 95%, or any other percentage within the range of 50%-100%).
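A minimal sketch of this labeling rule is shown below. The function name and the default cutoffs (at least 5 CpG sites, at least 80% in one state) are picked from the example values in the text and are not the only possible settings.

```python
def classify_fragment(states, min_sites=5, min_fraction=0.8):
    """Label a fragment by the share of methylated or unmethylated sites.

    `states` is a string over {'M', 'U'}. Returns 'hypermethylated',
    'hypomethylated', or None when neither rule applies. `min_fraction`
    should exceed 0.5 so the two labels are mutually exclusive.
    """
    n = len(states)
    if n < min_sites:
        return None
    if states.count("M") / n >= min_fraction:
        return "hypermethylated"
    if states.count("U") / n >= min_fraction:
        return "hypomethylated"
    return None

labels = [classify_fragment(s) for s in ("MMMMU", "UUUUU", "MMUU", "MMMUU")]
```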
  • the analytics system identifies fragments indicative of cancer utilizing probabilistic models of methylation patterns fitted to each cancer type and non-cancer type.
  • the analytics system calculates log-likelihood ratios for a sample using DNA fragments in the genomic regions considering the various cancer types with the fitted probabilistic models for each cancer type and non-cancer type.
  • the analytics system may determine a DNA fragment to be indicative of cancer based on whether at least one of the log-likelihood ratios considered against the various cancer types is above a threshold value.
  • the analytics system partitions the genome into regions by multiple stages.
  • the analytics system separates the genome into blocks of CpG sites. Each block is defined when there is a separation between two adjacent CpG sites that exceeds some threshold, e.g., greater than 200 bp, 300 bp, 400 bp, 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, or 1,000 bp.
  • the analytics system subdivides at a second stage each block into regions of a certain length, e.g., 500 bp, 600 bp, 700 bp, 800 bp, 900 bp, 1,000 bp, 1,100 bp, 1,200 bp, 1,300 bp, 1,400 bp, or 1,500 bp.
  • the analytics system may further overlap adjacent regions by a percentage of the length, e.g., 10%, 20%, 30%, 40%, 50%, or 60%.
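The two-stage partition can be sketched as below. The gap threshold, region length, and overlap are example values taken from the ranges listed above; the half-open `(start, end)` coordinates and both function names are assumptions.

```python
def split_into_blocks(cpg_positions, max_gap=500):
    """Stage one: start a new block wherever adjacent CpG sites are
    separated by more than `max_gap` base pairs."""
    if not cpg_positions:
        return []
    blocks, current = [], [cpg_positions[0]]
    for pos in cpg_positions[1:]:
        if pos - current[-1] > max_gap:
            blocks.append(current)
            current = [pos]
        else:
            current.append(pos)
    blocks.append(current)
    return blocks

def tile_block(start, end, region_len=1000, overlap=0.5):
    """Stage two: cut [start, end) into regions of `region_len` bp, each
    overlapping its neighbor by the given fraction (0 <= overlap < 1)."""
    step = max(1, int(region_len * (1 - overlap)))
    regions, pos = [], start
    while True:
        regions.append((pos, min(pos + region_len, end)))
        if pos + region_len >= end:
            break
        pos += step
    return regions

blocks = split_into_blocks([100, 150, 900, 950, 1000], max_gap=500)
regions = tile_block(0, 2500, region_len=1000, overlap=0.5)
```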
  • the analytics system analyzes sequence reads derived from DNA fragments for each region.
  • the analytics system may process samples from tissue and/or high-signal cfDNA.
  • High-signal cfDNA samples may be determined by a binary classification model, by cancer stage, or by another metric.
  • each probabilistic model is a mixture model comprising a combination of a plurality of mixture components with each mixture component being an independent-sites model where methylation at each CpG site is assumed to be independent of methylation statuses at other CpG sites.
  • calculation is performed with respect to each CpG site.
  • a first count is determined that is the number of cancerous samples (cancer count) that include an anomalously methylated DNA fragment overlapping that CpG site.
  • a second count is determined that is the total number of samples containing fragments overlapping that CpG (total) in the set.
  • Genomic regions can be selected based on the numbers, for example, based on criteria positively correlated to the number of cancerous samples (cancer count) that include a DNA fragment overlapping that CpG, and inversely correlated to the total number of samples containing fragments overlapping that CpG (total) in the set.
  • Cancer of various types having different TOO can be selected from the group consisting of: breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, anal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, liver/bile-duct cancer, esophageal cancer, pancreatic cancer, squamous cell cancer of the upper gastrointestinal tract, upper gastrointestinal cancer other than squamous, head and neck cancer, lung cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer and cancer other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, plasma cell neoplasm, multiple myeloma, myeloid
  • cancer types can be classified and labeled using classification methods available in the art, such as the International Classification of Diseases for Oncology (ICD-O-3) (codes.iarc.fr) or the Surveillance, Epidemiology, and End Results Program (SEER) (seer.cancer.gov).
  • cancer types are classified in three orthogonal codes: (i) topographical codes, (ii) morphological codes, and (iii) behavioral codes. Under behavioral codes, benign tumor is 0; uncertain behavior is 1; carcinoma in situ is 2; malignant, primary site is 3; and malignant, metastatic site is 6.
  • a cancer TOO can be selected from a group defined by the guideline that will be used to stage a detected cancer.
  • the reference Amin, M.B., Edge, S., Greene, F., Byrd, D.R., Brookland, R.K., Washington, M.K., Gershenwald, J.E., Compton, C.C., Hess, K.R., Sullivan, D.C., Jessup, J.M., Brierley, J.D., Gaspar, L.E., Schilsky, R.L., Balch, C.M., Winchester, D.P., Asare, E.A., Madera, M., Gress, D.M., Meyer, L.R. (Eds.), AJCC Cancer Staging Manual, 8th edition, Springer, 2017, identifies groups of different cancers that are staged together following standard guidelines. Typically, such staging is a next step
  • the analytics system can further calculate log-likelihood ratios (“R”) for a fragment indicating a likelihood of the fragment being indicative of cancer considering the various cancer types with the fitted probabilistic models for each cancer type and non-cancer type, or for a cancer TOO.
  • the two probabilities may be taken from probabilistic models fitted for each of the cancer types and the non-cancer type, the probabilistic models defined to calculate a likelihood of observing a methylation pattern on a fragment given each of the cancer types and the non-cancer type.
  • the probabilistic models may be defined and fitted for each of the cancer types and the non-cancer type.
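A log-likelihood ratio under mixture models of independent-sites components can be sketched as follows. The model layout (a list of `(weight, {site: P(methylated)})` components), the function names, and all parameter values are hypothetical; how the models are fitted is outside this sketch.

```python
import math

def fragment_likelihood(states, sites, model):
    """Likelihood of a fragment under a mixture of independent-sites
    components; `model` is a list of (weight, {site: P(methylated)})."""
    total = 0.0
    for weight, beta in model:
        p = weight
        for site, state in zip(sites, states):
            b = beta[site]
            p *= b if state == "M" else 1.0 - b
        total += p
    return total

def log_likelihood_ratio(states, sites, cancer_model, non_cancer_model):
    """R = log P(fragment | cancer type) - log P(fragment | non-cancer)."""
    return (math.log(fragment_likelihood(states, sites, cancer_model))
            - math.log(fragment_likelihood(states, sites, non_cancer_model)))

# Hypothetical single-component models over CpG sites 1 and 2.
cancer_model = [(1.0, {1: 0.9, 2: 0.9})]
non_cancer_model = [(1.0, {1: 0.1, 2: 0.1})]
ratio = log_likelihood_ratio("MM", [1, 2], cancer_model, non_cancer_model)

# A two-component mixture: half the components fully methylated at site 1.
mixed = fragment_likelihood("M", [1], [(0.5, {1: 1.0}), (0.5, {1: 0.0})])
```

A fragment could then be flagged as indicative of cancer when `ratio` exceeds a chosen threshold for at least one cancer type.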
  • the analytics system identifies 460 genomic regions indicative of cancer. To identify these informative regions, the analytics system calculates an information gain for each genomic region or more specifically each CpG site that describes an ability to distinguish between various outcomes.
  • a method for identifying genomic regions capable of distinguishing between cancer type and non-cancer type utilizes a trained classification model that can be applied on the set of anomalously methylated DNA molecules or fragments corresponding to, or derived from a cancerous or non-cancerous group.
  • the trained classification model can be trained to identify any condition of interest that can be identified from the methylation state vectors.
  • the trained classification model is a binary classifier trained based on methylation states for cfDNA fragments or genomic sequences obtained from a subject cohort with cancer or a cancer TOO, and a healthy subject cohort without cancer, and is then used to classify a test subject’s probability of having cancer, a cancer TOO, or not having cancer, based on anomalous methylation state vectors.
  • different classifiers may be trained using subject cohorts known to have particular cancer (e.g., breast, lung, prostate, etc.); known to have cancer of particular TOO where the cancer is believed to originate; or known to have different stages of particular cancer (e.g., breast, lung, prostate, etc.).
  • different classifiers may be trained using sequence reads obtained from samples enriched for tumor cells from subject cohorts known to have particular cancer (e.g., breast, lung, prostate, etc.).
  • Each genomic region’s ability to distinguish between cancer type and non-cancer type in the classification model is used to rank the genomic regions from most informative to least informative in classification performance.
  • the analytics system may identify genomic regions from the ranking according to information gain in classification between non-cancer type and cancer type.
  • the analytics system may train a classifier according to a process 600 illustrated in FIG. 6A, according to an embodiment.
  • the process 600 accesses two training groups of samples - a non-cancer group and a cancer group - and obtains 605 a non-cancer set of methylation state vectors and a cancer set of methylation state vectors comprising anomalously methylated fragments, e.g., via step 440 from the process 400.
  • the analytics system determines 610, for each methylation state vector, whether the methylation state vector is indicative of cancer.
  • fragments indicative of cancer may be defined as hypermethylated or hypomethylated fragments determined if at least some number of CpG sites have a particular state (methylated or unmethylated, respectively) and/or have a threshold percentage of sites that are the particular state (again, methylated or unmethylated, respectively).
  • cfDNA fragments are identified as hypomethylated or hypermethylated, respectively, if the fragment overlaps at least 5 CpG sites, and at least 80%, 90%, or 100% of its CpG sites are unmethylated or at least 80%, 90%, or 100% are methylated.
  • the process considers portions of the methylation state vector and determines whether the portion is hypomethylated or hypermethylated, and may distinguish that portion to be hypomethylated or hypermethylated. This alternative avoids missing methylation state vectors that are large in size but contain at least one region of dense hypomethylation or hypermethylation. This process of defining hypomethylation and hypermethylation can be applied in step 450 of FIG. 4.
  • the fragments indicative of cancer may be defined according to likelihoods outputted from trained probabilistic models.
  • the analytics system generates 620 a hypomethylation score (Phypo) and a hypermethylation score (Phyper) per CpG site in the genome.
  • the classifier takes four counts at that CpG site - (1) count of (methylation state) vectors of the cancer set labeled hypomethylated that overlap the CpG site; (2) count of vectors of the cancer set labeled hypermethylated that overlap the CpG site; (3) count of vectors of the non-cancer set labeled hypomethylated that overlap the CpG site; and (4) count of vectors of the non-cancer set labeled hypermethylated that overlap the CpG site.
  • the process may normalize these counts for each group to account for variance in group size between the non-cancer group and the cancer group.
  • the scores may be more broadly defined as counts of fragments indicative of cancer at each genomic region and/or CpG site.
  • to generate the hypomethylation score at a given CpG site, the process takes a ratio of (1) over (1) summed with (3).
  • the hypermethylation score is calculated by taking a ratio of (2) over (2) summed with (4). Additionally, these ratios may be calculated with an additional smoothing technique as discussed above.
  • the hypomethylation score and the hypermethylation score relate to an estimate of cancer probability given the presence of hypomethylation or hypermethylation of fragments from the cancer set.
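The per-site score calculation can be sketched as below, with the four counts numbered as in the text. The smoothing constant `k` (added to the numerator, with `2 * k` added to the denominator) is an assumed choice in line with the smoothing discussed earlier, and the function name is hypothetical.

```python
def site_scores(cancer_hypo, cancer_hyper, non_hypo, non_hyper, k=1.0):
    """Return (P_hypo, P_hyper) for one CpG site.

    Arguments are counts (1), (2), (3), (4) from the text. With k > 0
    the ratios stay defined even when a site has no labeled vectors.
    """
    p_hypo = (cancer_hypo + k) / (cancer_hypo + non_hypo + 2 * k)
    p_hyper = (cancer_hyper + k) / (cancer_hyper + non_hyper + 2 * k)
    return p_hypo, p_hyper

# A site covered by 3 hypomethylated cancer vectors and 1 hypomethylated
# non-cancer vector, with no hypermethylated vectors at all.
p_hypo, p_hyper = site_scores(3, 0, 1, 0)
```

With no hypermethylated evidence at the site, the smoothed hypermethylation score falls back to 0.5 rather than being undefined.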
  • the analytics system generates 630 an aggregate hypomethylation score and an aggregate hypermethylation score for each anomalous methylation state vector.
  • the aggregate hypermethylation and hypomethylation scores are determined based on the hypermethylation and hypomethylation scores of the CpG sites in the methylation state vector.
  • the aggregate hypermethylation and hypomethylation scores are assigned as the largest hypermethylation and hypomethylation scores of the sites in each state vector, respectively.
  • the aggregate scores could be based on means, medians, or other calculations that use the hypermethylation/hypomethylation scores of the sites in each vector.
  • the analytics system ranks 640 all of that subject’s methylation state vectors by their aggregate hypomethylation score and by their aggregate hypermethylation score, resulting in two rankings per subject.
  • the process selects aggregate hypomethylation scores from the
  • With the selected scores, the classifier generates 650 a single feature vector for each subject.
  • the scores selected from either ranking are selected with a fixed order that is the same for each generated feature vector for each subject in each of the training groups.
  • the classifier selects the first, the second, the fourth, and the eighth aggregate hypermethylation score, and similarly for each aggregate hypomethylation score, from each ranking and writes those scores in the feature vector for that subject.
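Steps 630 to 650 can be sketched as follows. The aggregate is the maximum site score, as in the embodiment above; padding with 0.0 when a subject has fewer vectors than the deepest rank is an assumption, as are the function names and the toy scores.

```python
def aggregate_score(vector_sites, site_score):
    """Aggregate score of a vector: the largest score among its CpG sites."""
    return max(site_score[s] for s in vector_sites)

def feature_vector(hyper_scores, hypo_scores, ranks=(1, 2, 4, 8)):
    """Pick the scores at fixed ranks from each descending ranking.

    The same rank order is used for every subject so that feature
    positions are comparable across the training groups.
    """
    def pick(scores):
        ordered = sorted(scores, reverse=True)
        # Assumed padding: 0.0 where the ranking is too short.
        return [ordered[r - 1] if r <= len(ordered) else 0.0 for r in ranks]

    return pick(hyper_scores) + pick(hypo_scores)

agg = aggregate_score([1, 2, 3], {1: 0.2, 2: 0.7, 3: 0.5})
features = feature_vector(
    hyper_scores=[0.1, 0.9, 0.5, 0.7, 0.3, 0.2, 0.8, 0.6],
    hypo_scores=[0.4, 0.2],
)
```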
  • the analytics system trains 660 a binary classifier to distinguish feature vectors between the cancer and non-cancer training groups.
  • the classifier is a non-linear classifier.
  • the classifier is a non-linear classifier utilizing an L2-regularized kernel logistic regression with a Gaussian radial basis function (RBF) kernel.
  • the number of non-cancer samples or different cancer type(s) (n_other) and the number of cancer samples or cancer type(s) (n_cancer) having an anomalously methylated fragment overlapping a CpG site are counted. Then the probability that a sample is cancer is estimated by a score (“S”) that positively correlates with n_cancer and inversely correlates with n_other.
  • the score can be calculated using the equation: (n_cancer + 1) / (n_cancer + n_other + 2) or (n_cancer) / (n_cancer + n_other).
  • the analytics system computes 670 an information gain for each cancer type and for each genomic region or CpG site to determine whether the genomic region or CpG site is indicative of cancer.
  • the information gain is computed for training samples with a given cancer type compared to all other samples.
  • two random variables, ‘anomalous fragment’ (‘AF’) and ‘cancer type’ (‘CT’), are used.
  • AF is a binary variable indicating whether there is an anomalous fragment overlapping a given CpG site in a given sample as determined for the anomaly score / feature vector above.
  • CT is a random variable indicating whether the cancer is of a particular type.
  • the analytics system computes the mutual information with respect to CT given AF. That is, how many bits of information about the cancer type are gained if it is known whether there is an anomalous fragment overlapping a particular CpG site.
  • the analytics system uses this information to rank CpG sites based on how cancer specific they are. This procedure is repeated for all cancer types under consideration. If a particular region is commonly anomalously methylated in training samples of a given cancer but not in training samples of other cancer types or in healthy training samples, then CpG sites overlapped by those anomalous fragments will tend to have high information gains for the given cancer type.
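The information gain between AF and CT can be sketched as a mutual-information calculation over a 2×2 table of sample counts. The table layout `table[af][ct]` and the function name are assumptions, and the counts below are invented to show the two extremes.

```python
import math

def mutual_information_bits(table):
    """Mutual information (in bits) between AF and CT.

    `table[af][ct]` is the number of training samples with AF = af
    (anomalous fragment overlaps the CpG site) and CT = ct (sample is
    of the cancer type under consideration), for af, ct in {0, 1}.
    """
    total = sum(sum(row) for row in table)
    mi = 0.0
    for af in (0, 1):
        for ct in (0, 1):
            joint = table[af][ct] / total
            if joint == 0:
                continue  # empty cells contribute nothing
            p_af = sum(table[af]) / total
            p_ct = (table[0][ct] + table[1][ct]) / total
            mi += joint * math.log2(joint / (p_af * p_ct))
    return mi

# AF perfectly tracks the cancer type: one full bit is gained.
informative = mutual_information_bits([[50, 0], [0, 50]])
# AF is unrelated to the cancer type: nothing is gained.
uninformative = mutual_information_bits([[25, 25], [25, 25]])
```

Ranking CpG sites by this quantity, once per cancer type, gives the ordering from which sites are greedily selected.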
  • the ranked CpG sites for each cancer type are greedily added (selected) to a selected set of CpG sites based on their rank for use in the cancer classifier.
Computing pairwise information gain from fragments indicative of cancer identified from probabilistic models
  • the analytics system may identify genomic regions according to the process 680 in FIG. 6B.
  • the analytics system defines 690 a feature vector for each sample, for each region, for each cancer type by a count of DNA fragments that have a calculated log-likelihood ratio that the fragment is indicative of cancer above a plurality of thresholds, wherein each count is a value in the feature vector.
  • the analytics system counts the number of fragments present in a sample at a region for each cancer type with log-likelihood ratios above one or a plurality of possible threshold values.
  • the analytics system defines a feature vector for each sample, by a count of DNA fragments for each genomic region for each cancer type that provides a calculated log-likelihood ratio for the fragment above a plurality of thresholds, wherein each count is a value in the feature vector.
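The count-above-threshold features can be sketched as below for a single sample, region, and cancer type. The threshold values and the function name are hypothetical; the text only states that a plurality of thresholds is used.

```python
def threshold_counts(llrs, thresholds=(1.0, 2.0, 4.0)):
    """Count fragments whose log-likelihood ratio exceeds each threshold.

    `llrs` holds the ratios of the fragments falling in one region for
    one cancer type; each returned count is one feature-vector value.
    """
    return [sum(1 for r in llrs if r > t) for t in thresholds]

region_features = threshold_counts([0.5, 1.5, 2.5, 5.0])
```

Concatenating these counts over all regions and cancer types would give the per-sample feature vector described above.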
  • the analytics system uses the defined feature vectors to calculate an informative score for each genomic region describing that genomic region’s ability to distinguish between each pair of cancer types. For each pair of cancer types, the analytics system ranks regions based on the informative scores. The analytics system may select regions based on the ranking according to informative scores.
  • the analytics system calculates 695 an informative score for each region describing that region’s ability to distinguish between each pair of cancer types.
  • the analytics system may specify one type as a positive type and the other as a negative type.
  • a region’s ability to distinguish between the positive type and the negative type is based on mutual information, calculated using the estimated fraction of cfDNA samples of the positive type and of the negative type for which the feature would be expected to be non-zero in the final assay, i.e., at least one fragment of that tier would be sequenced in a targeted methylation assay.
  • Those fractions are estimated using the observed rates at which the feature occurs in healthy cfDNA, and in high-signal cfDNA and/or tumor samples of each cancer type. For example, if a feature occurs frequently in healthy cfDNA, then it will also be estimated to occur frequently in cfDNA of any cancer type, and would likely result in a low informative score.
  • the analytics system may choose a certain number of regions for each pair of cancer types from the ranking, e.g., 1024.
  • the analytics system further identifies predominantly hypermethylated or hypomethylated regions from the ranking of regions.
  • the analytics system may load the set of fragments in the positive type(s) for a region that was identified as informative.
  • the analytics system evaluates whether the loaded fragments are predominantly hypermethylated or hypomethylated. If the loaded fragments are predominantly hypermethylated or hypomethylated, the analytics system may select probes corresponding to the predominant methylation pattern. If the loaded fragments are not predominantly hypermethylated or hypomethylated, the analytics system may use a mixture of probes for targeting both hypermethylation and hypomethylation.
  • the analytics system may further identify a minimal set of CpG sites that overlap more than some percentage of the fragments.
  • the analytics system, after ranking the regions based on informative scores, labels each region with the lowest informative ranking across all pairs of cancer types. For example, if a region was the 10th-most-informative region for distinguishing breast from lung, and the 5th-most-informative for distinguishing breast from colorectal, then it would be given an overall label of “5”.
  • the analytics system may design probes starting with the lowest-labeled regions while adding regions to the panel, e.g., until the panel’s size budget has been exhausted.
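  • The labeling and budget-limited selection described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `label_regions` and `fill_panel`, the per-region size table, and the rule of stopping at the first region that does not fit are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def label_regions(rankings: Dict[Tuple[str, str], List[str]]) -> Dict[str, int]:
    """Label each region with its lowest 1-based rank across all cancer-type pairs."""
    labels: Dict[str, int] = {}
    for ranked_regions in rankings.values():
        for rank, region in enumerate(ranked_regions, start=1):
            labels[region] = min(rank, labels.get(region, rank))
    return labels


def fill_panel(labels: Dict[str, int],
               region_sizes: Dict[str, int],
               size_budget: int) -> List[str]:
    """Add regions in order of increasing label until the size budget is exhausted.

    Ties are broken by region name (an assumption) so the result is deterministic.
    """
    panel: List[str] = []
    used = 0
    for region in sorted(labels, key=lambda r: (labels[r], r)):
        size = region_sizes[region]
        if used + size > size_budget:
            break
        panel.append(region)
        used += size
    return panel
```

  • With one ranking per cancer-type pair as input, a region ranked 10th for one pair and 5th for another receives the label 5, matching the example in the text.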
  • probes targeting selected genomic regions are further filtered 475 based on the number of their off-target regions. This is for screening probes that pull down too many cfDNA fragments corresponding to, or derived from, off-target genomic regions.
  • An off-target genomic region is a genomic region that has sufficient homology to a target genomic region, such that DNA molecules or fragments derived from off-target genomic regions are hybridized to and pulled down by a probe designed to hybridize to a target genomic region.
  • An off-target genomic region can be a genomic region (or a converted sequence of that same region) that aligns to a probe along at least 35bp, 40bp, 45bp, 50bp, 60bp, 70bp, or 80bp with at least an 80%, 85%, 90%, 95%, or 97% match rate.
  • an off-target genomic region is a genomic region (or a converted sequence of that same region) that aligns to a probe along at least 45bp with at least a 90% match rate.
  • Various methods known in the art can be adopted to screen off-target genomic regions.
  • a k-mer seeding strategy (which can allow one or more mismatches) is combined with local alignment at the seed locations.
  • exhaustive searching of good alignments can be guaranteed based on k-mer length, number of mismatches allowed, and number of k-mer seed hits at a particular location.
  • This requires performing dynamic programming local alignment at a large number of locations, so this approach is highly optimized to use vector CPU instructions (e.g., AVX2, AVX512) and can also be parallelized across many cores within a machine and across many machines connected by a network.
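  • A simplified sketch of seed-based off-target screening is shown below. It is an assumption-laden illustration only: seeds are exact k-mers (no mismatches), and the dynamic programming local alignment is replaced by an ungapped comparison along each seed diagonal, so insertions and deletions are not handled. The names `build_kmer_index`, `window_hit`, and `off_target_offsets` are hypothetical; the 45 bp and 90% defaults come from the embodiment described above.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_kmer_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the genome to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def window_hit(probe: str, genome: str, offset: int,
               min_len: int, min_identity: float) -> bool:
    """True if some ungapped window of min_len bases on this diagonal reaches min_identity.

    offset is the genome position aligned to probe position 0.
    """
    start = max(0, -offset)
    end = min(len(probe), len(genome) - offset)
    matches = [probe[i] == genome[i + offset] for i in range(start, end)]
    if len(matches) < min_len:
        return False
    count = sum(matches[:min_len])
    best = count
    for i in range(min_len, len(matches)):
        # Slide the window one base: add the new base, drop the oldest.
        count += matches[i] - matches[i - min_len]
        best = max(best, count)
    return best / min_len >= min_identity


def off_target_offsets(probe: str, genome: str, index: Dict[str, List[int]],
                       k: int, target_start: int,
                       min_len: int = 45, min_identity: float = 0.90) -> Set[int]:
    """Return genome offsets, other than the intended target, where the probe matches."""
    candidates: Set[int] = set()
    for i in range(len(probe) - k + 1):
        for pos in index.get(probe[i:i + k], ()):
            candidates.add(pos - i)  # diagonal implied by this seed hit
    candidates.discard(target_start)
    return {d for d in candidates if window_hit(probe, genome, d, min_len, min_identity)}
```

  • The number of offsets returned for a probe could then serve as its off-target count in the filtering step; a production screen would also search the converted sequences and both strands.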
  • probes having sequence homology with off-target genomic regions, or DNA molecules corresponding to, or derived from off-target genomic regions comprising more than a threshold number are excluded (or filtered) from the panel.
  • probes having sequence homology with off-target genomic regions, or DNA molecules corresponding to, or derived from off-target genomic regions from more than 30, more than 25, more than 20, more than 18, more than 15, more than 12, more than 10, or more than 5 off-target regions are excluded.
  • probes are divided into 2, 3, 4, 5, 6, or more separate groups depending on the numbers of off-target regions. For example, probes having sequence homology with no off-target regions or DNA molecules corresponding to, or derived from off-target regions are assigned to a high-quality group, probes having sequence homology with 1-18 off-target regions or DNA molecules corresponding to, or derived from 1-18 off-target regions, are assigned to a low-quality group, and probes having sequence homology with 19 or more off-target regions or DNA molecules corresponding to, or derived from 19 or more off-target regions, are assigned to a poor-quality group. Other cut-off values can be used for the grouping.
  • probes in the lowest quality group are excluded. In some embodiments, probes in groups other than the highest-quality group are excluded. In some embodiments, separate panels are made for the probes in each group. In some embodiments, all the probes are put on the same panel, but separate analysis is performed based on the assigned groups.
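  • The grouping and exclusion options above can be illustrated with a short sketch. The cut-offs follow the example in the text (0, 1-18, and 19 or more off-target regions); treating exactly 19 as poor-quality, and the names `quality_group` and `filter_probes`, are assumptions for illustration.

```python
from typing import Dict, List


def quality_group(n_off_target: int, low_max: int = 18) -> str:
    """Assign a probe to a quality group from its number of off-target regions."""
    if n_off_target == 0:
        return "high"
    if n_off_target <= low_max:
        return "low"
    return "poor"


def filter_probes(off_target_counts: Dict[str, int],
                  excluded_groups: List[str]) -> Dict[str, str]:
    """Return the retained probes with their groups, dropping the excluded groups."""
    kept: Dict[str, str] = {}
    for probe, count in off_target_counts.items():
        group = quality_group(count)
        if group not in excluded_groups:
            kept[probe] = group
    return kept
```

  • Passing `excluded_groups=["poor"]` corresponds to excluding only the lowest-quality group, while `["low", "poor"]` keeps only high-quality probes.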
  • a panel comprises a larger number of high-quality probes than the number of probes in lower groups. In some embodiments, a panel comprises a smaller number of poor-quality probes than the number of probes in other groups. In some embodiments, more than 95%, 90%, 85%, 80%, 75%, or 70% of probes in a panel are high-quality probes. In some embodiments, less than 35%, 30%, 20%, 10%, 5%, 4%, 3%, 2% or 1% of the probes in a panel are low-quality probes. In some embodiments, less than 5%, 4%, 3%, 2% or 1% of the probes in a panel are poor-quality probes. In some embodiments, no poor-quality probes are included in a panel.
  • probes having below 50%, below 40%, below 30%, below 20%, below 10% or below 5% are excluded. In some embodiments, probes having above 30%, above 40%, above 50%, above 60%, above 70%, above 80%, or above 90% are selectively included in a panel.
Methods of using cancer assay panel
  • methods of using a cancer assay panel comprise steps of treating DNA molecules or fragments to convert unmethylated cytosines to uracils (e.g., using bisulfite treatment), applying a cancer panel (as described herein) to the converted DNA molecules or fragments, enriching a subset of converted DNA molecules or fragments that hybridize (or bind) to the probes in the panel, and detecting the nucleic acid sequence and determining the methylation status thereof, for example, by sequencing the enriched cfDNA fragments.
  • the sequence reads can be compared to a reference genome (e.g., a human reference genome), allowing for identification of methylation states at a plurality of CpG sites within the DNA molecules or fragments and thus provide information relevant to detecting cancer.
  • while the present disclosure pays particular attention to sequencing-based approaches for detecting nucleic acids and determining methylation status thereof (via sequence reads), the disclosure is broad enough to encompass other methods for detecting nucleic acids and determining methylation status thereof, such as other methylation-aware sequencing approaches (e.g., as described in WO 2014/043763, which is incorporated herein by reference), DNA microarrays (e.g., with labeled probes adhered or conjugated to a solid surface or DNA array chip), etc.
  • the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information.
  • the alignment position information may indicate a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and end nucleotide base of a given sequence read.
  • Alignment position information may also include sequence read length, which can be determined from the beginning position and end position.
  • a region in the reference genome may be associated with a gene or a segment of a gene.
  • a sequence read is comprised of a read pair denoted as R_1 and R_2.
  • the first read R_1 may be sequenced from a first end of a nucleic acid fragment whereas the second read R_2 may be sequenced from the second end of the nucleic acid fragment. Therefore, nucleotide base pairs of the first read R_1 and second read R_2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome.
  • Alignment position information derived from the read pair R_1 and R_2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R_1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R_2).
  • the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds.
  • An output file having SAM (sequence alignment map) format or BAM (binary alignment map) format may be generated and output for further analysis.
  • the location and methylation state for each CpG site may be determined based on alignment to a reference genome. Further, a methylation state vector for each fragment may be generated specifying a location of the fragment in the reference genome (e.g., as specified by the position of the first CpG site in each fragment, or another similar metric), a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment, whether methylated (e.g., denoted as M), unmethylated (e.g., denoted as U), or indeterminate (e.g., denoted as I).
  • the methylation state vectors may be stored in temporary or persistent computer memory for later use and processing.
  • duplicate reads or duplicate methylation state vectors from a single subject may be removed.
  • it may be determined that a certain fragment has one or more CpG sites that have an indeterminate methylation status. Such fragments may be excluded from later processing or selectively included where downstream data model accounts for such indeterminate methylation statuses.
  • FIG. 7B is an illustration of the process 100 of FIG. 7A of sequencing a cfDNA fragment to obtain a methylation state vector, according to an embodiment.
  • the analytics system takes a cfDNA fragment 112.
  • the cfDNA fragment 112 contains three CpG sites.
  • the first and third CpG sites of the cfDNA fragment 112 are methylated 114.
  • the cfDNA fragment 112 is converted to generate a converted cfDNA fragment 122.
  • the second CpG site, which was unmethylated, has its cytosine converted to uracil. However, the first and third CpG sites are not converted.
  • a sequencing library 130 is prepared and sequenced 140 generating a sequence read 142.
  • the analytics system aligns 150 the sequence read 142 to a reference genome 144.
  • the reference genome 144 provides the context as to what position in a human genome the cfDNA fragment originates from.
  • the analytics system aligns 150 the sequence read such that the three CpG sites correlate to CpG sites 23, 24, and 25 (arbitrary reference identifiers used for convenience of description).
  • the analytics system thus generates information both on the methylation status of all CpG sites on the cfDNA fragment 112 and on the position in the human genome to which the CpG sites map.
  • the CpG sites on sequence read 142 which were methylated are read as cytosines.
  • the cytosines appear in the sequence read 142 only in the first and third CpG sites, which allows one to infer that the first and third CpG sites in the original cfDNA fragment were methylated.
  • the second CpG site is read as a thymine (U is converted to T during the sequencing process), and thus, one can infer that the second CpG site was unmethylated in the original cfDNA fragment.
  • With these two pieces of information, the methylation status and location, the analytics system generates 160 a methylation state vector 152 for the cfDNA fragment 112.
  • the resulting methylation state vector 152 is <M_23, U_24, M_25>, wherein M corresponds to a methylated CpG site, U corresponds to an unmethylated CpG site, and the subscript numbers correspond to positions of each CpG site in the reference genome.
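  • The inference illustrated in FIG. 7B can be sketched in a few lines. The sketch assumes a converted read aligned to the forward strand without insertions or deletions, and a lookup table from the reference position of each CpG cytosine to its CpG identifier; the function name, the toy sequence, and the coordinates are hypothetical.

```python
from typing import Dict, List


def methylation_state_vector(read: str, read_start: int,
                             cpg_sites: Dict[int, int]) -> List[str]:
    """Call M, U, or I at each CpG covered by a bisulfite-converted read.

    read_start is the reference position of the first base of the read, and
    cpg_sites maps the reference position of each CpG cytosine to its identifier.
    """
    vector: List[str] = []
    for ref_pos in sorted(cpg_sites):
        offset = ref_pos - read_start
        if not 0 <= offset < len(read):
            continue  # this CpG is not covered by the read
        base = read[offset]
        if base == "C":
            state = "M"  # a methylated cytosine is protected from conversion
        elif base == "T":
            state = "U"  # an unmethylated cytosine is converted to U and read as T
        else:
            state = "I"  # any other base leaves the state indeterminate
        vector.append(f"{state}{cpg_sites[ref_pos]}")
    return vector


# Toy reference "ACGTCGAACGT" starting at position 100, with CpG sites 23, 24, and 25.
# The second CpG was unmethylated, so its C is read as T after conversion.
print(methylation_state_vector("ACGTTGAACGT", 100, {101: 23, 104: 24, 108: 25}))
# prints ['M23', 'U24', 'M25']
```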
  • FIGs. 8A & 8B show three graphs of data validating consistency of sequencing from a control group.
  • the first graph 170 shows conversion accuracy of conversion of unmethylated cytosines to uracil (step 120) on cfDNA fragments obtained from a test sample across subjects in varying stages of cancer: stage 0, stage I, stage II, stage III, stage IV, and non-cancer. As shown, there was uniform consistency in converting unmethylated cytosines on cfDNA fragments into uracils. There was an overall conversion accuracy of 99.47% with a precision of ±0.024%.
  • the second graph 180 compares coverage (depth of sequencing) over varying stages of cancer. Counting only sequence reads that were confidently mapped to a reference genome, the mean coverage over all groups was ~34.
  • the third graph 190 shows the concentration of cfDNA per sample across varying stages of cancer.
  • Sequence reads obtained by the methods provided herein are further processed by automated algorithms.
  • the analytics system is used to receive sequencing data from a sequencer and perform various aspects of processing as described herein.
  • the analytics system can be one of a personal computer (PC), a desktop computer, a laptop computer, a notebook, a tablet PC, or a mobile device.
  • a computing device can be communicatively coupled to the sequencer through a wireless, wired, or a combination of wireless and wired communication technologies.
  • the computing device is configured with a processor and memory storing computer instructions that, when executed by the processor, cause the processor to perform steps as described in the remainder of this document.
  • the amount of genetic data and data derived therefrom is sufficiently large, and the amount of computational power required so great, so as to be impossible to be performed on paper or by the human mind alone.
  • the clinical interpretation of methylation status of targeted genomic regions is a process that includes classifying the clinical effect of each or a combination of the methylation status and reporting the results in ways that are meaningful to a medical professional.
  • the clinical interpretation can be based on comparison of the sequence reads with database specific to cancer or non-cancer subjects, and/or based on numbers and types of the cfDNA fragments having cancer-specific methylation patterns identified from a sample.
  • targeted genomic regions are ranked or classified based on their likelihood of being differentially methylated in cancer samples, and the ranks or classifications are used in the interpretation process.
  • the ranks and classifications can include (1) the type of clinical effect, (2) the strength of evidence of the effect, and (3) the size of the effect.
  • the clinical interpretation of the methylation states of such differentially methylated regions can be based on machine learning approaches that interpret a current sample based on a classification or regression method that was trained using the methylation states of such differentially methylated regions from samples from cancer and non-cancer patients with known cancer status, cancer type, cancer stage, TOO, etc.
  • the clinically meaningful information can include the presence or absence of cancer generally, presence or absence of certain types of cancers, cancer stage, or presence or absence of other types of diseases.
  • the information relates to a presence or absence of one or more cancer types, selected from the group consisting of breast cancer, endometrial cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of renal pelvis, renal cell carcinoma, prostate cancer, anorectal cancer, colorectal cancer, hepatocellular cancer, cholangiocarcinoma and hepatobiliary cancer, pancreatic cancer, upper GI adenocarcinoma, esophageal squamous cell cancer, head and neck cancer, squamous cell lung cancer, lung adenocarcinoma, small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, multiple myeloma, myeloid neoplasm, lymphoma, and leukemia.
  • the samples are not cancerous and are from subjects having white blood cell clonal expansion or no cancer.
  • the assay panel described herein can be used with a cancer type classifier that predicts a disease state for a sample, such as a cancer or non-cancer prediction, a tissue of origin prediction, and/or an indeterminate prediction.
  • the cancer type classifier can generate features based on sequence reads by taking into account methylated or unmethylated fragments of DNA at certain genomic areas of interest. For instance, if the cancer type classifier determines that a methylation pattern at a fragment resembles that of a certain cancer type, then the cancer type classifier can set a feature for that fragment as 1, and otherwise if no such fragment is present, then the feature can be set as 0.
  • the cancer type classifier can produce a set of binary features (merely by way of example, 30,000 features) for each sample. Further, in some examples, all or a portion of the set of binary features for a sample can be input into the cancer type classifier to provide a set of probability scores, such as one probability score per cancer type class and for a non-cancer type class. Furthermore, in some examples, the cancer type classifier can incorporate or otherwise be used in conjunction with thresholding to determine whether a sample is to be called as cancer or non-cancer, and/or indeterminate thresholding to reflect confidence in a specific TOO call. Such methods are described further below.
  • the analytics system can obtain a set of training samples.
  • each training sample includes fragment file(s) (e.g., file containing sequence read data), a label corresponding to a type of cancer (TOO) or non-cancer status of the sample, and/or sex of the individual of the sample.
  • the analytics system can utilize the training set to train the cancer type classifier to predict the disease state of the sample.
  • the analytics system divides the genome (e.g., whole genome) or a subset of the genome (e.g., targeted methylation regions) into regions.
  • portions of the genome can be separated into “blocks” of CpGs, whereby a new block begins whenever the separation between nearest-neighbor CpGs is at least a minimum separation distance (e.g., at least 500 bp).
  • each block can be divided into 1000 bp regions and positioned such that neighboring regions have a certain amount (e.g., 50% or 500 bp) of overlap.
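  • The block and region construction described above might be sketched as follows. The text does not specify how regions are anchored within a block, so starting the first region at the block's first CpG and using half-open intervals are assumptions, as are the function names `cpg_blocks` and `tile_block`.

```python
from typing import List, Tuple


def cpg_blocks(cpg_positions: List[int], min_gap: int = 500) -> List[List[int]]:
    """Split CpG positions into blocks; a gap of at least min_gap starts a new block."""
    blocks: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(cpg_positions):
        if current and pos - current[-1] >= min_gap:
            blocks.append(current)
            current = []
        current.append(pos)
    if current:
        blocks.append(current)
    return blocks


def tile_block(block: List[int], region_len: int = 1000,
               step: int = 500) -> List[Tuple[int, int]]:
    """Cover a block with half-open regions of region_len bp that overlap by region_len - step."""
    start, last_cpg = block[0], block[-1]
    regions: List[Tuple[int, int]] = []
    pos = start
    while True:
        regions.append((pos, pos + region_len))
        if pos + region_len > last_cpg:
            break  # the last CpG of the block is now covered
        pos += step
    return regions
```

  • With the default 1000 bp length and 500 bp step, neighboring regions share 50% of their span, as in the example given in the text.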
  • the analytics system can split the training set into K subsets or folds to be used in a K-fold cross-validation.
  • the folds can be balanced for cancer/non-cancer status, tissue of origin, cancer stage, age (e.g., grouped in 10-yr buckets), and/or smoking status.
  • the training set is split into 5 folds, whereby 5 separate classifiers are trained, in each case training on 4/5 of the training samples and using the remaining 1/5 for validation.
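  • A balanced K-fold split of this kind can be sketched without any external library. The round-robin assignment within each stratum is one simple balancing scheme chosen for illustration, not necessarily the one used; the stratum label could combine any of the factors listed above, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def stratified_folds(strata: Sequence[Hashable], k: int = 5) -> List[List[int]]:
    """Assign sample indices to k folds, spreading each stratum round-robin."""
    by_stratum: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, stratum in enumerate(strata):
        by_stratum[stratum].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    slot = 0
    for stratum in sorted(by_stratum, key=str):
        for idx in by_stratum[stratum]:
            folds[slot % k].append(idx)
            slot += 1
    return folds


def train_validation_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, held-out indices) once per fold."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]
```

  • With k = 5, each of the five classifiers is trained on the four folds returned as `train` and validated on the remaining fold.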
  • the analytics system can, for each cancer type (and for healthy cfDNA), fit a probabilistic model to the fragments deriving from the samples of that type.
  • a “probabilistic model” is any mathematical model capable of assigning a probability to a sequence read based on methylation status at one or more sites on the read.
  • the analytics system fits the probabilistic model to sequence reads derived from one or more samples from subjects having a known disease, and the fitted model can be used to determine sequence read probabilities indicative of a disease state utilizing methylation information or methylation state vectors. In particular, in some cases, the analytics system determines observed rates of methylation for each CpG site within a sequence read.
  • the rate of methylation represents a fraction or percentage of base pairs that are methylated within a CpG site.
  • the trained probabilistic model can be parameterized by products of the rates of methylation.
  • any known probabilistic model for assigning probabilities to sequence reads from a sample can be used.
  • the probabilistic model can be a binomial model, in which every site (e.g., CpG site) on a nucleic acid fragment is assigned a probability of methylation, or an independent sites model, in which each CpG’s methylation is specified by a distinct methylation probability with methylation at one site assumed to be independent of methylation at one or more other sites on the nucleic acid fragment.
  • the probabilistic model is a Markov model, in which the probability of methylation at each CpG site is dependent on the methylation state at some number of preceding CpG sites in the sequence read, or nucleic acid molecule from which the sequence read is derived. See, e.g., U.S. Pat. Appl. No. 16/352,602, entitled “Anomalous Fragment Detection and Classification,” and filed March 13, 2019, which is incorporated by reference in its entirety herein and can be used for various embodiments.
  • the probabilistic model is a “mixture model” fitted using a mixture of components from underlying models.
  • the mixture components can be determined using multiple independent sites models, where methylation (e.g., rates of methylation) at each CpG site is assumed to be independent of methylation at other CpG sites.
  • the probability assigned to a sequence read, or the nucleic acid molecule from which it derives is the product of the methylation probability at each CpG site where the sequence read is methylated and one minus the methylation probability at each CpG site where the sequence read is unmethylated.
  • the analytics system determines rates of methylation of each of the mixture components.
  • the mixture model is parameterized by a sum of the mixture components each associated with a product of the rates of methylation.
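  • a probabilistic model Pr of n mixture components can be represented as:
$$\Pr(\text{fragment} \mid \{\beta_{ki}, f_k\}) = \sum_{k=1}^{n} f_k \prod_{i} \beta_{ki}^{m_i} (1 - \beta_{ki})^{1 - m_i}$$
  • $m_i \in \{0, 1\}$ represents the fragment’s observed methylation status at position i of a reference genome, with 0 indicating unmethylation and 1 indicating methylation; $f_k$ is the fractional assignment to mixture component k, with $f_k \ge 0$ and $\sum_k f_k = 1$.
  • the probability of methylation at position i in a CpG site of mixture component k is $\beta_{ki}$; thus, the probability of unmethylation is $1 - \beta_{ki}$.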
  • the number of mixture components n can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, etc.
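  • As a worked illustration of the mixture model described above (a weighted sum over components, each contributing a product of per-site methylation or unmethylation probabilities), the probability of a single fragment can be computed as sketched below. The function name, the dictionary-based data layout, and the example numbers are assumptions.

```python
from typing import Dict, List


def fragment_probability(m: Dict[int, int],
                         beta: List[Dict[int, float]],
                         f: List[float]) -> float:
    """Mixture probability of a fragment.

    m maps each CpG position i to the observed state (1 methylated, 0 unmethylated),
    beta[k][i] is the methylation probability at position i in component k,
    and f[k] is the weight of component k (weights sum to 1).
    """
    total = 0.0
    for f_k, beta_k in zip(f, beta):
        product = 1.0
        for i, m_i in m.items():
            product *= beta_k[i] if m_i == 1 else 1.0 - beta_k[i]
        total += f_k * product
    return total


# Fragment <M23, U24, M25> under a two-component model with equal weights.
fragment = {23: 1, 24: 0, 25: 1}
components = [{23: 0.9, 24: 0.1, 25: 0.9}, {23: 0.5, 24: 0.5, 25: 0.5}]
p = fragment_probability(fragment, components, [0.5, 0.5])
# 0.5 * (0.9 * 0.9 * 0.9) + 0.5 * (0.5 * 0.5 * 0.5) is approximately 0.427
```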
  • the analytics system fits the probabilistic model using maximum-likelihood estimation to identify a set of parameters $\{\beta_{ki}, f_k\}$ that maximizes the log-likelihood of all fragments deriving from a disease state, subject to a regularization penalty applied to each methylation probability with regularization strength r.
  • the maximized quantity for N total fragments can be represented as:
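  • The expression itself is not reproduced in this text. One form consistent with the description (the log-likelihood summed over the N fragments, plus a penalty of strength r on every methylation probability) is given below. The logarithmic penalty, which discourages methylation probabilities of exactly 0 or 1, is an assumption made for illustration and should not be read as the disclosed formula.

```latex
% Assumed form: log-likelihood of N fragments plus a penalty of strength r
% applied to each methylation probability \beta_{ki}.
\sum_{j=1}^{N} \ln \Pr\left(\text{fragment}_j \mid \{\beta_{ki}, f_k\}\right)
  + r \sum_{k,i} \ln\left(\beta_{ki}\,(1 - \beta_{ki})\right)
```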
  • the analytics system performs fits separately for each cancer type and for healthy cfDNA.
  • other means can be used to fit the probabilistic models or to identify parameters that maximize the log-likelihood of all sequence reads derived from the reference samples.
  • Bayesian fitting (using, e.g., Markov chain Monte Carlo), in which each parameter is not assigned a single value but instead is associated with a distribution, is used.
  • gradient-based optimization in which the gradient of the likelihood (or log-likelihood) with respect to the parameter values is used to step through parameter space towards an optimum, is used.
  • expectation-maximization in which a set of latent parameters (such as identities of the mixture component from which each fragment is derived) are set to their expected values under the previous model parameters, and then the model’s parameters are assigned to maximize the likelihood conditional on the assumed values of those latent variables. The two-step process is then repeated until convergence.
  • the analytics system can generate features for each sample in the training set. For example, for each sample (regardless of label), in each region, for each cancer type, for each fragment, the analytics system can evaluate the log-likelihood ratio R with the fitted probabilistic models according to:
  • the analytics system can count the number of fragments with $R_{\text{cancer type}} > \text{tier}$ and assign those counts as non-negative integer-valued features.
  • the tiers include threshold values of 1, 2, 3, 4, 5, 6, 7, 8, and 9, resulting in each region hosting 9 features per cancer type.
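  • The tiered counting can be sketched as below. The expression for R is not reproduced in this text; the sketch assumes R is the natural log of the fragment's probability under the cancer-type model divided by its probability under the healthy cfDNA model. That choice of denominator and the function names are assumptions.

```python
import math
from typing import Dict, Iterable, List


def log_likelihood_ratio(p_cancer_type: float, p_healthy: float) -> float:
    """Assumed form of R: ln of Pr(fragment | cancer type) over Pr(fragment | healthy cfDNA)."""
    return math.log(p_cancer_type / p_healthy)


def tier_features(ratios: List[float],
                  tiers: Iterable[int] = range(1, 10)) -> Dict[int, int]:
    """For one region and one cancer type, count fragments whose R exceeds each tier."""
    return {tier: sum(1 for r in ratios if r > tier) for tier in tiers}
```

  • With tiers 1 through 9, each region yields nine non-negative integer counts per cancer type, one per threshold.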
  • the analytics system can select certain features for inclusion in a feature vector for each sample. For example, for each pair of distinct cancer types, the analytics system can specify one type as the “positive type” and the other as the “negative type” and rank the features by their ability to distinguish those types. In some cases, the ranking is based on mutual information calculated by the analytics system. For example, the mutual information can be calculated using the estimated fraction of samples of the positive type and negative type (e.g., cancer types A and B) for which the feature is expected to be nonzero in a resulting assay. For instance, if a feature occurs frequently in healthy cfDNA, the analytics system estimates that the feature also occurs frequently in cfDNA associated with various types of cancer, so the feature is unlikely to distinguish between them.
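  • the mutual information I(X; Y) can be computed as follows, where variable X is a certain feature (e.g., binary) and variable Y represents a disease state, e.g., cancer type A or B:
$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$$
  • the joint probability mass function of X and Y is p(x, y) and the marginal probability mass functions are p(x) and p(y).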
  • $f_A$ is the probability of observing the feature in ctDNA samples from tumor (or high-signal cfDNA samples) associated with cancer type A
  • $f_H$ is the probability of observing the feature in a healthy or non-cancer cfDNA sample.
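  • For a binary feature and a pair of cancer types, the mutual information can be computed directly from the two estimated occurrence fractions. The sketch below uses the standard definition of mutual information between a feature X and a class label Y; the equal class prior, the use of base-2 logarithms, and the function name are assumptions.

```python
import math


def mutual_information(f_pos: float, f_neg: float, prior_pos: float = 0.5) -> float:
    """Mutual information (in bits) between a binary feature and a two-class label.

    f_pos and f_neg are the estimated fractions of positive-type and negative-type
    samples in which the feature is nonzero.
    """
    joint = {
        (1, "pos"): prior_pos * f_pos,
        (0, "pos"): prior_pos * (1.0 - f_pos),
        (1, "neg"): (1.0 - prior_pos) * f_neg,
        (0, "neg"): (1.0 - prior_pos) * (1.0 - f_neg),
    }
    p_x = {x: joint[(x, "pos")] + joint[(x, "neg")] for x in (0, 1)}
    p_y = {"pos": prior_pos, "neg": 1.0 - prior_pos}
    mi = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # terms with zero probability contribute nothing
            mi += p * math.log2(p / (p_x[x] * p_y[y]))
    return mi
```

  • A feature present in every positive-type sample and no negative-type sample scores 1 bit, while a feature equally frequent in both types scores 0 and would rank last.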
  • only features corresponding to the positive type are included in the ranking, and only when those features’ predicted rate of occurrence is greater in the positive type than in the negative type. For example, if “liver” is the positive type and “breast” is the negative type, then only “liver x” features are considered, and only if their estimated occurrence in liver cfDNA is greater than their estimated occurrence in breast cfDNA. Further, in some examples, for each region, for each cancer type pair (including non-cancer as a negative type), the analytics system keeps only the best performing tier. Further, in some examples, the analytics system transforms feature values by binarization, whereby any feature value greater than 0 is set to 1, such that all features are either 0 or 1.
  • the analytics system trains a multinomial logistic regression classifier on the training data for a fold, and generates predictions for the held-out data. For example, for each of the K folds, one logistic regression can be trained for each combination of hyperparameters.
  • hyperparameters can include L2 penalty and/or topK (e.g., the number of high-ranking regions to keep per tissue type pair (including non-cancer), as ranked by the mutual information procedure outlined above).
  • performance is evaluated on the cross-validated predictions of the full training set, and the set of hyperparameters with the best performance is selected for retraining on the full training set.
  • the analytics system uses log-loss as a performance metric, whereby the log-loss is calculated by taking the negative logarithm of the prediction for the correct label for each sample, and then summing over samples (i.e. a perfect prediction of 1.0 for the correct label would give a log-loss of 0).
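  • The log-loss metric and the selection of the best hyperparameter setting can be sketched as follows. The dictionary layout of the predictions, the use of the natural logarithm, and the function names are assumptions made for the example.

```python
import math
from typing import Dict, Hashable, List, Sequence


def log_loss(predictions: Sequence[Dict[str, float]], labels: Sequence[str]) -> float:
    """Sum over samples of the negative log of the probability given to the correct label."""
    return -sum(math.log(probs[label]) for probs, label in zip(predictions, labels))


def best_hyperparameters(cv_predictions: Dict[Hashable, List[Dict[str, float]]],
                         labels: Sequence[str]) -> Hashable:
    """Pick the hyperparameter setting whose cross-validated predictions have the lowest log-loss."""
    return min(cv_predictions, key=lambda setting: log_loss(cv_predictions[setting], labels))
```

  • Here `cv_predictions` is keyed by a hyperparameter setting, such as a hypothetical `(l2_penalty, top_k)` tuple, and holds the held-out predictions for every training sample pooled across the K folds.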
  • the analytics system trains a two-stage classifier.
  • the analytics system trains a binary cancer classifier to distinguish between the labels, cancer and non-cancer, based on the feature vectors of the training samples.
  • the binary classifier outputs a prediction score indicating the likelihood of the presence or absence of cancer.
  • the analytics system trains a multiclass cancer classifier to distinguish between many cancer types.
  • the cancer classifier is trained to determine a cancer prediction that comprises a prediction value for each of the cancer types being classified for.
  • the prediction values can correspond to a likelihood that a given sample has each of the cancer types.
  • the cancer classifier returns a cancer prediction including a prediction value for breast cancer, lung cancer, and non-cancer.
  • the cancer classifier may return a cancer prediction for a test sample including a prediction score for breast cancer, lung cancer, and/or no cancer.
  • the analytics system can train the cancer classifier according to any one of a number of methods.
  • the binary cancer classifier may be an L2-regularized logistic regression classifier that is trained using a log-loss function.
  • the multi-cancer (TOO) classifier may be a multinomial logistic regression. In practice, either type of cancer classifier may be trained using other techniques. These techniques are numerous and include kernel methods and machine learning algorithms such as multilayer neural networks. In particular, methods as described in PCT/US2019/022122 and U.S. Pat. Appl. No. 16/352,602, which are incorporated by reference in their entireties herein, can be used for various embodiments.
  • the TOO classifier is trained only on cancer samples that were successfully called as cancer by the binary classifier, thereby ensuring sufficient cancer signal in the cancer sample.
  • the binary classifier is trained on the training samples regardless of TOO.
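  • The two-stage arrangement can be sketched as plain control flow around two already-computed scores. The threshold values, the label strings, and the function names below are hypothetical; the sketch only illustrates selecting TOO training samples that the binary classifier called as cancer, and thresholding at prediction time.

```python
from typing import Dict, List, Sequence


def too_training_indices(binary_scores: Sequence[float], is_cancer: Sequence[bool],
                         cancer_threshold: float) -> List[int]:
    """Keep cancer samples that the binary classifier successfully called as cancer."""
    return [i for i, (score, cancer) in enumerate(zip(binary_scores, is_cancer))
            if cancer and score >= cancer_threshold]


def two_stage_call(binary_score: float, too_scores: Dict[str, float],
                   cancer_threshold: float, too_threshold: float) -> str:
    """Call non-cancer, a tissue of origin, or an indeterminate tissue of origin."""
    if binary_score < cancer_threshold:
        return "non-cancer"
    top = max(too_scores, key=too_scores.get)
    if too_scores[top] < too_threshold:
        return "cancer, indeterminate TOO"
    return top
```

  • In this sketch the second threshold plays the role of the indeterminate thresholding mentioned earlier: a sample called as cancer is only assigned a tissue of origin when the top TOO score is sufficiently high.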
  • FIG. 10A is a flowchart of systems and devices for sequencing nucleic acid samples according to one embodiment.
  • This illustrative flowchart includes devices such as a sequencer 820 and an analytics system 800.
  • the sequencer 820 and the analytics system 800 may work in tandem to perform one or more steps in the processes described herein.
  • the sequencer 820 receives an enriched nucleic acid sample 810.
  • the sequencer 820 can include a graphical user interface 825 that enables user interactions with particular tasks (e.g., initiate sequencing or terminate sequencing) as well as one or more loading stations 830 for loading a sequencing cartridge including the enriched fragment samples and/or for loading necessary buffers for performing the sequencing assays. Therefore, once a user of the sequencer 820 has provided the necessary reagents and sequencing cartridge to the loading station 830 of the sequencer 820, the user can initiate sequencing by interacting with the graphical user interface 825 of the sequencer 820. Once initiated, the sequencer 820 performs the sequencing and outputs the sequence reads of the enriched fragments from the nucleic acid sample 810.
  • the sequencer 820 is communicatively coupled with the analytics system 800.
  • the analytics system 800 includes some number of computing devices used for processing the sequence reads for various applications such as assessing methylation status at one or more CpG sites, variant calling or quality control.
  • the sequencer 820 may provide the sequence reads in a BAM file format to the analytics system 800.
  • the analytics system 800 can be communicatively coupled to the sequencer 820 through a wireless, wired, or a combination of wireless and wired communication technologies.
  • the analytics system 800 is configured with a processor and non-transitory computer-readable storage medium storing computer instructions that, when executed by the processor, cause the processor to process the sequence reads or to perform one or more steps of any of the methods or processes disclosed herein.
  • the sequence reads may be aligned to a reference genome using known methods in the art to determine alignment position information.
  • Alignment position may generally describe a beginning position and an end position of a region in the reference genome that corresponds to a beginning nucleotide base and an end nucleotide base of a given sequence read.
  • the alignment position information may be generalized to indicate a first CpG site and a last CpG site included in the sequence read according to the alignment to the reference genome.
  • the alignment position information may further indicate methylation statuses and locations of all CpG sites in a given sequence read.
  • a region in the reference genome may be associated with a gene or a segment of a gene; as such, the analytics system 800 may label a sequence read with one or more genes that align to the sequence read.
  • fragment length (or size) is determined from the beginning and end positions.
  • a sequence read is comprised of a read pair denoted as R_1 and R_2.
  • the first read R_1 may be sequenced from a first end of a double-stranded DNA (dsDNA) molecule whereas the second read R_2 may be sequenced from the second end of the double-stranded DNA (dsDNA). Therefore, nucleotide base pairs of the first read R_1 and second read R_2 may be aligned consistently (e.g., in opposite orientations) with nucleotide bases of the reference genome.
  • Alignment position information derived from the read pair R_1 and R_2 may include a beginning position in the reference genome that corresponds to an end of a first read (e.g., R_1) and an end position in the reference genome that corresponds to an end of a second read (e.g., R_2).
  • the beginning position and end position in the reference genome represent the likely location within the reference genome to which the nucleic acid fragment corresponds.
  • the read pair R_1 and R_2 can be assembled into a fragment, and the fragment used for subsequent analysis and/or classification.
  • an output file in sequence alignment map (SAM) format or binary (BAM) format may be generated and output for further analysis.
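  • By way of non-limiting illustration, the derivation of fragment position and length from an aligned read pair may be sketched as follows. The `AlignedRead` type, the 0-based half-open coordinate convention, and the function name are assumptions made for this example only and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedRead:
    chrom: str
    start: int  # leftmost aligned reference position (0-based, inclusive; assumed convention)
    end: int    # rightmost aligned reference position (exclusive; assumed convention)


def assemble_fragment(r1: AlignedRead, r2: AlignedRead):
    """Return (chrom, start, end, length) of the fragment spanned by a read pair."""
    if r1.chrom != r2.chrom:
        raise ValueError("mates align to different chromosomes")
    start = min(r1.start, r2.start)  # beginning position of the fragment
    end = max(r1.end, r2.end)        # end position of the fragment
    return r1.chrom, start, end, end - start


fragment = assemble_fragment(
    AlignedRead("chr1", 1000, 1100),  # R_1, sequenced from one end
    AlignedRead("chr1", 1080, 1167),  # R_2, sequenced from the other end
)
```

  • In this sketch the fragment length is simply the difference between the end and beginning positions, consistent with the description above.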
  • FIG. 14B is a block diagram of an analytics system 800 for processing DNA samples according to one embodiment.
  • the analytics system implements one or more computing devices for use in analyzing DNA samples.
  • the analytics system 800 includes a sequence processor 840, sequence database 845, model database 855, models 850, parameter database 865, and score engine 860.
  • the analytics system 800 performs one or more steps in the processes 300 of FIG. 3A, 340 of FIG. 3B, 400 of FIG. 4, 500 of FIG. 5, 600 of FIG. 6A, or 680 of FIG. 6B and other processes described herein.
  • the sequence processor 840 generates methylation state vectors for fragments from a sample. At each CpG site on a fragment, the sequence processor 840 generates a methylation state vector for each fragment specifying a location of the fragment in the reference genome, a number of CpG sites in the fragment, and the methylation state of each CpG site in the fragment whether methylated, unmethylated, or indeterminate via the process 300 of FIG. 3A.
  • the sequence processor 840 may store methylation state vectors for fragments in the sequence database 845. Data in the sequence database 845 may be organized such that the methylation state vectors from a sample are associated to one another.
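  • As an illustrative sketch only, a methylation state vector of the kind described above might be represented by a small immutable record. The class name, the field names, and the single-letter state codes are hypothetical choices for this example; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed single-letter codes for the three states named in the text.
METHYLATED, UNMETHYLATED, INDETERMINATE = "M", "U", "I"


@dataclass(frozen=True)
class MethylationStateVector:
    chrom: str               # location of the fragment in the reference genome
    first_cpg_index: int     # index of the first CpG site covered (hypothetical indexing scheme)
    states: Tuple[str, ...]  # methylation state of each CpG site in the fragment

    def __post_init__(self):
        allowed = {METHYLATED, UNMETHYLATED, INDETERMINATE}
        invalid = [s for s in self.states if s not in allowed]
        if invalid:
            raise ValueError(f"unknown methylation states: {invalid}")

    @property
    def n_cpgs(self) -> int:
        """Number of CpG sites in the fragment."""
        return len(self.states)

    def count(self, state: str) -> int:
        """Number of CpG sites in the given state."""
        return self.states.count(state)


vector = MethylationStateVector("chr1", 23, ("M", "M", "U", "I"))
```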
  • models 850 may be stored in the model database 855 or retrieved for use with test samples.
  • a model is a trained cancer classifier for determining a cancer prediction for a test sample using a feature vector derived from anomalous fragments. The training and use of the cancer classifier are discussed elsewhere herein.
  • the analytics system 800 may train the one or more models 850 and store various trained parameters in the parameter database 865.
  • the analytics system 800 stores the models 850 along with functions in the model database 855.
  • the score engine 860 uses the one or more models 850 to return outputs.
  • the score engine 860 accesses the models 850 in the model database 855 along with trained parameters from the parameter database 865.
  • the score engine receives an appropriate input for the model and calculates an output based on the received input, the parameters, and a function of each model relating the input and the output.
  • the score engine 860 further calculates metrics correlating to a confidence in the calculated outputs from the model.
  • the score engine 860 calculates other intermediary values for use in the model.
  • the methods, analytic systems and/or classifier of the present invention can be used to detect the presence (or absence) of cancer, monitor cancer progression or recurrence, monitor therapeutic response or effectiveness, determine a presence or monitor minimum residual disease (MRD), or any combination thereof.
  • the analytic systems and/or classifier may be used to identify the tissue of origin for a cancer.
  • the systems and/or classifiers may be used to identify a cancer as any of the following cancer types: breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, anal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, liver/bile-duct cancer, esophageal cancer, pancreatic cancer, squamous cell cancer of the upper gastrointestinal tract, upper gastrointestinal cancer other than squamous, head and neck cancer, lung cancer,
  • a classifier can be used to generate a likelihood or probability score (e.g., from 0 to 100) that a sample feature vector is from a subject with cancer.
  • the probability score is compared to a threshold probability to determine whether or not the subject has cancer.
  • the likelihood or probability score can be assessed at different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy).
  • the likelihood or probability score can be used to make or influence a clinical decision (e.g., detection of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, if the likelihood or probability score exceeds a threshold, a physician can prescribe an appropriate treatment.
  • the methods and/or classifier of the present invention are used to detect a cancer type in a subject suspected of having cancer.
  • a classifier (as described herein) can be used to determine a likelihood or probability score that a sample feature vector is from a subject that has a cancer type.
  • a probability score of greater than or equal to 60 can indicate that the subject has the cancer type.
  • a probability score can indicate the severity of disease. For example, a probability score of 80 may indicate a more severe form, or later stage, of cancer compared to a score below 80 (e.g., a score of 70). Similarly, an increase in the probability score over time (e.g., at a second, later time point) can indicate disease progression or a decrease in the probability score over time (e.g., at a second, later time point) can indicate successful treatment.
  • a cancer log-odds ratio can be calculated for a test subject by taking the log of a ratio of a probability of being a cancer type over a probability of not being the cancer type (i.e., one minus the probability of being the cancer type), as described herein.
  • a cancer log-odds ratio greater than 1 can indicate that the subject has a cancer type.
  • a cancer type log-odds ratio greater than 1.2, greater than 1.3, greater than 1.4, greater than 1.5, greater than 1.7, greater than 2, greater than 2.5, greater than 3, greater than 3.5, or greater than 4, indicates that the subject has the cancer type.
  • a cancer log-odds ratio can indicate the severity of disease.
  • a cancer log-odds ratio greater than 2 may indicate a more severe form, or later stage, of a form of cancer compared to a score below 2 (e.g., a score of 1).
  • an increase in the cancer log-odds ratio over time (e.g., at a second, later time point) can indicate disease progression or a decrease in the cancer log-odds ratio over time can indicate successful treatment.
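  • For illustration, the cancer log-odds ratio described above may be computed as sketched below. The natural logarithm is an assumption (the base is not stated here), as are the function names; a probability score reported on a 0 to 100 scale would first be divided by 100.

```python
import math


def cancer_log_odds(p_cancer_type: float) -> float:
    """Log of P(cancer type) / (1 - P(cancer type)); natural log assumed."""
    if not 0.0 < p_cancer_type < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    return math.log(p_cancer_type / (1.0 - p_cancer_type))


def log_odds_change(p_first: float, p_second: float) -> float:
    """Change in log-odds between a first and a second, later time point."""
    return cancer_log_odds(p_second) - cancer_log_odds(p_first)


baseline = cancer_log_odds(0.80)          # first time point
follow_up = cancer_log_odds(0.60)         # second, later time point
change = log_odds_change(0.80, 0.60)      # negative: the log-odds decreased
```

  • Under these assumptions, a negative change corresponds to a decrease in the log-odds ratio over time, and a positive change to an increase.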
  • the methods and systems of the present invention can be trained to detect or classify multiple cancer indications.
  • the methods, systems and classifiers of the present invention can be used to detect the presence of one or more, two or more, three or more, five or more, or ten or more different types of cancer.
  • the cancer is one or more of head and neck cancer, liver/bile duct cancer, upper GI cancer, pancreatic/gallbladder cancer; colorectal cancer, ovarian cancer, lung cancer, multiple myeloma, lymphoid neoplasms, melanoma, sarcoma, breast cancer, and uterine cancer.
  • the cancer is one or more of anorectal cancer, bladder or urothelial cancer, or cervical cancer.
  • the cancer is one or more of breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, anal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, liver/bile-duct cancer, esophageal cancer, pancreatic cancer, squamous cell cancer of the upper gastrointestinal tract, upper gastrointestinal cancer other than squamous, head and neck cancer, lung cancer, lung adenocarcinoma, small cell lung cancer, squamous cell lung cancer and cancer other than adenocarcinoma or small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, plasma cell neoplasm, multiple myeloma, myeloid neoplasm, lymphoma,
  • the likelihood or probability score can be assessed at different time points (e.g., before or after treatment) to monitor disease progression or to monitor treatment effectiveness (e.g., therapeutic efficacy).
  • the present disclosure provides methods that involve obtaining a first sample (e.g., a first plasma cfDNA sample) from a cancer patient at a first time point, determining a first likelihood or probability score therefrom (as described herein), obtaining a second test sample (e.g., a second plasma cfDNA sample) from the cancer patient at a second time point, and determining a second likelihood or probability score therefrom (as described herein).
  • information obtained from any method described herein can be used to make or influence a clinical decision (e.g., diagnosis of cancer, treatment selection, assessment of treatment effectiveness, etc.). For example, in one embodiment, if the likelihood or probability score exceeds a threshold, a physician can prescribe an appropriate treatment (e.g., a resection surgery, radiation therapy, chemotherapy, and/or immunotherapy). In some embodiments, information such as a likelihood or probability score can be provided as a readout to a physician or subject.
  • a classifier (as described herein) can be used to determine a likelihood or probability score that a sample feature vector is from a subject that has cancer or a particular type of cancer (e.g., tissue of origin).
  • an appropriate treatment (e.g., resection surgery or therapeutic) is prescribed when the likelihood or probability exceeds a threshold. For example, in one embodiment, if the likelihood or probability score is greater than or equal to 60, one or more appropriate treatments are prescribed. In other embodiments, if the likelihood or probability score is greater than or equal to 65, greater than or equal to 70, greater than or equal to 75, greater than or equal to 80, greater than or equal to 85, greater than or equal to 90, or greater than or equal to 95, one or more appropriate treatments are prescribed.
  • a cancer log-odds ratio can indicate the effectiveness of a cancer treatment. For example, an increase in the cancer log-odds ratio over time (e.g., at a second time point, after treatment) can indicate that the treatment was not effective. Similarly, a decrease in the cancer log-odds ratio over time (e.g., at a second time point, after treatment) can indicate successful treatment. In another embodiment, if the cancer log-odds ratio is greater than 1, greater than 1.5, greater than 2, greater than 2.5, greater than 3, greater than 3.5, or greater than 4, one or more appropriate treatments are prescribed.
  • the treatment is one or more cancer therapeutic agents selected from the group consisting of a chemotherapy agent, a targeted cancer therapy agent, a differentiating therapy agent, a hormone therapy agent, and an immunotherapy agent.
  • the treatment can be one or more chemotherapy agents selected from the group consisting of alkylating agents, antimetabolites, anthracyclines, anti-tumor antibiotics, cytoskeletal disruptors (taxanes), topoisomerase inhibitors, mitotic inhibitors, corticosteroids, kinase inhibitors, nucleotide analogs, platinum-based agents and any combination thereof.
  • the treatment is one or more targeted cancer therapy agents selected from the group consisting of signal transduction inhibitors (e.g.
  • the treatment is one or more differentiating therapy agents including retinoids, such as tretinoin, alitretinoin and bexarotene.
  • the treatment is one or more hormone therapy agents selected from the group consisting of anti-estrogens, aromatase inhibitors, progestins, estrogens, anti-androgens, and GnRH agonists or analogs.
  • the treatment is one or more immunotherapy agents selected from the group comprising monoclonal antibody therapies such as rituximab (RITUXAN) and alemtuzumab (CAMPATH), non-specific immunotherapies and adjuvants, such as BCG, interleukin-2 (IL-2), and interferon-alfa, immunomodulating drugs, for instance, thalidomide and lenalidomide (REVLIMID).
  • It is within the capabilities of a skilled physician or oncologist to select an appropriate cancer therapeutic agent based on characteristics such as the type of tumor, cancer stage, previous exposure to cancer treatment or therapeutic agent, and other characteristics of
  • each probe was scored based on the number of off-target regions. The best probes have a score of 1, meaning they match in only one place (high Q). Probes with a low score of 2-19 hits (low Q) were accepted, but probes with a poor score of more than 20 hits (poor Q) were discarded. Other cutoff values can be used for specific samples.
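  • The probe quality tiers described above may be sketched, purely as an example, as a small classification function. The function names are hypothetical. The text states cutoffs of 2-19 hits and more than 20 hits, so the handling of a probe with exactly 20 hits is not specified; this sketch assumes such a probe is treated as poor.

```python
def probe_quality(n_hits: int) -> str:
    """Classify a probe by the number of places it matches in the genome."""
    if n_hits < 1:
        raise ValueError("a probe must match at least its intended target")
    if n_hits == 1:
        return "high"  # matches in only one place (high Q)
    if n_hits < 20:
        return "low"   # 2-19 hits (low Q), accepted
    return "poor"      # 20 or more hits; treating exactly 20 as poor is an assumption


def keep_probe(n_hits: int) -> bool:
    """High-Q and low-Q probes are accepted; poor-Q probes are discarded."""
    return probe_quality(n_hits) != "poor"


accepted = [hits for hits in (1, 3, 19, 25, 140) if keep_probe(hits)]
```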
  • Cancer types. Cancer-specific panels were designed to detect cancer and/or cancer tissue of origin of fifteen (15) different cancer types.
  • the 15 cancer types include (1) bladder cancer,
  • The Circulating Cell-free Genome Atlas Study (“CCGA”; ClinicalTrials.gov identifier NCT02889978) is a prospective, multi-center, case-control, observational study with
  • biospecimens were collected from approximately 15,000 participants from 142 sites. Samples were selected to ensure a prespecified distribution of cancer types and non-cancers across sites in each cohort, and cancer and non-cancer samples were frequency age-matched by gender.
  • NCI National Cancer Institute
  • NHGRI National Human Genome Research Institute
  • DTC Dissociated tumor cells
  • Non-cancer cells were provided by Yuval Dor and Ben Glaser (Hebrew University) and originated from human tissue obtained from standard clinical procedures. For example, breast luminal and basal epithelial cells were from breast reduction surgery; colon epithelial cells were from tissue near the site of re-implantation following segmental resection for localized colon pathology; bone marrow cells were from joint replacement surgery; vascular and arterial endothelial cells were from vascular surgery; and head and neck epithelium was from
  • WGBS was performed on more than 1000 genomic DNA samples collected from healthy individuals and individuals diagnosed with cancers of various stages and tissues of origin.
  • the samples included formaldehyde-fixed, paraffin-embedded (FFPE) tissue blocks, disseminated tumor cells (DTC) from cancers of different TOOs, bone marrow mononuclear cells (BMMC), white blood cells (WBC) and peripheral blood mononuclear cells (PBMC).
  • the DTCs were subjected to negative selection to remove WBCs, fibroblasts, and endothelial cells using a negative selection kit (Miltenyi) prior to gDNA isolation.
  • the negative selection yielded purified tumor cells that allowed differentially methylated regions to be more clearly identified.
  • the TCGA data was collected by hybridization of bisulfite-converted DNA fragments from 8809 samples to methylation-sensitive oligonucleotide arrays. β-values from this study represent the relative abundance of methylation at 480,000 individual CpG sites. 75,000 of these CpG sites were analyzed after excluding CpGs from noisy genomic regions (360,000) and CpG sites with cross-hybridizing probes (45,000). The TCGA data was analyzed using different algorithms because it describes methylation of individual CpG sites, whereas WGBS data reveals the methylation pattern of strings of adjacent CpG sites on DNA fragments.
  • Tissue of Origin classes. Each sample was categorized into one of twenty-five (25) different Tissue of Origin (TOO) classes: breast cancer, uterine cancer, cervical cancer, ovarian cancer, bladder cancer, urothelial cancer of renal pelvis, renal cancer other than urothelial, prostate cancer, anorectal cancer, colorectal cancer, hepatobiliary cancer arising from hepatocytes, hepatobiliary cancer arising from cells other than hepatocytes, pancreatic cancer, squamous cell cancer of the upper gastrointestinal tract, upper gastrointestinal cancer other than squamous, head and neck cancer, lung adenocarcinoma, small cell lung cancer, neuroendocrine cancer, melanoma, thyroid cancer, sarcoma, multiple myeloma, lymphoma, and leukemia.
  • Leukemia LAML Leukemia LAML, LCML 140
  • Region selection. For target selection, fragments having abnormal methylation patterns in cancer samples were selected using one or more methods as described herein. Use of these methods allowed identification of low noise regions as putative targets. Among the low noise regions, fragments most informative in discriminating cancer types were ranked and selected.
  • fragment sequences in the database were filtered based on p-value using a non-cancer distribution, and only fragments with p < 0.001 were retained, as described herein.
  • the selected cfDNAs were further filtered to retain only those that were at least 90% methylated or 90% unmethylated.
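  • As a minimal sketch of the two filters described above (p-value below 0.001, then at least 90% methylated or at least 90% unmethylated), assuming each fragment carries a precomputed p-value and a string of per-CpG states. The `Fragment` type, the "M"/"U" state codes, and the choice to ignore any other (indeterminate) characters when computing the fraction are assumptions of this example.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    p_value: float  # p-value under a non-cancer distribution, computed elsewhere
    states: str     # per-CpG states, e.g. "MMMUM" (assumed encoding)


def is_retained(frag: Fragment, p_cutoff: float = 0.001, extreme: float = 0.90) -> bool:
    """Keep fragments with p < p_cutoff that are >= 90% methylated or >= 90% unmethylated."""
    if frag.p_value >= p_cutoff:
        return False
    n_methylated = frag.states.count("M")
    n_unmethylated = frag.states.count("U")
    n_determined = n_methylated + n_unmethylated
    if n_determined == 0:
        return False
    return (n_methylated >= extreme * n_determined
            or n_unmethylated >= extreme * n_determined)


fragments = [
    Fragment(0.0005, "MMMMMMMMMM"),  # rare and fully methylated: retained
    Fragment(0.0005, "MMMMMUUUUU"),  # rare but mixed: dropped
    Fragment(0.0100, "MMMMMMMMMM"),  # fully methylated but not rare: dropped
    Fragment(0.0001, "UUUUU"),       # rare and fully unmethylated: retained
]
retained = [f for f in fragments if is_retained(f)]
```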
  • CpG sites were ranked based on their information gain, comparing (i) the numbers of samples of a specific TOO or other samples, including both non-cancer samples and samples of a different TOO, (ii) the numbers of samples of a specific TOO or non-cancer samples, and/or (iii) the numbers of samples of a specific TOO or a different TOO that include fragments overlapping that CpG site.
  • the process was applied to each of the 25 TOOs and the comparison was done for all pairwise combinations for 25 TOOs. For example, P (cancer of a TOO
  • genomic regions selected by the pairwise comparisons included genomic regions differentially methylated to separate a target TOO and a contrast TOO.
  • the numbers of genomic regions for differentiating each target TOO (x-axis) from a contrast TOO (y-axis) are provided in FIG. 11.
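  • The information-gain ranking of CpG sites mentioned above can be illustrated with a standard entropy-based calculation. This is a generic sketch, not necessarily the exact computation used: it assumes a binary feature (whether a sample includes a retained fragment overlapping the CpG site) and a binary label (target TOO versus the comparison group), and all names are hypothetical.

```python
import math


def entropy(n_a: int, n_b: int) -> float:
    """Shannon entropy (bits) of a two-class count."""
    total = n_a + n_b
    if total == 0:
        return 0.0
    h = 0.0
    for count in (n_a, n_b):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(target_with: int, target_without: int,
                     other_with: int, other_without: int) -> float:
    """Reduction in label entropy from knowing whether a sample has a fragment at the CpG site."""
    n = target_with + target_without + other_with + other_without
    if n == 0:
        return 0.0
    before = entropy(target_with + target_without, other_with + other_without)
    n_with = target_with + other_with
    n_without = target_without + other_without
    after = ((n_with / n) * entropy(target_with, other_with)
             + (n_without / n) * entropy(target_without, other_without))
    return before - after


# Hypothetical counts for two CpG sites: (target with, target without, other with, other without)
sites = {"cpg_a": (10, 0, 0, 10), "cpg_b": (5, 5, 5, 5)}
ranked = sorted(sites, key=lambda s: information_gain(*sites[s]), reverse=True)
```

  • In this example the first site separates the two groups perfectly (gain of 1 bit) and the second carries no information (gain of 0), so the first is ranked higher.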
  • CpG beta value indicating intensity of methylation was used to identify target genomic regions. This is because array data are not at CpG site levels, and thus they are prone to result in false positives. To avoid false positives, CpG sites were converted into 350 bp bins across the genome. Beta values of each bin were calculated as the mean of CpG beta values in that bin. Bins with fewer than 2 CpGs were excluded from the analysis.
  • bins were selected with beta difference of > 0.95 between (i) samples of a specific TOO and other samples, including both non-cancer samples and samples of a different TOO, (ii) samples of a specific TOO and non-cancer samples, and/or (iii) samples of a specific TOO and a different TOO that include fragments overlapping that CpG site.
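  • The binning of array beta values described above may be sketched as follows. The input layout (tuples of chromosome, position, beta), the function names, and the use of an absolute difference are assumptions of this example; the 350 bp bin size, the two-CpG minimum, and the 0.95 cutoff are taken from the description.

```python
from collections import defaultdict
from statistics import mean

BIN_SIZE = 350  # bp, as stated in the text


def bin_betas(cpg_betas, bin_size=BIN_SIZE, min_cpgs=2):
    """Map (chrom, bin index) -> mean beta, dropping bins with fewer than min_cpgs CpGs."""
    grouped = defaultdict(list)
    for chrom, position, beta in cpg_betas:
        grouped[(chrom, position // bin_size)].append(beta)
    return {key: mean(values) for key, values in grouped.items() if len(values) >= min_cpgs}


def select_bins(target_bins, contrast_bins, min_diff=0.95):
    """Bins present in both groups whose mean beta differs by more than min_diff."""
    shared = target_bins.keys() & contrast_bins.keys()
    return sorted(k for k in shared if abs(target_bins[k] - contrast_bins[k]) > min_diff)


# Hypothetical beta values for a target TOO and a contrast group.
target = bin_betas([("chr1", 10, 0.98), ("chr1", 200, 1.0), ("chr1", 400, 0.5)])
contrast = bin_betas([("chr1", 20, 0.02), ("chr1", 300, 0.0)])
selected = select_bins(target, contrast)
```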
  • the table identifies the cancer type detected, the total number of target genomic regions in the list, a range of SEQ ID NOs corresponding to all target genomic regions in the list to be found in the sequence listing submitted with this application, and a panel size (total of the lengths of all target genomic regions in the list).
  • the sequence listing identifies the chromosomal location of each target genomic region, whether cfDNA fragments to be enriched from the region are hypermethylated or hypomethylated, and the sequence of one DNA strand of the target genomic region.
  • the chromosome numbers and the start and stop positions are provided relative to a known human reference genome, hg19.
  • the sequence of the human reference genome, hg19, is available from Genome Reference Consortium with a reference number, GRCh37/hg19, and also available from Genome Browser provided by Santa Cruz Genomics Institute.
  • Additional cancer assay panels were designed to identify specific cancer types in a manner analogous to that set forth in Example 2.
  • Various lists of target genomic regions selected as described in this section are identified in TABLE 3 (see Lists 16-49).
  • the target genomic regions of Lists 16-32 contain subsets of the methylation sites of the target genomic regions of Lists 33-49, respectively.
  • the table identifies the cancer type detected, the total number of target genomic regions in the list, a range of SEQ ID NOs corresponding to all target genomic regions in the list to be found in the sequence listing submitted with this application, and a panel size (total of the lengths of all target genomic regions in the list).
  • the sequence listing identifies the chromosomal location of each target genomic region, whether cfDNA fragments to be enriched from the region are hypermethylated or hypomethylated, and the sequence of one DNA strand of the target genomic region.
  • the chromosome numbers and the start and stop positions are provided relative to a known human reference genome, hg19.
  • the sequence of the human reference genome, hg19, is available from Genome Reference Consortium with a reference number, GRCh37/hg19, and also available from Genome Browser provided by Santa Cruz Genomics Institute.
  • the predictive cancer models described in this Example were trained using sequence data obtained from a plurality of samples from known cancer types and non-cancers from both CCGA sub-studies (CCGA1 and CCGA2), a plurality of tissue samples for known cancers obtained from CCGA1, and a plurality of non-cancer samples from the STRIVE study (See ClinicalTrials.gov Identifier: NCT03085888 (//clinicaltrials.gov/ct2/show/NCT03085888)).
  • the STRIVE study is a prospective, multi-center, observational cohort study to validate an assay for the early detection of breast cancer and other invasive cancers, from which additional non-cancer training samples were obtained to train the classifier described herein.
  • a model can be a multi-cancer model (or a multi-cancer classifier) for detecting one or more, two or more, three or more, four or more, five or more, ten or more, or 20 or more different types of cancer.
  • the classifier performance data shown below was reported out for a locked classifier trained on cancer and non-cancer samples obtained from CCGA2, a CCGA sub-study, and on non-cancer samples from STRIVE.
  • the individuals in the CCGA2 sub-study were different from the individuals in the CCGA1 sub-study whose cfDNA was used to select target genomic regions.
  • blood samples were collected from individuals diagnosed with untreated cancer (including 20 tumor types and all stages of cancer) and healthy individuals with no cancer diagnosis (controls).
  • STRIVE blood samples were collected from women within 28 days of their screening mammogram.
  • cfDNA Cell-free DNA
  • the enriched bisulfite-converted nucleic acid molecules were sequenced using paired-end sequencing on an Illumina platform (San Diego, CA) to obtain a set of sequence reads for each of the training samples, and the resulting read pairs were aligned to the reference genome, assembled into fragments, and methylated and unmethylated CpG sites identified.
  • a probabilistic mixture model was trained and utilized to assign a probability to each fragment from each cancer and non-cancer sample based on how likely it was that the fragment would be observed in a given sample type.
  • the probabilistic model trained for each sample type was a mixture model, where each of three mixture components was an independent-sites model in which methylation at each CpG is assumed to be independent of methylation at other CpGs. Fragments were excluded from the model if: they had a p-value (from a non-cancer Markov model) greater than 0.01; were marked as duplicate fragments; the fragments had a bag size of greater than 1 (for targeted methylation samples only); they did not cover at least one CpG site; or if the fragment was greater than 1000 bases in length. Retained training fragments were assigned to a region if they overlapped at least one CpG from that region. If a fragment overlapped CpGs in multiple regions, it was assigned to all of them.
  • Each probabilistic model was fit using maximum-likelihood estimation to identify a set of parameters that maximized the log-likelihood of all fragments deriving from each sample type, subject to a regularization penalty.
  • for each classification region, a set of probabilistic models was trained, one for each training label (i.e., one for each cancer type and one for non-cancer).
  • Each model took the form of a Bernoulli mixture model with three components.
  • the product over i included only those positions for which a methylation state could be identified from the sequencing.
  • r is the regularization strength, which was set to 1.
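  • For illustration only, the likelihood of a fragment under a Bernoulli mixture model with independent sites may be sketched as below. The model equation itself does not appear in this passage, so the parameter names (`weights` for the mixture fractions, `betas` for per-component, per-CpG methylation probabilities) are assumptions; the regularization penalty and the maximum-likelihood fitting step are not shown.

```python
import math


def fragment_log_likelihood(states, weights, betas):
    """Log-likelihood of one fragment under a Bernoulli mixture.

    states:  per-CpG observations: 1 (methylated), 0 (unmethylated), None (not determined)
    weights: mixture fraction of each component (should sum to 1)
    betas:   betas[k][i] = probability that CpG i is methylated under component k
    """
    total = 0.0
    for weight, component in zip(weights, betas):
        likelihood = weight
        for m_i, beta in zip(states, component):
            if m_i is None:
                continue  # the product includes only positions with an identified state
            likelihood *= beta if m_i == 1 else (1.0 - beta)
        total += likelihood
    return math.log(total)


# Hypothetical three-component model over three CpG sites.
weights = [0.5, 0.3, 0.2]
betas = [[0.9, 0.5, 0.1], [0.5, 0.5, 0.5], [0.1, 0.5, 0.9]]
log_lik = fragment_log_likelihood([1, None, 0], weights, betas)
```

  • Because methylation at each CpG is assumed independent within a component, each component contributes a simple product over the observed sites, and the fragment likelihood is the weighted sum of those products.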
  • each feature was defined by three properties: a genomic region; a “positive” cancer type label (excluding non-cancer); and the tier value selected from the set {1, 2, 3, 4, 5, 6, 7, 8, 9}.
  • the numerical value of each feature was defined as the number of fragments in that region such that $\log \frac{P(\text{fragment} \mid \text{positive cancer type})}{P(\text{fragment} \mid \text{non-cancer})} > \text{tier}$, where the probabilities were defined by equation (1) using the maximum-likelihood-estimated parameter values corresponding to the “positive” cancer type (in the numerator of the logarithm) or to non-cancer (in the denominator).
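  • A minimal sketch of the tiered feature count follows, assuming the per-fragment log-likelihood ratios (positive cancer type over non-cancer) have already been computed and that a fragment is counted when its ratio is strictly greater than the tier value. The comparison direction and the function names are assumptions of this example.

```python
def tier_feature(log_ratios, tier):
    """Number of fragments in a region whose log-likelihood ratio exceeds the tier value."""
    return sum(1 for ratio in log_ratios if ratio > tier)


def region_features(log_ratios, tiers=range(1, 10)):
    """Feature values for one region and one positive label, one per tier in {1, ..., 9}."""
    return {tier: tier_feature(log_ratios, tier) for tier in tiers}


# Hypothetical log-likelihood ratios for four fragments in one region.
features = region_features([0.5, 1.5, 3.2, 9.7])
```

  • Higher tiers count only fragments with stronger evidence for the positive cancer type, so the feature values are non-increasing as the tier value grows.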
  • the features were ranked using mutual information based on their ability to distinguish the first cancer type (which defined the log-likelihood model from which the feature was derived) from the second cancer type or non-cancer.
  • two ranked lists of features were compiled for each unique pair of class labels: one with the first label assigned as the“positive” and the second as the“negative”, and the other with the positive/negative assignment swapped (with the exception of the“non-cancer” label, which was only permitted as the negative label).
  • the fraction of training samples with non-zero feature value was calculated separately for the positive and negative labels.
  • the training samples were then divided into distinct 5-fold cross-validation training sets, and a two-stage classifier was trained for each fold, in each case training on 4/5 of the training samples and using the remaining 1/5 for validation.
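  • The 5-fold cross-validation split described above may be sketched as follows. The shuffling, the fixed seed, and the round-robin fold assignment are assumptions of this example; any stratification by label that may have been applied is not shown.

```python
import random


def five_fold_splits(sample_ids, n_folds=5, seed=0):
    """Yield (training, validation) lists; each sample is used for validation exactly once."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, validation


splits = list(five_fold_splits(range(20)))
```

  • With 20 samples, each fold trains on 16 samples (4/5) and validates on the remaining 4 (1/5).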
  • a binary (two-class) logistic regression model for detecting the presence of cancer was trained to discriminate the cancer samples (regardless of TOO) from non-cancer.
  • a sample weight was assigned to the male non-cancer samples to counteract sex imbalance in the training set. For each sample, the binary classifier outputs a prediction score indicating the likelihood of a presence or absence of cancer.
  • the multi-class classifier outputs prediction values for the cancer types being classified, where each prediction value is a likelihood that the given sample has a certain cancer type.
  • the cancer classifier can return a cancer prediction for a test sample including a prediction score for breast cancer, a prediction score for lung cancer, and/or a prediction score for no cancer.
  • Scores assigned to the validation folds within the training set were retained for use in assigning cutoff values (thresholds) to target certain performance metrics.
  • the probability scores assigned to the training set non-cancer samples were used to define thresholds corresponding to particular specificity levels. For example, for a desired specificity target of 99.4%, the threshold was set at the 99.4th percentile of the cross-validated cancer detection probability scores assigned to the non-cancer samples in the training set. Training samples with a probability score that exceeded a threshold were called as positive for cancer.
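  • As an example of how a specificity-targeted threshold could be derived from non-cancer scores, the sketch below uses a nearest-rank percentile. Percentile conventions differ between implementations, so the nearest-rank rule, the rounding guard, and the function names are assumptions of this example.

```python
import math


def specificity_threshold(non_cancer_scores, specificity):
    """Smallest observed score that at least `specificity` of non-cancer scores do not exceed."""
    ordered = sorted(non_cancer_scores)
    if not ordered:
        raise ValueError("at least one non-cancer score is required")
    # Rounding guards against floating-point noise in specificity * n.
    rank = math.ceil(round(specificity * len(ordered), 9))
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def called_positive(score, threshold):
    """Samples whose probability score exceeds the threshold are called positive for cancer."""
    return score > threshold


# Hypothetical cross-validated scores for eight non-cancer samples.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
threshold = specificity_threshold(scores, 0.75)
false_positives = sum(called_positive(s, threshold) for s in scores)
```

  • In this toy example a 75% specificity target leaves two of the eight non-cancer samples above the threshold.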
  • a TOO or cancer type assessment was made from the multiclass classifier.
  • the multi-class logistic regression classifier assigned a set of probability scores, one for each prospective cancer type, to each sample.
  • the confidence of these scores was assessed as the difference between the highest and second-highest scores assigned by the multi-class classifier for each sample.
  • the cross-validated training set scores were used to identify the lowest threshold value such that of the cancer samples in the training set with top-two score differential exceeding the threshold, 90% had been assigned the correct TOO label as their highest score. In this way, the scores assigned to the validation folds during training were further used to determine a second threshold for distinguishing between confident and indeterminate TOO calls.
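  • The selection of the second threshold may be sketched as a search over candidate values of the top-two score differential. The candidate set (zero plus each observed differential), the function names, and the return of `None` when no candidate reaches the target are assumptions of this example.

```python
def top_two_margin(too_scores):
    """Difference between the highest and second-highest TOO probability scores."""
    ordered = sorted(too_scores.values(), reverse=True)
    return ordered[0] - ordered[1]


def lowest_confident_threshold(margins_and_correct, target_accuracy=0.90):
    """Lowest threshold such that samples with margin above it meet the target accuracy.

    margins_and_correct: list of (margin, top call was correct) for cancer samples.
    """
    candidates = sorted({0.0} | {margin for margin, _ in margins_and_correct})
    for threshold in candidates:
        kept = [correct for margin, correct in margins_and_correct if margin > threshold]
        if kept and sum(kept) / len(kept) >= target_accuracy:
            return threshold
    return None


# Hypothetical cross-validated results for five cancer samples.
results = [(0.05, False), (0.10, False), (0.30, True), (0.50, True), (0.70, True)]
margin_threshold = lowest_confident_threshold(results)
example_margin = top_two_margin({"lung": 0.75, "breast": 0.25, "colorectal": 0.0})
```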
  • samples receiving a score from the binary (first-stage) classifier below the predefined specificity threshold were assigned a“non-cancer” label.
  • those whose top-two TOO-score differential from the second-stage classifier was below the second predefined threshold were assigned the“indeterminate cancer” label.
  • the remaining samples were assigned the cancer label to which the TOO classifier assigned the highest score.
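  • The two-stage label assignment described above can be sketched as a single function. The function and parameter names are hypothetical, and the treatment of a score exactly equal to a threshold (here, not "below" it) is an assumption of this example.

```python
def assign_label(cancer_score, too_scores, detection_threshold, margin_threshold):
    """Return "non-cancer", "indeterminate cancer", or the top-scoring TOO label."""
    # Stage 1: binary cancer / non-cancer call.
    if cancer_score < detection_threshold:
        return "non-cancer"
    # Stage 2: confidence of the TOO call from the top-two score differential.
    ranked = sorted(too_scores.items(), key=lambda item: item[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]
    if margin < margin_threshold:
        return "indeterminate cancer"
    return ranked[0][0]


# Hypothetical scores and thresholds.
label_a = assign_label(0.2, {"lung": 0.8, "breast": 0.15, "colorectal": 0.05}, 0.5, 0.1)
label_b = assign_label(0.9, {"lung": 0.5, "breast": 0.45, "colorectal": 0.05}, 0.5, 0.1)
label_c = assign_label(0.9, {"lung": 0.8, "breast": 0.15, "colorectal": 0.05}, 0.5, 0.1)
```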
  • the discriminatory value of the target genomic regions of Lists 16-32 was evaluated by testing the ability of a cancer classifier to detect cancer and any of 20 different cancer types according to the methylation status of these target genomic regions. Performance was evaluated over a set of 1,532 cancer samples and 1,521 non-cancer samples that were not used to train the classifier, as shown in TABLE 4. For each sample, differentially methylated cfDNA was enriched using a bait set comprising all of the target genomic regions of Lists 16-32. The classifier was then constrained to provide cancer determinations based only on the methylation status of the target genomic regions of the List being evaluated.
  • Results from the classifier performance analysis for lists 16-32 are presented in TABLES 5-8.
  • An exemplary receiver operating characteristic (ROC) curve generated by a trained classifier is shown in FIGURE 13.
  • the ROC shows true positive results and false positive results for a determination of cancer or no-cancer based on the methylation status of the target genomic regions of list 23, optimized for lung cancer.
  • the asymmetric shape of the ROC curve illustrates that the classifier was designed to minimize false positive results. Except for list 28 (renal cancer) the areas under the curve are tightly clustered between 0.77 and 0.80, as shown in TABLE 5.
  • classifier performance was tested for randomly selected 50% subsets of the target genomic regions of list 20 (colorectal cancer), list 23 (lung cancer) and list 26 (pancreas and gall bladder cancer).
  • the areas under the ROC curve for these subsets of target genomic regions were also tightly clustered between 0.77 and 0.80, indicating that a determination of cancer is not detectably compromised by using smaller panels of less than 400 - 700 target genomic regions having a total panel size of less than 75 - 140 kb.
  • the classifier assigns the cancer to one of twenty distinct cancer types.
  • the accuracy of these determinations with a specificity of 0.990 is presented in various formats.
  • TABLE 5 shows true positives, false positives, and false negatives as scored based on the methylation status of each list of target genomic regions optimized for the detection of a specific cancer type.
  • A true positive occurs when the presence of cancer is detected and the cancer type is accurately determined.
  • A false positive occurs for samples from individuals diagnosed with the cancer type that the list was optimized for, when the presence of cancer is detected and an inaccurate cancer type is scored.
  • A false negative occurs for samples from individuals diagnosed with a different cancer type than the cancer type that the list was optimized for, when the presence of cancer is detected and the cancer type is inaccurately scored as the cancer type for which the list was optimized.
  • The cancer type determination results reflect the accuracy of determining all twenty cancer types, even though each list of target genomic regions was optimized to detect a single cancer type.
  • A classifier considering the methylation status of the target genomic regions of List 16 accurately detected anorectal cancer in 50% (2 out of 4) of the samples collected from individuals diagnosed with stage I anorectal cancer.
  • An overall sensitivity for all cancer stages of >70% was achieved for anorectal cancer, head & neck cancer, liver & bile duct cancer, ovarian cancer, pancreatic & gallbladder cancer, and upper gastrointestinal tract cancer.
  • The sensitivity for detecting stage I + II cancers was >50% for anorectal cancer, bladder & urothelial cancer, head & neck cancer, liver & bile duct cancer, and pancreatic & gallbladder cancer.
  • Sensitivity based on the methylation status of a randomly selected 50% of the target genomic regions for colorectal cancer, lung cancer, or pancreatic & gallbladder cancer was essentially identical to sensitivity using 100% of the corresponding target genomic regions.
  • Blood samples are collected from a group of individuals previously diagnosed with cancer of a TOO (“test group”), and from other groups of individuals without cancer or diagnosed with a different type of cancer (“other group”).
  • cfDNA fragments are extracted from the blood samples and treated with bisulfite to convert unmethylated cytosines to uracils.
  • The cancer assay panel described herein is applied to the bisulfite-treated samples. Unbound cfDNA fragments are washed away, and cfDNA fragments bound to the probes are collected.
  • The collected cfDNA fragments are amplified and sequenced. The sequence reads confirm that the probes specifically enrich cfDNA fragments having methylation patterns indicative of cancer of a TOO, and that samples from the test group include significantly more of the differentially methylated cfDNA fragments than samples from the other group.
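
The label-assignment rule described in the first three items of this list (non-cancer, indeterminate cancer, or the top-scoring TOO label) can be sketched roughly as follows. This is a minimal illustration and not the classifier described in the application: the function name `assign_label`, its parameter names, and all scores and threshold values are hypothetical, and the sketch assumes that the second-stage classifier returns one score per TOO label.

```python
from typing import Mapping


def assign_label(
    binary_score: float,
    too_scores: Mapping[str, float],
    specificity_threshold: float,
    differential_threshold: float,
) -> str:
    """Return "non-cancer", "indeterminate cancer", or a TOO label."""
    # Stage 1: a binary score below the specificity threshold is non-cancer.
    if binary_score < specificity_threshold:
        return "non-cancer"
    if len(too_scores) < 2:
        raise ValueError("need at least two TOO scores to compute a differential")
    # Stage 2: rank the TOO labels by score, highest first.
    ranked = sorted(too_scores.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    # A small gap between the top two scores means the call is ambiguous.
    if top_score - runner_up_score < differential_threshold:
        return "indeterminate cancer"
    return top_label


# Hypothetical scores and thresholds, for illustration only.
example_scores = {"lung": 0.70, "colorectal": 0.20, "renal": 0.10}
print(assign_label(0.95, example_scores,
                   specificity_threshold=0.90, differential_threshold=0.10))
```

Checking the binary score first mirrors the order of the rules in the text: the TOO scores are consulted only for samples that pass the first-stage threshold.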

Abstract

The present invention relates to a cancer assay panel for the targeted detection of cancer-specific methylation patterns. The invention further relates to methods of designing, making, and using this cancer assay panel to detect cancer tissue of origin (e.g., cancer types).
PCT/US2020/016684 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type Ceased WO2020163410A1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
EP24204942.7A EP4502178A3 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
CA3129043A CA3129043A1 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
CN202080025351.1A CN114026255B (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
ES20752248T ES2993312T1 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
AU2020219853A AU2020219853A1 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
EP20752248.3A EP3921444B1 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
IL285316A IL285316A (en) 2019-02-05 2021-08-02 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
US17/393,625 US20220098672A1 (en) 2019-02-05 2021-08-04 Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Applications Claiming Priority (12)

Application Number Priority Date Filing Date Title
US201962801556P 2019-02-05 2019-02-05
US201962801561P 2019-02-05 2019-02-05
US62/801,556 2019-02-05
US62/801,561 2019-02-05
US202062965327P 2020-01-24 2020-01-24
US202062965342P 2020-01-24 2020-01-24
US62/965,327 2020-01-24
PCT/US2020/015082 WO2020154682A2 (en) 2019-01-25 2020-01-24 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
USPCT/US2020/015082 2020-01-24
US62/965,342 2020-01-24
PCT/US2020/016673 WO2020163403A1 (en) 2019-02-05 2020-02-04 Detecting cancer, cancer tissue of origin, and/or a cancer cell type
USPCT/US2020/016673 2020-02-04

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/015082 Continuation-In-Part WO2020154682A2 (en) 2019-01-25 2020-01-24 Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/393,625 Continuation US20220098672A1 (en) 2019-02-05 2021-08-04 Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Publications (1)

Publication Number Publication Date
WO2020163410A1 true WO2020163410A1 (fr) 2020-08-13

Family

ID=71947082

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/016684 Ceased WO2020163410A1 (en) 2019-02-05 2020-02-05 Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Country Status (1)

Country Link
WO (1) WO2020163410A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410750B2 (en) 2018-09-27 2022-08-09 Grail, Llc Methylation markers and targeted methylation probe panel
IT202100021455A1 (en) * 2021-08-06 2023-02-06 Univ Degli Studi Cagliari Method for the diagnosis and/or prognosis of cancer of the biliary tract
PL440984A1 (en) * 2022-04-20 2023-10-23 Uniwersytet Medyczny W Lublinie Method for DNA amplification by polymerase chain reaction using primers specific for the ITGAM gene
WO2023225560A1 (en) 2022-05-17 2023-11-23 Guardant Health, Inc. Methods of identifying drug targets and methods of treating cancer
WO2024112946A1 (en) * 2022-11-22 2024-05-30 University Of Southern California Cell-free DNA methylation test for breast cancer
US12024750B2 (en) 2018-04-02 2024-07-02 Grail, Llc Methylation markers and targeted methylation probe panel
WO2024238698A3 (en) * 2023-05-15 2025-02-20 The Regents Of The University Of California Transcription factor 4 gene editing system
WO2025222814A1 (en) * 2024-04-22 2025-10-30 广州燃石医学检验所有限公司 Method for predicting an association relationship between a sample and a tumor, device, medium, and program

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070161031A1 (en) * 2005-12-16 2007-07-12 The Board Of Trustees Of The Leland Stanford Junior University Functional arrays for high throughput characterization of gene expression regulatory elements
WO2010037001A2 (en) 2008-09-26 2010-04-01 Immune Disease Institute, Inc. Selective oxidation of 5-methylcytosine by TET-family proteins
WO2011127136A1 (en) 2010-04-06 2011-10-13 University Of Chicago Compositions and methods related to the modification of 5-hydroxymethylcytosine (5-hmC)
US20130129668A1 (en) * 2011-09-01 2013-05-23 The Regents Of The University Of California Diagnosis and treatment of arthritis using epigenetics
WO2014026768A1 (en) * 2012-08-14 2014-02-20 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Colorectal cancer markers
WO2014043763A1 (en) 2012-09-20 2014-03-27 The Chinese University Of Hong Kong Non-invasive determination of a methylome of a fetus or a tumor from plasma
US20160340740A1 (en) 2014-01-30 2016-11-24 The Regents Of The University Of California Methylation haplotyping for non-invasive diagnosis (monod)
US20190287652A1 (en) 2018-03-13 2019-09-19 Grail, Inc. Anomalous fragment detection and classification

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
"AJCC Cancer Staging Manual", 2017, SPRINGER
DATABASE 0 [online] 7 March 2003 (2003-03-07), GENBANK ET AL., XP055728940, Database accession no. AC067721 *
DATABASE GenBank [online] 25 June 2002 (2002-06-25), XP055728937, Database accession no. AC093151.2 *
LIU, L. ET AL.: "Targeted methylation sequencing of plasma cell-free DNA for cancer detection and classification", ANNALS OF ONCOLOGY, vol. 29, no. 6, 1 June 2018 (2018-06-01), XP055910054, DOI: 10.1093/annonc/mdy119
PIDSLEY, R. ET AL.: "Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling", GENOME BIOLOGY, vol. 17, 2016, pages 208, XP055932440, DOI: 10.1186/s13059-016-1066-1
RIEDMILLER, M.; BRAUN, H.: "RPROP - A Fast Adaptive Learning Algorithm", PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON COMPUTER AND INFORMATION SCIENCE, 1992
See also references of EP3921444A4

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12024750B2 (en) 2018-04-02 2024-07-02 Grail, Llc Methylation markers and targeted methylation probe panel
US12435375B2 (en) 2018-04-02 2025-10-07 Grail, Inc. Methylation markers and targeted methylation probe panel
US11685958B2 (en) 2018-09-27 2023-06-27 Grail, Llc Methylation markers and targeted methylation probe panel
US12410482B2 (en) 2018-09-27 2025-09-09 Grail, Inc. Methylation markers and targeted methylation probe panel
US11725251B2 (en) 2018-09-27 2023-08-15 Grail, Llc Methylation markers and targeted methylation probe panel
US11795513B2 (en) 2018-09-27 2023-10-24 Grail, Llc Methylation markers and targeted methylation probe panel
US11410750B2 (en) 2018-09-27 2022-08-09 Grail, Llc Methylation markers and targeted methylation probe panel
WO2023012683A1 (en) * 2021-08-06 2023-02-09 Università Degli Studi Di Cagliari Method for the diagnosis and/or prognosis of cancer of the biliary tract
US12195809B2 (en) 2021-08-06 2025-01-14 Università Degli Studi Di Cagliari Method for the diagnosis and/or prognosis of cancer of the biliary tract
IT202100021455A1 (en) * 2021-08-06 2023-02-06 Univ Degli Studi Cagliari Method for the diagnosis and/or prognosis of cancer of the biliary tract
PL440984A1 (en) * 2022-04-20 2023-10-23 Uniwersytet Medyczny W Lublinie Method for DNA amplification by polymerase chain reaction using primers specific for the ITGAM gene
PL245974B1 (en) * 2022-04-20 2024-11-12 Univ Medyczny W Lublinie Method for DNA amplification by polymerase chain reaction using primers specific for the ITGAM gene
WO2023225560A1 (en) 2022-05-17 2023-11-23 Guardant Health, Inc. Methods of identifying drug targets and methods of treating cancer
WO2024112946A1 (en) * 2022-11-22 2024-05-30 University Of Southern California Cell-free DNA methylation test for breast cancer
WO2024238698A3 (en) * 2023-05-15 2025-02-20 The Regents Of The University Of California Transcription factor 4 gene editing system
WO2025222814A1 (en) * 2024-04-22 2025-10-30 广州燃石医学检验所有限公司 Method for predicting an association relationship between a sample and a tumor, device, medium, and program

Similar Documents

Publication Publication Date Title
EP3914736B1 (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
US12435375B2 (en) Methylation markers and targeted methylation probe panel
EP3921444B1 (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
CN113826167B (en) Model-based featurization and classification
EP3856903A1 (en) Methylation markers and targeted methylation probe panel
WO2020163410A1 (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
TW202436626A (en) Optimization of model-based featurization and classification
HK40065713B (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
HK40065713A (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
HK40063166A (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
HK40065120A (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type
HK40065348A (en) Detecting cancer, cancer tissue of origin, and/or a cancer cell type

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20752248; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 3129043; Country of ref document: CA
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2020219853; Country of ref document: AU; Date of ref document: 20200205; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2020752248; Country of ref document: EP; Effective date: 20210906
WWG WIPO information: grant in national office. Ref document number: 202080025351.1; Country of ref document: CN